Introduction
Text2SQL is a transformative AI technology that converts natural language questions into executable SQL queries, eliminating the need for database expertise. As of 2024-2025, breakthroughs in Retrieval-Augmented Generation (RAG), prompt engineering techniques (DIN-SQL, DAIL-SQL), and self-correction mechanisms have pushed execution accuracy on the Spider benchmark to 87.6%. Enterprises including Daangn Pay, IBM, and AWS have deployed Text2SQL in production systems, fundamentally democratizing data access across organizations.
TL;DR
Text2SQL automatically generates SQL queries from natural language questions. When a user asks in plain English—“What was our highest-revenue month last year?”—an LLM produces the corresponding SQL, fetches results from the database, and returns the answer. Recent advances in RAG technology and prompt engineering (DIN-SQL, DAIL-SQL) combined with self-correction mechanisms have achieved 87.6% execution accuracy on the Spider benchmark. Enterprise deployments by Daangn Pay and AWS demonstrate real-world impact on decision-making speed and data literacy. However, challenges remain in handling complex multi-table joins, domain-specific terminology, and schema hallucination—requiring custom fine-tuning per organization.
How Text2SQL Works: Architecture & Mechanisms
2.1 The Processing Pipeline
Text2SQL systems follow a structured pipeline:
Natural Language Input — User poses a question in plain language
- Example: “Show me all employees in the Marketing department who are 30 years old or older”
Intent Analysis — LLM parses the question to understand intent and extract key entities
Schema Linking — System identifies which database tables and columns are relevant
- Maps “Marketing department” → `employees.department`
- Maps “30 years or older” → `employees.age >= 30`
SQL Generation — LLM generates the SQL query based on schema and context
Self-Correction & Validation — System detects syntax errors, semantic inconsistencies, and data-type mismatches; automatically refines the query
Database Execution — Query runs against the database
Result Presentation — Visualized as tables, charts, or natural language summaries
Why it matters: This multi-stage approach reduces hallucinations and improves reliability compared to single-pass generation.
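Concretely, the pipeline reduces to a thin orchestration function. The sketch below is a minimal skeleton with stub classes standing in for the LLM client and database driver; every name in it is illustrative rather than a real API.

```python
# Minimal Text2SQL pipeline skeleton. StubLLM and StubDB are placeholders
# for a real LLM client and database driver; all names are illustrative.

class StubLLM:
    def generate_sql(self, question: str, schema: str) -> str:
        # A real system would prompt an LLM with the question + linked schema.
        return "SELECT name FROM employees WHERE department = 'Marketing' AND age >= 30;"

    def self_correct(self, sql: str, schema: str) -> str:
        # A real system would run syntax/type/semantic checks and repair here.
        return sql

class StubDB:
    schema = "employees(id, name, department, age)"

    def execute(self, sql: str) -> list:
        return [("Alice",), ("Bob",)]  # canned rows for illustration

def answer(question: str, llm=StubLLM(), db=StubDB()) -> list:
    schema = db.schema                        # steps 2-3: intent + schema linking
    sql = llm.generate_sql(question, schema)  # step 4: SQL generation
    sql = llm.self_correct(sql, schema)       # step 5: self-correction & validation
    return db.execute(sql)                    # steps 6-7: execution & results

print(answer("Show me all Marketing employees who are 30 or older"))
```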
2.2 The Role of Large Language Models (LLMs)
Modern Text2SQL relies on pre-trained large language models that understand both natural language and SQL syntax. The quality of SQL generation depends on:
- Code training data — Models trained on large code corpora (e.g., Codex, Code-LLaMA) perform better on SQL generation
- Context window size — Larger context allows more schema information and examples
- Reasoning capability — Chain-of-Thought (CoT) prompting helps break down complex queries
LLM Vendor Comparison (2025):
| Feature | Claude 3.5/3.7 Sonnet | GPT-4o | Gemini 2.5 Pro |
|---|---|---|---|
| SQL/Coding Quality | Excellent (senior engineer-level) | Excellent (prototyping-focused) | Good |
| Context Window | 200,000 tokens | 128,000 tokens | Large |
| Complex Multi-table Joins | Superior | Good | Good |
| API Cost (input) | $3/million tokens | Higher | Competitive |
| Speed | Moderate-Fast | Very Fast | Fast |
Key insight: Claude’s larger context window (200,000 tokens) is advantageous for complex schemas with many tables and columns, while GPT-4o excels in speed and image-based tasks. For domain-specific or heavily normalized databases (10+ tables), Claude 3.5+ is recommended; for simpler queries, GPT-3.5-Turbo suffices.
Why it matters: Choice of LLM directly impacts accuracy, latency, and cost. A poorly selected model can drive accuracy down to 60% even on moderate queries.
Advanced Techniques: RAG, Schema Linking & Self-Correction
3.1 Retrieval-Augmented Generation (RAG)
Traditional approach (pre-2024): Embed the entire database schema into the prompt—leading to token overflow, information loss, and decreased accuracy.
RAG approach (2024+): Store database metadata in a vector database and retrieve only the schema relevant to the user’s question.
How it works:
- Metadata Extraction — Table names, column definitions, data types, sample rows → Vector embeddings
- Dense Search — User question → Embedding → Cosine similarity search
- Selective Context — Only top-K relevant tables/columns inserted into LLM prompt
- Result: 30-50% reduction in token usage while improving accuracy
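A minimal sketch of the retrieval step, assuming the OpenAI Python SDK with an `OPENAI_API_KEY` in the environment and two illustrative table descriptions; a production system would persist the vectors in a vector database instead of recomputing them per process:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    # text-embedding-3-large vectors can be shortened via the `dimensions` arg.
    resp = client.embeddings.create(
        model="text-embedding-3-large", input=texts, dimensions=768
    )
    return np.array([d.embedding for d in resp.data])

# 1. Metadata extraction: table name + column docs + sample rows, embedded offline.
table_docs = {
    "employees": "employees(id, name, department, age) - HR master table",
    "invoices": "invoices(id, customer_id, total, billing_country) - billing facts",
}
doc_vecs = embed(list(table_docs.values()))

# 2-3. Dense search at query time: cosine similarity, keep only the top-K tables.
def top_k_tables(question: str, k: int = 1) -> list[str]:
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    names = list(table_docs)
    return [names[i] for i in np.argsort(sims)[::-1][:k]]

# Only the selected tables' DDL is inserted into the LLM prompt.
print(top_k_tables("Which employees in Marketing are 30 or older?"))
```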
HEARTCOUNT ABI Example (2024):
- Uses OpenAI’s text-embedding-3-large (reduced to 768 dimensions for efficiency)
- Applies cosine similarity to capture semantic meaning rather than vector magnitude
- Supports multilingual queries — training data in a single language carries over to other languages because the embeddings align semantically across languages
Why it matters: RAG enables scaling to enterprise-scale databases (100+ tables, 1000+ columns) without accuracy degradation.
3.2 Prompt Engineering: DIN-SQL & DAIL-SQL
DIN-SQL (Decomposed In-Context Learning)
Problem: Text-to-SQL is inherently complex; generic prompting fails on hard queries.
Solution: Decompose the task into 4 sub-stages, using Chain-of-Thought (CoT) examples for each:
- Schema Linking (10 CoT examples) — Identify relevant tables/columns
- Classification & Decomposition (10 CoT examples) — Classify query difficulty, break into sub-tasks
- SQL Generation (10 CoT examples) — Generate SQL per sub-task
- Self-Correction (Zero-shot) — Fix errors without examples
Results: Achieved SOTA on Spider benchmark, execution accuracy 91%+. Successfully handles joins, nested queries, and aggregations that simpler prompts fail on.
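The decomposition is easy to express as four sequential LLM calls. The hedged sketch below compresses what the paper implements as roughly ten CoT demonstrations per stage into placeholder strings; `ask` stands in for any chat-completion call:

```python
# Placeholders for the paper's in-context demonstrations (~10 CoT examples each).
SCHEMA_LINKING_DEMOS = "..."
CLASSIFICATION_DEMOS = "..."
GENERATION_DEMOS = "..."

def din_sql(question: str, schema: str, ask) -> str:
    """ask(prompt: str) -> str wraps any LLM chat-completion API."""
    # Stage 1: schema linking - identify the tables/columns the question references.
    links = ask(f"{SCHEMA_LINKING_DEMOS}\nSchema: {schema}\nQ: {question}\nSchema links:")
    # Stage 2: classify difficulty (easy / non-nested / nested) and decompose.
    plan = ask(f"{CLASSIFICATION_DEMOS}\nQ: {question}\nLinks: {links}\nClass and sub-questions:")
    # Stage 3: generate SQL conditioned on the plan.
    sql = ask(f"{GENERATION_DEMOS}\nSchema: {schema}\nQ: {question}\nPlan: {plan}\nSQL:")
    # Stage 4: zero-shot self-correction (no demonstrations).
    return ask(f"Fix any bugs in this query for the schema below, or return it "
               f"unchanged.\nSchema: {schema}\nSQL: {sql}")
```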
DAIL-SQL (Data-Aware In-Context Learning)
Innovation: Intelligently selects few-shot examples by combining two retrieval strategies:
- Query Similarity Selection — First, generate a draft SQL, then find training queries with similar SQL structure
- Masked Question Similarity Selection — From those results, select questions most semantically similar to the user’s question
- Example Organization — Test three formats: Full-Information, SQL-Only, DAIL
Result: Reduced context size while maintaining or improving accuracy compared to full schema dumps.
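A rough approximation of DAIL-style selection in plain Python: a deliberately crude SQL-skeleton heuristic plays the role of query-similarity filtering, and precomputed embeddings of masked questions drive the final ranking. A faithful implementation would follow the paper's masking and embedding choices:

```python
import numpy as np

SQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "ORDER", "BY", "JOIN", "ON",
                "AND", "OR", "LIMIT", "HAVING", "DESC", "ASC"}

def sql_skeleton(sql: str) -> str:
    """Crude skeleton: keep keywords, replace identifiers/literals with '_'."""
    tokens = sql.replace(",", " ").replace(";", " ").split()
    return " ".join(t.upper() if t.upper() in SQL_KEYWORDS else "_" for t in tokens)

def select_examples(draft_sql: str, masked_q_vec: np.ndarray, pool: list, k: int = 5):
    """pool entries: {"sql": str, "question_vec": np.ndarray (masked question)}."""
    target = sql_skeleton(draft_sql)
    # 1. Query-similarity filter: keep examples whose skeleton matches the draft.
    cands = [ex for ex in pool if sql_skeleton(ex["sql"]) == target] or pool
    # 2. Masked-question similarity: rank survivors by cosine similarity.
    def cos(ex):
        v = ex["question_vec"]
        return float(v @ masked_q_vec / (np.linalg.norm(v) * np.linalg.norm(masked_q_vec)))
    return sorted(cands, key=cos, reverse=True)[:k]
```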
Why it matters: Prompt engineering techniques reduce LLM errors by 10-20% without retraining, making them cost-effective for enterprises.
3.3 Self-Correction & Multi-Stage Validation
Generated SQL frequently contains errors; modern systems implement automatic validation and repair loops.
Daangn Pay’s “BroQuery” Architecture (2025):
- Intent Analysis — LLM determines if SQL generation is needed or if a different response is appropriate
- Context Enrichment — Use hybrid search (keyword + semantic) via Amazon OpenSearch + Titan Embeddings
- SQL Generation — LLM creates SQL with rich context
- Self-Reflection — Rule-based validation:
- Syntax check via SQL parser
- Data-type validation
- Semantic reasonableness checks
- Self-Correction Loop — If errors detected, regenerate (max N retries)
- Result Visualization — Tables, charts, or summaries returned to user via Slack
Implemented using LangGraph to orchestrate complex branching workflows.
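LangGraph handles the branching, but the core loop itself is small. The sketch below mirrors its shape rather than reproducing Daangn Pay's code: sqlglot supplies the syntax check (the Redshift dialect is assumed), and the data-type and semantic checks are elided:

```python
import sqlglot

MAX_RETRIES = 3  # bounded retries, as in the "max N retries" rule above

def syntax_errors(sql: str, dialect: str = "redshift") -> list[str]:
    """Rule-based syntax validation via the sqlglot parser."""
    try:
        sqlglot.parse_one(sql, read=dialect)
        return []
    except sqlglot.errors.ParseError as exc:
        return [str(exc)]

def generate_with_repair(question: str, generate_sql) -> str:
    """generate_sql(question, feedback) -> str is any LLM-backed generator;
    validator feedback is threaded into the next attempt."""
    feedback = ""
    for _ in range(MAX_RETRIES):
        sql = generate_sql(question, feedback)
        errors = syntax_errors(sql)
        if not errors:
            return sql
        feedback = "The previous query failed validation: " + "; ".join(errors)
    raise RuntimeError("no valid SQL after retries")
```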
Text2SQL Error Taxonomy & Solutions
Common Errors & Frequency
| Error Type | Description | Frequency | Solution |
|---|---|---|---|
| Schema Errors | Non-existent tables/columns, typos, hallucination | High | RAG + Schema validation |
| Syntax Errors | Invalid SQL syntax, wrong function usage | Medium | SQL parser + Self-correction |
| Semantic Errors | Misinterpretation of user intent | High | Better CoT prompting + examples |
| Value Errors | Wrong WHERE/HAVING conditions | Medium | Data-aware few-shot examples |
| Skeleton Errors | Wrong SELECT/JOIN/GROUP BY structure | High | Decomposition + validation |
| Aggregation Errors | Incorrect COUNT, SUM, etc. | Low | Schema-aware hints |
| Sorting/Limiting Errors | ORDER BY, LIMIT mistakes | Low | Post-generation checks |
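Schema errors sit at the top of the frequency column and are also the cheapest to catch deterministically. A hedged sketch of post-generation schema validation with sqlglot follows; the schema dict and dialect are illustrative, and in production they would be loaded from `information_schema`:

```python
import sqlglot
from sqlglot import exp

SCHEMA = {  # known tables -> columns; illustrative
    "employees": {"id", "name", "department", "age"},
    "invoices": {"id", "customer_id", "total"},
}

def schema_errors(sql: str, dialect: str = "postgres") -> list[str]:
    """Flag tables/columns in the query that do not exist in SCHEMA."""
    errors = []
    tree = sqlglot.parse_one(sql, read=dialect)
    for t in tree.find_all(exp.Table):
        if t.name not in SCHEMA:
            errors.append(f"hallucinated table: {t.name}")
    known_columns = set().union(*SCHEMA.values())
    for c in tree.find_all(exp.Column):
        if c.name not in known_columns:
            errors.append(f"hallucinated column: {c.name}")
    return errors

print(schema_errors("SELECT salary FROM employees"))
# ['hallucinated column: salary']
```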
Current Accuracy Improvements:
- Spider Benchmark: PET-SQL achieved 87.6% execution accuracy (vs. 86.5% for previous SOTA)
- DIN-SQL: 91%+ on certain subsets
- Real-world deployments: 75-85% due to domain-specific complexities
Why it matters: Even with advanced techniques, Text2SQL is not 100% reliable. Organizations must implement human-in-the-loop review for critical queries.
Production Deployments: Real-World Case Studies
Daangn Pay “BroQuery” (June 2025)
Scale: Hundreds of tables and thousands of columns across an Amazon Redshift data warehouse
Technical Stack:
- LLM Backend: Amazon Bedrock (Claude models)
- Vector Store: Amazon OpenSearch with text-embedding-3-large
- Metadata Management: Custom schema enrichment pipeline
- Orchestration: LangGraph for agent workflows
- Execution: Amazon Redshift, Slack integration
Key Innovations:
ENUM Value Enrichment — Extract enum meanings from application source code and internal wikis, not just the database
- Example: `payment_status = 1` is automatically documented as “Payment Completed”
Business Glossary Integration — Centralized definitions of metrics
- Example: “MAU” = “Unique users who opened app in last 30 days, login-based, iOS/Android combined, as of daily batch”
Hybrid Search — Combine lexical (keyword) + semantic (vector) search for a 20-30% accuracy boost (see the sketch after this list)
Conversation Context — Store previous Q&A pairs in DynamoDB for multi-turn understanding
- Example: User asks “monthly revenue?” → system stores context → user follows with “same period last year?” → system understands temporal reference
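The AWS write-ups do not publish BroQuery's fusion logic; the sketch below shows generic reciprocal rank fusion (RRF), one common way to merge a lexical ranking with a semantic one. The table names are illustrative:

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge ranked ID lists; k=60 is the conventional RRF smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["payments", "orders", "users"]     # lexical (keyword/BM25) ranking
semantic_hits = ["orders", "payments", "refunds"]  # semantic (embedding) ranking
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
# ['payments', 'orders', 'users', 'refunds'] (ties broken by insertion order)
```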
Results: Data analysis requests surged post-deployment; decision-making latency reduced from days to minutes.
AWS/Databricks Text2SQL Initiative (2024)
- Provides evaluation toolkit benchmarked against Spider dataset
- Focuses on performance optimization for large-scale analytics
IBM Text2SQL (August 2025)
- Announced AI-powered Text2SQL to democratize data access
- Emphasis on enterprise security and governance
Why it matters: Enterprise deployments reveal that technical excellence ≠ business success. Success requires data governance maturity: schema documentation, business term definitions, metadata quality, and organizational buy-in.
Benchmarks: Spider, BIRD & Beyond
Spider Benchmark (Standard)
- Dataset: 200 databases, 8,659 training samples, 1,034 dev, 2,147 test samples
- Evaluation Metric: Execution Accuracy (EX) — does the query’s result match ground truth?
- Current SOTA:
- PET-SQL: 87.6%
- DIN-SQL: 91%+
BIRD Benchmark (Realism-Focused)
- Simulates real-world scenarios with noisy schema names and complex joins
- Emphasizes nested queries and multi-table operations
Benchmark Limitations
Reddit community debate (October 2024):
“Existing Text-to-SQL benchmarks, including Spider, fail to reflect real-world performance. High benchmark scores often indicate overfitting; production systems frequently perform 10-20% worse.”
Key insight: Benchmark accuracy ≠ Production accuracy. Organizations must build custom gold-standard datasets and evaluate continuously.
Why it matters: A system scoring 85% on Spider might achieve only 65% on your specific database due to domain-specific terminology, schema complexity, and data quality differences.
Practical Implementation: Python + LangChain
4.1 Installation & Setup
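A minimal setup, assuming the post-0.1 LangChain package split and an OpenAI backend; the key value is a placeholder:

```bash
pip install langchain langchain-community langchain-openai
export OPENAI_API_KEY="sk-..."  # placeholder; use your own key
```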
4.2 Inspect Database Schema
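A sketch using LangChain's `SQLDatabase` wrapper; any SQLAlchemy URI works, and the Chinook SQLite sample database is assumed here:

```python
from langchain_community.utilities import SQLDatabase

# Connect via a SQLAlchemy URI; Chinook.db is a widely used demo database.
db = SQLDatabase.from_uri("sqlite:///Chinook.db")

print(db.dialect)                   # "sqlite"
print(db.get_usable_table_names())  # tables the chain is allowed to reference
print(db.get_table_info())          # CREATE TABLE DDL plus a few sample rows
```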
4.3 Generate SQL from Natural Language
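Continuing from the snippet above, `create_sql_query_chain` builds a dialect-aware prompt containing the schema and returns the generated SQL as a string; the model choice is illustrative:

```python
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain

llm = ChatOpenAI(model="gpt-4o", temperature=0)  # temperature=0 for repeatability
chain = create_sql_query_chain(llm, db)          # injects dialect + schema into the prompt

sql = chain.invoke({"question": "Which 5 artists have the most albums?"})
print(sql)  # e.g. SELECT ... GROUP BY ... ORDER BY ... LIMIT 5
```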
4.4 Execute & Retrieve Results
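To execute the generated query, langchain-community ships a small tool (the class name below is the 0.2-era spelling; newer releases also expose `QuerySQLDatabaseTool`), or you can call `db.run` directly:

```python
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool

execute_query = QuerySQLDataBaseTool(db=db)  # runs SQL against the wrapped DB
print(execute_query.invoke(sql))

print(db.run(sql))  # equivalent low-level call
```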
4.5 Production-Ready Pattern: Multi-Query Testing
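A minimal regression harness, continuing from the snippets above: run a fixed question set, execute each generated query, and count passes. The questions and expected substrings are illustrative; replace them with a gold set for your own schema:

```python
test_cases = [
    {"question": "How many customers are there?", "expect": "59"},
    {"question": "Top 3 countries by total invoice amount?", "expect": "USA"},
]

def run_suite(chain, db, cases) -> None:
    passed = 0
    for case in cases:
        try:
            sql = chain.invoke({"question": case["question"]})
            result = str(db.run(sql))
            ok = case["expect"] in result
        except Exception as exc:  # syntax errors, bad identifiers, timeouts
            ok, result = False, repr(exc)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'} | {case['question']} | {result[:80]}")
    print(f"{passed}/{len(cases)} passed")

run_suite(chain, db, test_cases)
```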
4.6 Adding Few-Shot Examples for Complex Queries
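`create_sql_query_chain` accepts a custom prompt, so demonstrations can be spliced in with `FewShotPromptTemplate`; the prompt must expose the `input`, `top_k`, and `table_info` variables that the chain fills in. The example pairs below are illustrative:

```python
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

examples = [  # hand-picked pairs; DAIL-SQL-style retrieval could supply these per question
    {"input": "Total revenue per billing country, highest first.",
     "query": "SELECT BillingCountry, SUM(Total) AS revenue FROM Invoice "
              "GROUP BY BillingCountry ORDER BY revenue DESC;"},
    {"input": "Which support rep serves the most customers?",
     "query": "SELECT SupportRepId, COUNT(*) AS n FROM Customer "
              "GROUP BY SupportRepId ORDER BY n DESC LIMIT 1;"},
]

example_prompt = PromptTemplate.from_template("Question: {input}\nSQL: {query}")

prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix=("You are a {dialect} expert. Given the table info below, write one "
            "syntactically correct query. Return at most {top_k} rows unless the "
            "question asks otherwise.\n\n{table_info}\n\nExamples:"),
    suffix="Question: {input}\nSQL:",
    input_variables=["input", "top_k", "table_info", "dialect"],
)

few_shot_chain = create_sql_query_chain(llm, db, prompt=prompt)
print(few_shot_chain.invoke({"question": "Average invoice total per country?"}))
```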
Production Recommendations:
- Start with GPT-3.5-Turbo for simple queries (~80% accuracy)
- Upgrade to GPT-4 or Claude 3.5 for complex joins (3+ tables)
- Implement schema documentation in table/column comments
- Build custom evaluation set (20-50 representative queries) and track accuracy over time
- Use LangGraph for production orchestration; reserve simple chains for prototyping
Why it matters: Production systems require error handling, monitoring, and continuous refinement—not just basic Text-to-SQL chains.
Enterprise Challenges & Solutions
Remaining Challenges (2025)
Schema Hallucination — LLM invents non-existent tables or columns
- Solution: Strict schema validation before execution
Complex Joins — 5+ table joins see significant accuracy drop
- Solution: Decomposition-based prompting (DIN-SQL), schema pre-filtering
Domain Terminology — Medical/legal/financial jargon misinterpretation
- Solution: Custom glossary integration, domain-specific fine-tuning
Multilingual Support — Non-English queries underperform English
- Solution: Multilingual embeddings (text-embedding-3-large supports 100+ languages)
Real-time Constraints — Latency requirements (sub-second queries)
- Solution: Query caching, speculative decoding, precomputed materialized views
Emerging Solutions
- Schema Retrieval Optimization: Separate indices for tables, columns, and few-shot examples; use FAISS for dense search
- Adaptive Prompting: Auto-adjust prompt complexity based on query difficulty
- Supervised Fine-Tuning (SFT): Fine-tune models on organization-specific datasets for 5-10% accuracy gains
- Data Augmentation: Text2SQL-Flow framework generates diverse training data from seed examples
- Cost Prediction: Pre-execution query cost estimation to prevent expensive operations
- Autonomous Data Governance: Auto-validate schema changes, metadata consistency
Why it matters: Industry-specific and organization-specific solutions are becoming essential. Generic Text2SQL models plateau at 80-85% accuracy; reaching 90%+ requires domain customization.
Future Directions & 2025+ Outlook
Technological Evolution
LLM Improvements:
- Claude 3.7 Sonnet (2025): 200K context, superior SQL reasoning
- GPT-4o Advanced: Real-time streaming, better multimodal reasoning
- Gemini 2.5 Pro: Competitive multi-hop reasoning
- Open-weight alternatives: DeepSeek, Llama, Mixtral gaining traction for cost-conscious enterprises
Architecture Enhancements:
- Adaptive Few-shot Learning: Model dynamically selects examples per query difficulty
- Multimodal Understanding: Parse ERD diagrams, photos of whiteboards
- Federated Learning: Improve models across organizations without sharing raw data
- Autonomous SQL Agents: Move beyond query generation to multi-step analysis automation
Expected Impact (Next 12-24 Months)
Accuracy: 95%+ on public benchmarks; 85-90% in production (with custom fine-tuning)
Time-to-Insight: From days → minutes for routine analyses
User Adoption: Every employee capable of self-service analytics → data literacy gap closes
Cost: 50-80% reduction in analytics team burden
New Business Models: Text-to-SQL SaaS platforms proliferate; vertical-specific solutions (FinStat2SQL for finance, MedQuery for healthcare)
Why It Matters
Text2SQL is not merely a technical convenience—it represents a paradigm shift in data accessibility. Organizations that master Text2SQL by 2026 will out-compete those relying on traditional BI/analytics teams. The bottleneck shifts from “Who can query the database?” to “Who can ask the right questions?”
Conclusion
Text2SQL has transitioned from research curiosity to enterprise necessity. Driven by RAG technology, advanced prompt engineering (DIN-SQL, DAIL-SQL), and self-correction mechanisms, modern systems achieve 87.6% execution accuracy—sufficient for many use cases. Deployment by Daangn Pay, AWS, and IBM validates business viability.
Key Takeaways:
- Natural Language ≠ Automatic Excellence: Text2SQL requires careful architecture, including schema linking, semantic search, and validation layers
- LLM Choice Matters: Claude excels on complex schemas; GPT-4o on speed; choose wisely
- Benchmark ≠ Production: Custom evaluation datasets and continuous monitoring are essential
- Data Governance is Prerequisite: Schema documentation, glossaries, and metadata quality directly impact accuracy
- Domain Customization Pays Off: Generic models plateau; fine-tuning on organization-specific data yields 5-10% gains
- Future is Agentic: Next-generation systems will move beyond query generation to autonomous multi-step analysis
Recommended Next Steps for Organizations:
- Pilot Phase: Deploy on 2-3 non-critical tables; establish baseline accuracy
- Governance: Build data glossary, document schema rigorously
- Fine-tuning: Collect 100+ human-validated query pairs; fine-tune on Claude or open-weight models
- Production Rollout: Implement with LangGraph, monitoring, and fallback mechanisms
- Continuous Improvement: Monthly retraining on production queries; quarterly model upgrades
Summary
- Definition: Text2SQL converts natural language queries into executable SQL
- Technology Stack: LLMs (Claude, GPT-4), RAG, vector databases, prompt engineering
- Performance: 87.6% accuracy on benchmarks; 80-90% in real-world deployments
- Enterprises Deploying: Daangn Pay, AWS, IBM, Google
- Key Advantage: Data democratization; non-technical users become self-sufficient analysts
- Remaining Challenges: Complex joins, domain terminology, schema hallucination
- Future: Multimodal AI, autonomous agents, federated learning
Recommended Hashtags
#Text2SQL #LLM #DataEngineering #RAG #SQLGeneration #NLP #DataDemocratization #LangChain #PromptEngineering #AI
References
- What is Text-to-SQL? | HEARTCOUNT Blog | 2025-07-23 | https://www.heartcount.io/blog/text-to-sql
- More Accurate Text-to-SQL 2.0, Now with a Side of RAG… | HEARTCOUNT Blog | 2025-07-23 | https://www.heartcount.io/blog/text-to-sql-2-rag
- LLM-Based Text-to-SQL: A Survey | 데로스요초 | 2024-08-28 | https://derosyocho.tistory.com/
- Why Does Text2SQL Keep Getting It Wrong? | Brunch | 2025-08-29 | https://brunch.co.kr/
- Easily Improving Text2SQL Performance on Databricks | Databricks | 2024 | https://www.databricks.com/
- Text To SQL: Technology That Automatically Converts Text into SQL | HEARTCOUNT Blog | 2025-07-23 | https://www.heartcount.io/blog/text-to-sql-tech
- [PET-SQL Paper Review] Improved SQL Query Generation via Prompt Design and Schema Awareness | X2bee (Tistory) | 2024-04-03 | https://x2bee.tistory.com/
- Amid a Daily Flood of New Techniques, How Is Text-to-SQL… | HEARTCOUNT Blog | 2025-07-23 | https://www.heartcount.io/blog/text-to-sql-embedding
- IBM Launches AI-Powered Text2SQL to Democratize and Accelerate Data Access | IBM | 2025-08-27 | https://www.ibm.com/
- Text 2 SQL Benchmarks | Reddit (r/MachineLearning) | 2024-10-26 | https://www.reddit.com/r/MachineLearning/
- [Python/LangChain Basics] A Simple Text To SQL Implementation | Velog | 2024 | https://velog.io/
- Designing NL2SQL (Text-to-SQL) Agents | Brunch | 2025-09-06 | https://brunch.co.kr/
- Text2SQL-Flow Paper Review | Themoonlight (Tistory) | 2025-11-14 | https://themoonlight.tistory.com/
- Daangn Pay's Text-to-SQL on Amazon Bedrock | AWS Blog | 2025-06-26 | https://aws.amazon.com/ko/blogs/tech/daangnpay-text-to-sql-amazon-bedrock/
- Decomposed In-Context Learning of Text-to-SQL (DIN-SQL) | YouTube | 2024-01-29 | https://www.youtube.com/
- Claude vs. ChatGPT: A Comparison | MakeBot | 2025-11-25 | https://makebot.ai/
- Text-to-SQL Papers on the BIRD Benchmark (DAIL-SQL) | 데로스요초 | 2024-10-08 | https://derosyocho.tistory.com/
- 2025 Performance Analysis of the Latest AI Models | Tilnote | 2025-04-10 | https://tilnote.io/
- Daangn Pay's Text-to-SQL on Amazon Bedrock (Part 2) | AWS Blog | 2025-06-26 | https://aws.amazon.com/ko/blogs/tech/daangnpay-text-to-sql-amazon-bedrock-part2/
- Claude vs. ChatGPT Coding Showdown | GLB GPT | 2025-11-22 | https://glbgpt.com/