Data & Analytics

Text-to-SQL analyst assistant

Convert questions into SQL and explain the query.

task.text_to_sql · task.sql_debugging

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong winner claim.

Provisional leader

gpt-5-2025-08-07

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

25.0%

Best benchmark score

34.3%

Confidence

All ranked models — top 3

🥇

gpt-5-2025-08-07

25.0%

🥈

o3-20250416

21.4%

🥉

deepseek-r1

21.2%

Ranked Models

Evidence Quality

81%

Evidence Points

Top Signal

Spider2.0 Snow Text-to-SQL: snow_text_to_sql_score_pct

All Ranked Models

30 of 30 models

Rank	Model	Score	Confidence	Price / 1M	Evidence sources
🥇	gpt-5-2025-08-07: Strong on Spider2.0 Snow Text-to-SQL (snow_text_to_sql_score_pct) and LiveSQLBench (success_rate_pct)	25.0%	34%	—	Spider2.0 Snow Text-to-SQL, LiveSQLBench
🥈	o3-20250416: Strong on Spider2.0 Snow Text-to-SQL (snow_text_to_sql_score_pct) and LiveSQLBench (success_rate_pct)	21.4%	34%	$3.50	Spider2.0 Snow Text-to-SQL, LiveSQLBench
🥉	deepseek-r1: Strong on DuckDB NSQL Leaderboard (all_execution_accuracy) and DuckDB NSQL Leaderboard (hard_execution_accuracy)	21.2%	37%	$0.27	DuckDB NSQL Leaderboard
#4	gemini-3.1-pro-preview: Strong on Vals Finance Agent (overall_accuracy_pct) and FACTS Benchmark Suite (facts_search_score_pct)	20.7%	24%	$4.50	Vals Finance Agent, FACTS Benchmark Suite
#5	gpt-4o: Strong on DuckDB NSQL Leaderboard (all_execution_accuracy) and JSONSchemaBench Leaderboard (medium_schema_compliance_pct)	20.3%	41%	$0.26	DuckDB NSQL Leaderboard, JSONSchemaBench Leaderboard
#7	claude-sonnet-4: Strong on LiveSQLBench (success_rate_pct) and Spider2.0 Lite Text-to-SQL (lite_text_to_sql_score_pct)	19.8%	37%	$6.00	LiveSQLBench, Spider2.0 Lite Text-to-SQL
#8	gpt-4o-20241120: Strong on DuckDB NSQL Leaderboard (all_execution_accuracy) and DuckDB NSQL Leaderboard (hard_execution_accuracy)	18.8%	35%	—	DuckDB NSQL Leaderboard
#9	qwen-2.5-72b-instruct: Strong on DuckDB NSQL Leaderboard (all_execution_accuracy) and JSONSchemaBench Leaderboard (medium_schema_compliance_pct)	17.7%	28%	—	DuckDB NSQL Leaderboard, JSONSchemaBench Leaderboard
#11	gemini-2.5-pro: Strong on FACTS Benchmark Suite (facts_grounding_score_pct) and Vectara HHEM Leaderboard (overall_hallucination_error_pct)	17.4%	26%	$3.44	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#12	gpt-5-mini-2025-08-07: Strong on Vals Finance Agent (overall_accuracy_pct) and Vals CorpFin v2 (overall_accuracy_pct)	16.4%	26%	—	Vals Finance Agent, Vals CorpFin v2
#13	gemini-3-flash-preview: Strong on Vals CorpFin v2 (overall_accuracy_pct) and FACTS Benchmark Suite (facts_grounding_score_pct)	15.8%	21%	$1.13	Vals CorpFin v2, FACTS Benchmark Suite
#14	Grok-4-0709: Strong on Vals CorpFin v2 (overall_accuracy_pct) and Vals Finance Agent (overall_accuracy_pct)	15.0%	22%	—	Vals CorpFin v2, Vals Finance Agent
#15	gemini-3-pro-preview: Strong on Vals Finance Agent (overall_accuracy_pct) and Vals CorpFin v2 (overall_accuracy_pct)	14.8%	20%	$4.50	Vals Finance Agent, Vals CorpFin v2
#16	gpt-5.2-2025-12-11: Strong on FACTS Benchmark Suite (facts_grounding_score_pct) and Vals CorpFin v2 (overall_accuracy_pct)	14.6%	17%	—	FACTS Benchmark Suite, Vals CorpFin v2
#17	gemini-3.1-flash-lite-preview: Strong on FACTS Benchmark Suite (facts_grounding_score_pct) and Vectara HHEM Leaderboard (overall_hallucination_error_pct)	14.6%	21%	$0.56	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#18	claude-sonnet-4.6: Strong on Vals Finance Agent (overall_accuracy_pct) and Vals CorpFin v2 (overall_accuracy_pct)	14.4%	18%	$6.00	Vals Finance Agent, Vals CorpFin v2
#19	Claude-3.5-Sonnet: Strong on DuckDB NSQL Leaderboard (all_execution_accuracy) and LLM-AggreFact Leaderboard (average_score_pct)	13.9%	23%	$6.00	DuckDB NSQL Leaderboard, LLM-AggreFact Leaderboard
#21	gpt-5.4-2026-03-05: Strong on Vectara HHEM Leaderboard (overall_hallucination_error_pct) and Vals CorpFin v2 (overall_accuracy_pct)	13.6%	17%	—	Vectara HHEM Leaderboard, Vals CorpFin v2
#22	gpt-4o-2024-08-06: Strong on DuckDB NSQL Leaderboard (all_execution_accuracy) and Vectara HHEM Leaderboard (overall_hallucination_error_pct)	13.5%	27%	—	DuckDB NSQL Leaderboard, Vectara HHEM Leaderboard
#25	claude-opus-4-5-20251101: Strong on FACTS Benchmark Suite (facts_grounding_score_pct) and Vals CorpFin v2 (overall_accuracy_pct)	13.0%	19%	—	FACTS Benchmark Suite, Vals CorpFin v2
#26	deepseek-v3: Strong on LiveSQLBench (success_rate_pct) and Vectara HHEM Leaderboard (overall_hallucination_error_pct)	12.9%	30%	—	LiveSQLBench, Vectara HHEM Leaderboard
#27	gpt-4.1-20250414: Strong on Vectara HHEM Leaderboard (overall_hallucination_error_pct) and Vals CorpFin v2 (overall_accuracy_pct)	12.8%	19%	—	Vectara HHEM Leaderboard, Vals CorpFin v2
#28	gpt-4o-mini-2024-07-18: Strong on DuckDB NSQL Leaderboard (all_execution_accuracy) and DuckDB NSQL Leaderboard (hard_execution_accuracy)	12.6%	24%	—	DuckDB NSQL Leaderboard
#30	gpt-4.1: Strong on DuckDB NSQL Leaderboard (all_execution_accuracy) and DuckDB NSQL Leaderboard (hard_execution_accuracy)	12.3%	18%	$3.50	DuckDB NSQL Leaderboard
#31	gpt-5.1-2025-11-13: Strong on Vals Finance Agent (overall_accuracy_pct) and Vals CorpFin v2 (overall_accuracy_pct)	12.3%	19%	—	Vals Finance Agent, Vals CorpFin v2
#32	grok-4-fast-reasoning: Strong on Vals CorpFin v2 (overall_accuracy_pct) and Vals Finance Agent (overall_accuracy_pct)	12.2%	23%	$0.28	Vals CorpFin v2, Vals Finance Agent
#35	claude-opus-4-6: Strong on LiveSQLBench (success_rate_pct) and AgentSet LLM Leaderboard (elo_score)	11.6%	14%	$10.00	LiveSQLBench, AgentSet LLM Leaderboard
#36	o4-mini: Strong on LiveSQLBench (success_rate_pct) and Vals CorpFin v2 (overall_accuracy_pct)	11.1%	22%	$1.93	LiveSQLBench, Vals CorpFin v2
#38	gemini-2.5-flash: Strong on FACTS Benchmark Suite (facts_grounding_score_pct) and Vectara HHEM Leaderboard (overall_hallucination_error_pct)	11.0%	18%	$0.17	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#39	Qwen3-32B: Strong on DuckDB NSQL Leaderboard (all_execution_accuracy) and Vectara HHEM Leaderboard (overall_hallucination_error_pct)	10.7%	18%	—	DuckDB NSQL Leaderboard, Vectara HHEM Leaderboard

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals CorpFin v2

42 rows · 1.1% avg lift

Vals Legal Bench

32 rows · 0.3% avg lift

Vals MedQA

32 rows · 0.3% avg lift

Vals Finance Agent

31 rows · 1.0% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.text_to_sql · task.sql_debugging

Required modes

mode.json_schema

Domains

domain.data_analytics_bi

Related in Data & Analytics

SQL debugging

Diagnose and fix SQL queries for correctness and performance.

Metric definition workshop

Turn ambiguous KPI definitions into precise, measurable specs.

Dashboard narratives

Generate weekly KPI narratives and investigation suggestions.

Chart & Data Visualization Interpretation

Reading charts, graphs, and dashboards to extract insights and answer questions.