SQL debugging
Diagnose and fix SQL queries for correctness and performance.
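As a rough illustration of what this task covers, here is a minimal sketch using Python's built-in `sqlite3` module. The schema, the queries, and the index name `idx_orders_customer` are hypothetical and not drawn from any benchmark listed on this page. It shows one correctness bug (a `WHERE` filter that silently turns a `LEFT JOIN` into an inner join) and one performance check (comparing `EXPLAIN QUERY PLAN` output before and after adding an index).

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create a tiny in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT
        );
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo'), (3, 'Cy');
        INSERT INTO orders VALUES
            (10, 1, 'paid'), (11, 1, 'paid'), (12, 2, 'refunded');
        """
    )
    return conn


# Correctness bug: filtering the right-hand table in WHERE discards the
# NULL-extended rows, so customers without paid orders disappear.
BUGGY = """
SELECT c.name, COUNT(o.id) AS paid_orders
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'paid'
GROUP BY c.id, c.name
ORDER BY c.id
"""

# Fix: move the filter into the join condition so every customer is kept.
FIXED = """
SELECT c.name, COUNT(o.id) AS paid_orders
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'paid'
GROUP BY c.id, c.name
ORDER BY c.id
"""

LOOKUP = "SELECT id FROM orders WHERE customer_id = ?"


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = build_db()
buggy_rows = conn.execute(BUGGY).fetchall()
fixed_rows = conn.execute(FIXED).fetchall()

# Performance check: inspect the plan before and after adding an index.
plan_before = query_plan(conn, LOOKUP, (1,))
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, LOOKUP, (1,))

print(buggy_rows)
print(fixed_rows)
print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so the sketch only prints it. In practice the thing to look for is a full table scan being replaced by an index search.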
Provisional leader
deepseek-r1
Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.
24.7%
Best benchmark score
45.4%
Confidence
All ranked models: top 3
Ranked Models
30
Evidence Quality
83%
Evidence Points
21
Top Signal
DuckDB NSQL Leaderboard: all_execution_accuracy
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| 🥇 | deepseek-r1 | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 24.7% |
| 🥈 | gpt-4o-20241120 | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 24.4% |
| #4 | gpt-4o | DuckDB NSQL Leaderboard all_execution_accuracy and JSONSchemaBench Leaderboard medium_schema_compliance_pct | 23.6% |
| #6 | o3-20250416 | Spider2.0 Snow Text-to-SQL snow_text_to_sql_score_pct and LiveSQLBench success_rate_pct | 20.4% |
| #7 | gpt-5-2025-08-07 | Spider2.0 Snow Text-to-SQL snow_text_to_sql_score_pct and LiveSQLBench success_rate_pct | 19.7% |
| #8 | qwen-2.5-72b-instruct | DuckDB NSQL Leaderboard all_execution_accuracy and JSONSchemaBench Leaderboard medium_schema_compliance_pct | 18.6% |
| #9 | claude-sonnet-4 | LiveSQLBench success_rate_pct and Spider2.0 Lite Text-to-SQL lite_text_to_sql_score_pct | 18.4% |
| #15 | gpt-4o-mini-2024-07-18 | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 14.7% |
| #16 | gpt-4.1 | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 14.6% |
| #17 | Claude-3.5-Sonnet | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 14.3% |
| #19 | gpt-4o-2024-08-06 | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 13.1% |
| #25 | gemini-2.0-flash-001 | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 11.8% |
| #27 | deepseek-v3 | LiveSQLBench success_rate_pct and BIRD-CRITIC success_rate_open_pct | 11.7% |
| #28 | Qwen3-30B-A3B | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 11.4% |
| #30 | Llama-3.3-70B-Instruct | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 11.2% |
| #31 | Qwen2.5-Coder-7B | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 10.9% |
| #37 | o4-mini | LiveSQLBench success_rate_pct and Aider Polyglot Leaderboard percent_correct_pct | 10.3% |
| #39 | gemma-2-27b-it | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 10.0% |
| #40 | phi-4 | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 9.7% |
| #41 | Qwen3-32B | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 9.6% |
| #42 | claude-opus-4-6 | LiveSQLBench success_rate_pct and AgentSet LLM Leaderboard elo_score | 9.6% |
| #43 | Phi-3-medium-128k-instruct | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 9.4% |
| #45 | gpt-4.1-20250414 | MMLongBench-Doc Leaderboard acc_score_pct and Galileo Agent Leaderboard v2 Avg AC | 9.0% |
| #46 | QwQ-32B-Preview | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 8.9% |
| #47 | gemini-2.5-pro | Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ | 8.5% |
| #49 | Grok-4-0709 | Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC | 8.4% |
| #50 | Meta-Llama-3.1-8B | DuckDB NSQL Leaderboard all_execution_accuracy and DuckDB NSQL Leaderboard hard_execution_accuracy | 8.4% |
| #52 | minimax-m2.1 | LiveSQLBench success_rate_pct and Vals SWE-bench overall_accuracy_pct | 8.1% |
| #54 | gemini-3-pro-preview | BFCL Multi-turn Official Multi Turn Acc and Vals Mortgage Tax overall_accuracy_pct | 7.9% |
| #56 | gpt-5-mini-2025-08-07 | Vals MedQA overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct | 7.5% |
Ranking diagnostics & missing models
Source lift
Ranked
38
Sources
8
Quality
Low
DuckDB NSQL Leaderboard
EQ-Bench Leaderboard
Vals Legal Bench
Vals MedQA
Missing frontier models
gpt-5.2-2025-12-11
Thin evidence after weighting
Rank #9
21.9%
claude-opus-4-5-20251101
Thin evidence after weighting
Rank #10
18.6%
claude-sonnet-4.6
Thin evidence after weighting
Rank #11
20.0%
grok-4-1-fast-reasoning
Thin evidence after weighting
Rank #12
19.5%
Taxonomy & task details
Core tasks
Required modes
Domains
Related in Data & Analytics
Metric definition workshop
Turn ambiguous KPI definitions into precise, measurable specs.
Dashboard narratives
Generate weekly KPI narratives and investigation suggestions.
Chart & Data Visualization Interpretation
Reading charts, graphs, and dashboards to extract insights and answer questions.
Text-to-SQL analyst assistant
Convert questions into SQL and explain the query.