Config debugging
Diagnose and patch YAML/JSON/TOML configs with minimal diffs.
Provisional leader
claude-sonnet-4
Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.
36.7%
Best benchmark score
49.8%
Confidence
All ranked models: top 3
Ranked Models
30
Evidence Quality
82%
Evidence Points
23
Top Signal
Galileo Agent Leaderboard v2: Avg AC
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| 🥇 | claude-sonnet-4 | Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct | 36.7% |
| 🥈 | gemini-2.5-pro | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Galileo Agent Leaderboard v2 Avg AC | 31.7% |
| 🥉 | gpt-5-2025-08-07 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 29.1% |
| #4 | gpt-4.1-20250414 | Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct | 27.8% |
| #5 | o3-20250416 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and OSWorld Official Leaderboard overall_score_pct | 24.4% |
| #6 | gpt-5-mini-2025-08-07 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals MedQA overall_accuracy_pct | 23.6% |
| #7 | gpt-5.2-2025-12-11 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and FACTS Benchmark Suite facts_grounding_score_pct | 23.0% |
| #8 | gemini-3-pro-preview | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals Mortgage Tax overall_accuracy_pct | 22.7% |
| #9 | Grok-4-0709 | Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ | 22.2% |
| #10 | claude-opus-4-5-20251101 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals Mortgage Tax overall_accuracy_pct | 21.5% |
| #11 | gemini-3.1-pro-preview | Vals LiveCodeBench overall_accuracy_pct and Vals SWE-bench overall_accuracy_pct | 20.1% |
| #12 | gpt-4.1-mini-20250414 | Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct | 19.3% |
| #13 | kimi-k2.5-thinking | OSWorld Official Leaderboard overall_score_pct and Vals CorpFin v2 overall_accuracy_pct | 18.4% |
| #14 | gemini-2.5-flash | Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ | 16.9% |
| #15 | o4-mini | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 16.8% |
| #16 | Kimi K2 Thinking | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals MedQA overall_accuracy_pct | 16.8% |
| #17 | Kimi-K2-Instruct | Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct | 16.4% |
| #18 | gemini-3-flash-preview | Vals Legal Bench overall_accuracy_pct and Vals MedQA overall_accuracy_pct | 16.3% |
| #19 | claude-opus-4 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 16.3% |
| #20 | gpt-4.1 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and DuckDB NSQL Leaderboard all_execution_accuracy | 15.4% |
| #21 | gpt-5.4-2026-03-05 | Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals MedQA overall_accuracy_pct | 15.3% |
| #22 | gpt-5.1-2025-11-13 | Vals Case Law v2 overall_accuracy_pct and Vals MedScribe overall_accuracy_pct | 14.9% |
| #23 | claude-sonnet-4.6 | Vals Tax Eval v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct | 14.8% |
| #24 | gpt-4o | MEGA-Bench app:Planning and SWE-bench Verified Leaderboard swe_verified_resolved_pct | 14.8% |
| #27 | qwen-2.5-72b-instruct | Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ | 14.2% |
| #28 | claude-opus-4-6-thinking | Vals SWE-bench overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct | 14.0% |
| #29 | claude-opus-4-6 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and AgentSet LLM Leaderboard elo_score | 13.8% |
| #30 | claude-opus-4-5-20251101-thinking | Vals MedQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct | 13.6% |
| #31 | gemini-3.1-flash-lite-preview | Vals Mortgage Tax overall_accuracy_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct | 13.5% |
| #33 | grok-4-fast-reasoning | Vals CorpFin v2 overall_accuracy_pct and Vals Tax Eval v2 overall_accuracy_pct | 12.4% |
Ranking diagnostics & missing models
Source lift
Ranked
69
Sources
8
Quality
Low
Vals Legal Bench
Vals Tax Eval v2
Vals LiveCodeBench
Vals GPQA
Missing frontier models
No obvious gaps right now.
Taxonomy & task details
Core tasks
Required modes
Domains
Related in DevOps & SRE
Runbook step assistant
Suggest safe runbook steps and escalation points grounded in docs.
Log triage
Interpret logs and propose safe diagnostic steps.
Kubernetes manifest generation
Generate K8s manifests with safe defaults and probes.
Terraform generation
Generate Terraform IaC with correct resources and safe defaults.