DevOps & SRE

Config debugging

Diagnose and patch YAML/JSON/TOML configs with minimal diffs.

task.config_debugging
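
As a rough illustration of what "patch with minimal diffs" can mean in practice, the sketch below fixes one wrong value in a JSON config by editing the text directly, so that indentation and key order stay untouched, and then prints a unified diff. The config contents and the helper names `patch_line` and `unified_diff` are hypothetical and only serve the example; this is not the method used by any benchmark listed on this page.

```python
import difflib
import json

# Hypothetical broken config: "port" is a string where an integer is expected.
ORIGINAL = """{
  "service": "api",
  "port": "8080",
  "replicas": 3
}
"""


def patch_line(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old`, leaving all other text untouched."""
    if text.count(old) != 1:
        raise ValueError(f"expected exactly one match for {old!r}")
    return text.replace(old, new, 1)


def unified_diff(before: str, after: str, name: str = "config.json") -> str:
    """Return a unified diff between two versions of the same file."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
        )
    )


patched = patch_line(ORIGINAL, '"port": "8080"', '"port": 8080')
json.loads(patched)  # raises json.JSONDecodeError if the patch broke the syntax
diff = unified_diff(ORIGINAL, patched)
print(diff)
```

Editing the text rather than re-serializing the parsed object is a deliberate choice here: a round trip through `json.dumps` could change whitespace or key order and inflate the diff, which works against format preservation.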

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong claim about which model is best.

Provisional leader

claude-sonnet-4

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score

36.7%

Confidence

49.8%

All ranked models — top 3

🥇

claude-sonnet-4

36.7%

🥈

gemini-2.5-pro

31.7%

🥉

gpt-5-2025-08-07

29.1%

Ranked Models

Evidence Quality

82%

Evidence Points

Top Signal

Galileo Agent Leaderboard v2: Avg AC

All Ranked Models

30 of 30 models

Rank	Model	Strong on	Score	Confidence	Price / 1M	Evidence sources
🥇	claude-sonnet-4	Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct	36.7%	50%	$6.00	Galileo Agent Leaderboard v2, SWE-bench Verified Leaderboard
🥈	gemini-2.5-pro	SWE-bench Verified Leaderboard swe_verified_resolved_pct and Galileo Agent Leaderboard v2 Avg AC	31.7%	45%	$3.44	SWE-bench Verified Leaderboard, Galileo Agent Leaderboard v2
🥉	gpt-5-2025-08-07	SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct	29.1%	36%	—	SWE-bench Verified Leaderboard, Aider Polyglot Leaderboard
#4	gpt-4.1-20250414	Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct	27.8%	38%	—	Galileo Agent Leaderboard v2, SWE-bench Verified Leaderboard
#5	o3-20250416	SWE-bench Verified Leaderboard swe_verified_resolved_pct and OSWorld Official Leaderboard overall_score_pct	24.4%	35%	$3.50	SWE-bench Verified Leaderboard, OSWorld Official Leaderboard
#6	gpt-5-mini-2025-08-07	SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals MedQA overall_accuracy_pct	23.6%	35%	—	SWE-bench Verified Leaderboard, Vals MedQA
#7	gpt-5.2-2025-12-11	SWE-bench Verified Leaderboard swe_verified_resolved_pct and FACTS Benchmark Suite facts_grounding_score_pct	23.0%	27%	—	SWE-bench Verified Leaderboard, FACTS Benchmark Suite
#8	gemini-3-pro-preview	SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals Mortgage Tax overall_accuracy_pct	22.7%	29%	$4.50	SWE-bench Verified Leaderboard, Vals Mortgage Tax
#9	Grok-4-0709	Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	22.2%	33%	—	Galileo Agent Leaderboard v2
#10	claude-opus-4-5-20251101	SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals Mortgage Tax overall_accuracy_pct	21.5%	25%	—	SWE-bench Verified Leaderboard, Vals Mortgage Tax
#11	gemini-3.1-pro-preview	Vals LiveCodeBench overall_accuracy_pct and Vals SWE-bench overall_accuracy_pct	20.1%	22%	$4.50	Vals LiveCodeBench, Vals SWE-bench
#12	gpt-4.1-mini-20250414	Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct	19.3%	29%	—	Galileo Agent Leaderboard v2, SWE-bench Verified Leaderboard
#13	kimi-k2.5-thinking	OSWorld Official Leaderboard overall_score_pct and Vals CorpFin v2 overall_accuracy_pct	18.4%	23%	—	OSWorld Official Leaderboard, Vals CorpFin v2
#14	gemini-2.5-flash	Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	16.9%	27%	$0.17	Galileo Agent Leaderboard v2
#15	o4-mini	SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct	16.8%	27%	$1.93	SWE-bench Verified Leaderboard, Aider Polyglot Leaderboard
#16	Kimi K2 Thinking	SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals MedQA overall_accuracy_pct	16.8%	24%	$1.07	SWE-bench Verified Leaderboard, Vals MedQA
#17	Kimi-K2-Instruct	Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct	16.4%	22%	—	Galileo Agent Leaderboard v2, SWE-bench Verified Leaderboard
#18	gemini-3-flash-preview	Vals Legal Bench overall_accuracy_pct and Vals MedQA overall_accuracy_pct	16.3%	21%	$1.13	Vals Legal Bench, Vals MedQA
#19	claude-opus-4	SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct	16.3%	20%	$10.00	SWE-bench Verified Leaderboard, Aider Polyglot Leaderboard
#20	gpt-4.1	SWE-bench Verified Leaderboard swe_verified_resolved_pct and DuckDB NSQL Leaderboard all_execution_accuracy	15.4%	19%	$3.50	SWE-bench Verified Leaderboard, DuckDB NSQL Leaderboard
#21	gpt-5.4-2026-03-05	Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals MedQA overall_accuracy_pct	15.3%	18%	—	Vectara HHEM Leaderboard, Vals MedQA
#22	gpt-5.1-2025-11-13	Vals Case Law v2 overall_accuracy_pct and Vals MedScribe overall_accuracy_pct	14.9%	19%	—	Vals Case Law v2, Vals MedScribe
#23	claude-sonnet-4.6	Vals Tax Eval v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	14.8%	18%	$6.00	Vals Tax Eval v2, Vals Finance Agent
#24	gpt-4o	MEGA-Bench app:Planning and SWE-bench Verified Leaderboard swe_verified_resolved_pct	14.8%	25%	$0.26	MEGA-Bench, SWE-bench Verified Leaderboard
#27	qwen-2.5-72b-instruct	Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	14.2%	21%	—	Galileo Agent Leaderboard v2
#28	claude-opus-4-6-thinking	Vals SWE-bench overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct	14.0%	16%	—	Vals SWE-bench, Vals Mortgage Tax
#29	claude-opus-4-6	SWE-bench Verified Leaderboard swe_verified_resolved_pct and AgentSet LLM Leaderboard elo_score	13.8%	16%	$10.00	SWE-bench Verified Leaderboard, AgentSet LLM Leaderboard
#30	claude-opus-4-5-20251101-thinking	Vals MedQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct	13.6%	16%	—	Vals MedQA, Vals Mortgage Tax
#31	gemini-3.1-flash-lite-preview	Vals Mortgage Tax overall_accuracy_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	13.5%	20%	$0.56	Vals Mortgage Tax, Vectara HHEM Leaderboard
#33	grok-4-fast-reasoning	Vals CorpFin v2 overall_accuracy_pct and Vals Tax Eval v2 overall_accuracy_pct	12.4%	21%	$0.28	Vals CorpFin v2, Vals Tax Eval v2

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals Legal Bench

44 rows · 0.7% avg lift

Vals Tax Eval v2

43 rows · 0.6% avg lift

Vals LiveCodeBench

43 rows · 0.6% avg lift

Vals GPQA

43 rows · 0.5% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.config_debugging

Required modes

mode.format_preservation

Domains

domain.devops_sre

Related in DevOps & SRE

Runbook step assistant

Suggest safe runbook steps and escalation points grounded in docs.

Log triage

Interpret logs and propose safe diagnostic steps.

Kubernetes manifest generation

Generate K8s manifests with safe defaults and probes.

Terraform generation

Generate Terraform IaC with correct resources and safe defaults.