BasedAGI


Best LLM for Autonomous Coding

Benchmark-backed ranking of models for end-to-end autonomous software engineering and issue resolution.

Interest in this use case is high, but the benchmark evidence for it is still limited. Treat the leader below as provisional.

Provisional leader

claude-opus-4-6

Best current option based on the available benchmark evidence, but the lead is not yet strong enough to call a clear winner.

external/anthropic/claude-opus-4-6

Score: 28.4%
Confidence: 30.9%
Evidence: 19
$10.00 per 1M tokens

Ranked Models: 30
Evidence Quality: 82%
Evidence Points: 19
Top Signal: SWE-bench Verified Leaderboard: swe_verified_resolved_pct

All Ranked Models

30 of 30 models

| Rank | Model | Strong on | Score |
| --- | --- | --- | --- |
| 🥇 | claude-opus-4-6 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and OpenHands Index average_score_pct | 28.4% |
| 🥈 | gpt-5-2025-08-07 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 24.9% |
| #4 | claude-sonnet-4 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 22.0% |
| #9 | GLM-5 | OpenHands Issue Resolution issue_resolution_score_pct and Sonar Java Quality Leaderboard functional_skill_pct | 15.9% |
| #10 | Kimi K2 Thinking | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Sonar Java Quality Leaderboard functional_skill_pct | 15.9% |
| #11 | claude-sonnet-4.6 | OpenHands Issue Resolution issue_resolution_score_pct and OpenHands Index greenfield_score_pct | 15.8% |
| #13 | o3-20250416 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 15.2% |
| #15 | gemini-3-pro-preview | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals SWE-bench overall_accuracy_pct | 14.5% |
| #16 | gemini-2.5-pro | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Galileo Agent Leaderboard v2 Avg AC | 14.3% |
| #18 | gpt-4o-2024-05-13 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and RepoQA Official Results overall_average_pass_at_1_pct | 14.2% |
| #19 | claude-opus-4 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 13.4% |
| #20 | gpt-4.1-20250414 | Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct | 13.3% |
| #21 | gpt-4.1 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 13.3% |
| #22 | gpt-4o-20241120 | BigCodeBench Official bigcodebench_complete_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct | 13.3% |
| #23 | kimi-k2.5-thinking | OpenHands Index average_score_pct and Vals LiveCodeBench overall_accuracy_pct | 13.2% |
| #24 | gpt-5.2-2025-12-11 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals SWE-bench overall_accuracy_pct | 13.1% |
| #27 | o4-mini | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 12.2% |
| #30 | claude-opus-4-5-20251101 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals Terminal-Bench 2 overall_accuracy_pct | 11.9% |
| #31 | deepseek-r1 | Aider Polyglot Leaderboard percent_correct_pct and Sonar Java Quality Leaderboard functional_skill_pct | 11.5% |
| #32 | gpt-5-mini-2025-08-07 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals LiveCodeBench overall_accuracy_pct | 11.1% |
| #33 | qwen-2.5-72b-instruct | Galileo Agent Leaderboard v2 Avg AC and BigCodeBench Official bigcodebench_complete_pct | 11.1% |
| #34 | minimax-m2.1 | Sonar Java Quality Leaderboard functional_skill_pct and Vals SWE-bench overall_accuracy_pct | 10.7% |
| #36 | gpt-4o | SWE-bench Verified Leaderboard swe_verified_resolved_pct and τ-bench Airline (Official README) tau_airline_pass1_pct | 10.5% |
| #44 | Grok-4-0709 | Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ | 8.9% |
| #45 | deepseek-v3 | BigCodeBench Official bigcodebench_complete_pct and Galileo Agent Leaderboard v2 Avg AC | 8.8% |
| #46 | gpt-4o-2024-08-06 | BigCodeBench Official bigcodebench_hard_complete_pct and Aider Code Editing Leaderboard percent_correct_pct | 8.8% |
| #47 | Kimi-K2-Instruct | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Galileo Agent Leaderboard v2 Avg AC | 8.7% |
| #49 | glm-4.7 | Sonar Java Quality Leaderboard functional_skill_pct and Vals LiveCodeBench overall_accuracy_pct | 8.0% |
| #51 | gpt-4.1-mini-20250414 | Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct | 7.9% |
| #54 | Llama 3.3 70B Instruct | BigCodeBench Official bigcodebench_complete_pct and BigCodeBench Official bigcodebench_instruct_pct | 7.0% |

Head-to-Head: #1 vs #2

#1 (Top Pick): claude-opus-4-6
Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and OpenHands Index average_score_pct
Score 28.4%, Conf 30.9%

#2: gpt-5-2025-08-07
Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct
Score 24.9%, Conf 28.7%
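
To make the head-to-head comparison concrete, here is a minimal sketch that computes the gap between the two top entries from the figures listed above. The dictionary layout and the `gap_summary` helper are illustrative assumptions, not part of the site's scoring method, which this page does not describe.

```python
# Scores and confidence values copied from the head-to-head entries above.
# The field names and this helper are hypothetical, for illustration only.
leader = {"model": "claude-opus-4-6", "score_pct": 28.4, "conf_pct": 30.9}
runner_up = {"model": "gpt-5-2025-08-07", "score_pct": 24.9, "conf_pct": 28.7}


def gap_summary(first, second):
    """Return the absolute and relative score gap, plus the confidence gap."""
    gap = first["score_pct"] - second["score_pct"]
    return {
        # Difference in percentage points between the two scores.
        "gap_points": round(gap, 1),
        # Gap expressed relative to the runner-up's score.
        "relative_gap_pct": round(100 * gap / second["score_pct"], 1),
        # Difference in the reported confidence values.
        "conf_gap_points": round(first["conf_pct"] - second["conf_pct"], 1),
    }


summary = gap_summary(leader, runner_up)
print(summary)
```

On the listed figures this gives a 3.5-point score gap (about 14.1% relative to the runner-up) and a 2.2-point confidence gap, which is consistent with the page treating the leader as provisional.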
