Best LLM for Code Generation
Benchmark-backed ranking of models for generating correct, secure code from requirements.
Full analysis available: benchmark methodology, patterns in the data, and deployment notes.
Provisional leader: claude-opus-4-6 (`external/anthropic/claude-opus-4-6`)
Best current option from the available benchmark evidence, but not yet a strong winner claim.

- Score: 26.6%
- Confidence: 29.3%
- Evidence points: 19
- Cost: $10.00 per 1M tokens
Summary

- Ranked models: 30
- Evidence quality: 82%
- Evidence points: 19
- Top signal: SWE-bench Verified Leaderboard (swe_verified_resolved_pct)
All Ranked Models
| Rank | Model | Strongest signals | Score |
|---|---|---|---|
| 🥇 #1 | claude-opus-4-6 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); OpenHands Issue Resolution (issue_resolution_score_pct) | 26.6% |
| 🥈 #2 | gpt-5-2025-08-07 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 24.4% |
| #4 | claude-sonnet-4 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Sonar Java Quality Leaderboard (functional_skill_pct) | 22.2% |
| #7 | GLM-5 | Sonar Java Quality Leaderboard (functional_skill_pct); Sonar Java Quality Leaderboard (issue_density_error_per_kloc) | 19.8% |
| #9 | Kimi K2 Thinking | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Sonar Java Quality Leaderboard (functional_skill_pct) | 19.0% |
| #12 | claude-sonnet-4.6 | OpenHands Issue Resolution (issue_resolution_score_pct); OpenHands Index (issue_resolution_score_pct) | 17.2% |
| #13 | gpt-4o-2024-05-13 | RepoQA Official Results (overall_average_pass_at_1_pct); SWE-bench Verified Leaderboard (swe_verified_resolved_pct) | 16.3% |
| #14 | gemini-3-pro-preview | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); AppBench Leaderboard (percentile_score_pct) | 15.4% |
| #17 | o3-20250416 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 14.4% |
| #19 | deepseek-r1 | Sonar Java Quality Leaderboard (functional_skill_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 14.3% |
| #20 | gpt-4o-20241120 | BigCodeBench Official (bigcodebench_complete_pct); BigCodeBench Official (bigcodebench_instruct_pct) | 14.3% |
| #21 | minimax-m2.1 | Sonar Java Quality Leaderboard (functional_skill_pct); Vals SWE-bench (overall_accuracy_pct) | 13.8% |
| #22 | gpt-5.2-2025-12-11 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Vals SWE-bench (overall_accuracy_pct) | 13.4% |
| #23 | gemini-2.5-pro | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Galileo Agent Leaderboard v2 (Avg AC) | 13.0% |
| #25 | claude-opus-4 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 12.6% |
| #26 | kimi-k2.5-thinking | Vals LiveCodeBench (overall_accuracy_pct); Vals SWE-bench (overall_accuracy_pct) | 12.3% |
| #27 | gpt-4.1 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 12.2% |
| #28 | gpt-4.1-20250414 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Galileo Agent Leaderboard v2 (Avg AC) | 12.1% |
| #30 | claude-opus-4-5-20251101 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Vals Terminal-Bench 2 (overall_accuracy_pct) | 11.8% |
| #31 | o4-mini | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 11.7% |
| #32 | glm-4.7 | Sonar Java Quality Leaderboard (functional_skill_pct); Sonar Java Quality Leaderboard (issue_density_error_per_kloc) | 11.6% |
| #35 | gpt-5-mini-2025-08-07 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Vals LiveCodeBench (overall_accuracy_pct) | 11.4% |
| #37 | qwen-2.5-72b-instruct | BigCodeBench Official (bigcodebench_complete_pct); Galileo Agent Leaderboard v2 (Avg AC) | 10.7% |
| #42 | gpt-4o | Sonar Java Quality Leaderboard (functional_skill_pct); SWE-bench Verified Leaderboard (swe_verified_resolved_pct) | 9.8% |
| #47 | gpt-4o-2024-08-06 | BigCodeBench Official (bigcodebench_hard_complete_pct); Aider Code Editing Leaderboard (percent_correct_pct) | 9.3% |
| #51 | deepseek-v3 | BigCodeBench Official (bigcodebench_complete_pct); BigCodeBench Official (bigcodebench_instruct_pct) | 8.5% |
| #52 | Grok-4-0709 | Vals LiveCodeBench (overall_accuracy_pct); Galileo Agent Leaderboard v2 (Avg AC) | 8.3% |
| #56 | gpt-4o-mini-2024-07-18 | BigCodeBench Official (bigcodebench_complete_pct); BigCodeBench Official (bigcodebench_instruct_pct) | 7.8% |
| #57 | Llama 3.3 70B Instruct | BigCodeBench Official (bigcodebench_complete_pct); BigCodeBench Official (bigcodebench_instruct_pct) | 7.6% |
| #58 | Phi-3-medium-128k-instruct | RepoQA Official Results (overall_average_pass_at_1_pct); BigCodeBench Official (bigcodebench_complete_pct) | 7.4% |
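For programmatic comparison, rows from the table above can be loaded as structured records. A minimal sketch, with a handful of rows hard-coded from the table for illustration (there is no official API or data file assumed here; the `RankedModel` type and `score_gap` helper are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class RankedModel:
    rank: int        # position in the lookup (ranks may skip numbers)
    name: str        # model identifier as shown in the table
    score_pct: float # composite lookup score, in percent

# A few rows transcribed from the ranking table above.
RANKING = [
    RankedModel(1, "claude-opus-4-6", 26.6),
    RankedModel(2, "gpt-5-2025-08-07", 24.4),
    RankedModel(4, "claude-sonnet-4", 22.2),
    RankedModel(7, "GLM-5", 19.8),
]

def score_gap(a: RankedModel, b: RankedModel) -> float:
    """Absolute score difference in percentage points, rounded to one decimal."""
    return round(abs(a.score_pct - b.score_pct), 1)

# Gap between the top two entries: 26.6 - 24.4 = 2.2 points.
print(score_gap(RANKING[0], RANKING[1]))
```

The 2.2-point gap between #1 and #2 is why the page labels claude-opus-4-6 a "provisional leader" rather than a strong winner.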
Head-to-Head: #1 vs #2
- #1 (top pick): claude-opus-4-6. Strong on SWE-bench Verified Leaderboard (swe_verified_resolved_pct) and OpenHands Issue Resolution (issue_resolution_score_pct). Confidence: 29.3%.
- #2: gpt-5-2025-08-07. Strong on SWE-bench Verified Leaderboard (swe_verified_resolved_pct) and Aider Polyglot Leaderboard (percent_correct_pct). Confidence: 28.5%.
Related Lookups
- Best LLM for Debugging: find the top-ranked models for localizing bugs and proposing fixes with explanations.
- Best LLM for Unit Test Generation: ranked models for generating meaningful unit tests and edge cases from code.
- Best LLM for Code Review: compare models for automated PR review covering correctness, security, and maintainability.
- Best LLM for Autonomous Coding: benchmark-backed ranking of models for end-to-end autonomous software engineering and issue resolution.
- Best LLM for Function Calling: compare models for reliable tool use, function selection, and multi-step API orchestration.
- Best LLM for Refactoring: ranked models for safely refactoring code while preserving behavior and improving clarity.