Best LLM for Code Generation
Benchmark-backed ranking of models for generating correct, secure code from requirements.
Full analysis available: benchmark methodology, patterns in the data, and deployment notes.
Provisional leader: claude-opus-4-6 (`external/anthropic/claude-opus-4-6`)
Best current option from the available benchmark evidence, but not yet a strong winner claim.

- Score: 26.6%
- Confidence: 29.3%
- Evidence points: 19
- Cost: $10.00 per 1M tokens
Summary

- Ranked models: 30
- Evidence quality: 82%
- Evidence points: 19
- Top signal: SWE-bench Verified Leaderboard (swe_verified_resolved_pct)
All Ranked Models
| Rank | Model | Strongest signals | Score |
|---|---|---|---|
| 🥇 #1 | claude-opus-4-6 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); OpenHands Issue Resolution (issue_resolution_score_pct) | 26.6% |
| 🥈 #2 | gpt-5-2025-08-07 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 24.4% |
| #4 | claude-sonnet-4 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Sonar Java Quality Leaderboard (functional_skill_pct) | 22.2% |
| #7 | GLM-5 | Sonar Java Quality Leaderboard (functional_skill_pct); Sonar Java Quality Leaderboard (issue_density_error_per_kloc) | 19.8% |
| #9 | Kimi K2 Thinking | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Sonar Java Quality Leaderboard (functional_skill_pct) | 19.0% |
| #12 | claude-sonnet-4.6 | OpenHands Issue Resolution (issue_resolution_score_pct); OpenHands Index (issue_resolution_score_pct) | 17.2% |
| #13 | gpt-4o-2024-05-13 | RepoQA Official Results (overall_average_pass_at_1_pct); SWE-bench Verified Leaderboard (swe_verified_resolved_pct) | 16.3% |
| #14 | gemini-3-pro-preview | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); AppBench Leaderboard (percentile_score_pct) | 15.4% |
| #17 | o3-20250416 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 14.4% |
| #19 | deepseek-r1 | Sonar Java Quality Leaderboard (functional_skill_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 14.3% |
| #20 | gpt-4o-20241120 | BigCodeBench Official (bigcodebench_complete_pct); BigCodeBench Official (bigcodebench_instruct_pct) | 14.3% |
| #21 | minimax-m2.1 | Sonar Java Quality Leaderboard (functional_skill_pct); Vals SWE-bench (overall_accuracy_pct) | 13.8% |
| #22 | gpt-5.2-2025-12-11 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Vals SWE-bench (overall_accuracy_pct) | 13.4% |
| #23 | gemini-2.5-pro | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Galileo Agent Leaderboard v2 (Avg AC) | 13.0% |
| #25 | claude-opus-4 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 12.6% |
| #26 | kimi-k2.5-thinking | Vals LiveCodeBench (overall_accuracy_pct); Vals SWE-bench (overall_accuracy_pct) | 12.3% |
| #27 | gpt-4.1 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 12.2% |
| #28 | gpt-4.1-20250414 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Galileo Agent Leaderboard v2 (Avg AC) | 12.1% |
| #30 | claude-opus-4-5-20251101 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Vals Terminal-Bench 2 (overall_accuracy_pct) | 11.8% |
| #31 | o4-mini | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 11.7% |
| #32 | glm-4.7 | Sonar Java Quality Leaderboard (functional_skill_pct); Sonar Java Quality Leaderboard (issue_density_error_per_kloc) | 11.6% |
| #35 | gpt-5-mini-2025-08-07 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Vals LiveCodeBench (overall_accuracy_pct) | 11.4% |
| #37 | qwen-2.5-72b-instruct | BigCodeBench Official (bigcodebench_complete_pct); Galileo Agent Leaderboard v2 (Avg AC) | 10.7% |
| #42 | gpt-4o | Sonar Java Quality Leaderboard (functional_skill_pct); SWE-bench Verified Leaderboard (swe_verified_resolved_pct) | 9.8% |
| #47 | gpt-4o-2024-08-06 | BigCodeBench Official (bigcodebench_hard_complete_pct); Aider Code Editing Leaderboard (percent_correct_pct) | 9.3% |
| #51 | deepseek-v3 | BigCodeBench Official (bigcodebench_complete_pct); BigCodeBench Official (bigcodebench_instruct_pct) | 8.5% |
| #52 | Grok-4-0709 | Vals LiveCodeBench (overall_accuracy_pct); Galileo Agent Leaderboard v2 (Avg AC) | 8.3% |
| #56 | gpt-4o-mini-2024-07-18 | BigCodeBench Official (bigcodebench_complete_pct); BigCodeBench Official (bigcodebench_instruct_pct) | 7.8% |
| #57 | Llama 3.3 70B Instruct | BigCodeBench Official (bigcodebench_complete_pct); BigCodeBench Official (bigcodebench_instruct_pct) | 7.6% |
| #58 | Phi-3-medium-128k-instruct | RepoQA Official Results (overall_average_pass_at_1_pct); BigCodeBench Official (bigcodebench_complete_pct) | 7.4% |
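For programmatic comparison, rows from the table above can be loaded as structured records. A minimal sketch, with a handful of rows hard-coded from the table for illustration (there is no official API or data file assumed here; the `RankedModel` type and `score_gap` helper are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class RankedModel:
    rank: int        # position in the lookup (ranks may skip numbers)
    name: str        # model identifier as shown in the table
    score_pct: float # composite lookup score, in percent

# A few rows transcribed from the ranking table above.
RANKING = [
    RankedModel(1, "claude-opus-4-6", 26.6),
    RankedModel(2, "gpt-5-2025-08-07", 24.4),
    RankedModel(4, "claude-sonnet-4", 22.2),
    RankedModel(7, "GLM-5", 19.8),
]

def score_gap(a: RankedModel, b: RankedModel) -> float:
    """Absolute score difference in percentage points, rounded to one decimal."""
    return round(abs(a.score_pct - b.score_pct), 1)

# Gap between the top two entries: 26.6 - 24.4 = 2.2 points.
print(score_gap(RANKING[0], RANKING[1]))
```

The 2.2-point gap between #1 and #2 is why the page labels claude-opus-4-6 a "provisional leader" rather than a strong winner.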
Head-to-Head: #1 vs #2
- #1 (top pick): claude-opus-4-6. Strong on SWE-bench Verified Leaderboard (swe_verified_resolved_pct) and OpenHands Issue Resolution (issue_resolution_score_pct). Confidence: 29.3%.
- #2: gpt-5-2025-08-07. Strong on SWE-bench Verified Leaderboard (swe_verified_resolved_pct) and Aider Polyglot Leaderboard (percent_correct_pct). Confidence: 28.5%.
Related Lookups
- Best LLM for Debugging: find the top-ranked models for localizing bugs and proposing fixes with explanations.
- Best LLM for Unit Test Generation: ranked models for generating meaningful unit tests and edge cases from code.
- Best LLM for Code Review: compare models for automated PR review covering correctness, security, and maintainability.
- Best LLM for Autonomous Coding: benchmark-backed ranking of models for end-to-end autonomous software engineering and issue resolution.
- Best LLM for Function Calling: compare models for reliable tool use, function selection, and multi-step API orchestration.
- Best LLM for Refactoring: ranked models for safely refactoring code while preserving behavior and improving clarity.