Code generation is the most heavily benchmarked use case in the entire LLM landscape — and the one where the gap between benchmark performance and real-world usefulness is widest. Every major lab leads with coding benchmarks because they're objective, automatable, and make for good press releases. The model that "solves 70% of SWE-bench" sounds compelling until you realize that's measured in a very specific scaffolded evaluation environment that may or may not match how you're actually using it.
This report cuts through that by aggregating across multiple coding benchmarks, weighting them by what they predict about real-world coding utility, and giving you a ranking you can actually act on.
What "Code Generation" Actually Covers
Code generation is not one task — it's a cluster of related but distinct capabilities:
Function completion — Given a function signature and docstring, write the body. This is the oldest and most benchmarked coding task. Models that do well here have absorbed a large amount of code pattern knowledge.
Multi-file problem solving — Given a real codebase, fix a bug or implement a feature that requires changes across multiple files. This is what SWE-bench measures and is much harder. It requires understanding codebases, not just writing code from scratch.
Test generation — Write unit tests for existing code. This is a distinct capability from writing the code itself, and models that excel at it tend to have good abstract understanding of program behavior.
Refactoring — Improve existing code without changing behavior. Aider benchmarks track this specifically. Good refactoring requires understanding what code does, not just how to write code.
Instruction following for code — Given natural language requirements, produce code that satisfies them. The natural language understanding component matters here as much as the coding ability.
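The first of these capabilities is the easiest to see concretely. A minimal, hypothetical function-completion task (in the style of HumanEval, though not an actual benchmark problem) gives the model a signature and docstring; everything under the marker comment is the kind of body a capable model would be expected to produce:

```python
# Hypothetical function-completion task: the prompt is the signature plus
# docstring; the model must generate the body.
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i+1]."""
    # --- a correct completion a model might generate ---
    result = []
    current = None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result

print(running_max([3, 1, 4, 1, 5]))  # [3, 3, 4, 4, 5]
```

Benchmarks like HumanEval grade completions of this shape by running hidden unit tests against the generated body, which is what makes the task objective and automatable.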
The correlation between benchmark scores and real IDE usefulness is moderate, not strong. Models that perform well in scaffolded evaluations don't always perform well in the messy, context-dependent reality of an active codebase. This is why we aggregate across multiple benchmarks rather than relying on any single score.
The Benchmark Landscape
SWE-bench Verified is the most demanding real-world coding benchmark. It uses actual GitHub issues from popular open-source projects — real bugs that real developers filed. Solving them requires understanding existing codebases, not just writing code from scratch. High SWE-bench scores are the strongest signal for multi-file coding ability.
LiveCodeBench is a contamination-resistant competitive programming benchmark that continuously adds new problems. Because problems are added after most models' training cutoffs, it can't be gamed by memorization. Strong LiveCodeBench performance indicates genuine problem-solving ability.
BigCodeBench extends HumanEval to more complex, realistic programming tasks with diverse library use. It's harder than HumanEval and better predicts real-world code quality.
Aider benchmarks measure real coding agent performance — how well the model performs when given a code editing tool and asked to complete realistic tasks. These are the most practically relevant for developers using AI coding assistants.
Current Rankings
| # | Model | Score |
|---|---|---|
| 1 | claude-opus-4-6 | 26.6 |
| 2 | gpt-5-2025-08-07 | 24.4 |
| 3 | claude-sonnet-4 | 22.2 |
| 4 | GLM-5 | 19.8 |
| 5 | Kimi K2 Thinking | 19.0 |
| 6 | claude-sonnet-4.6 | 17.2 |
| 7 | gpt-4o-2024-05-13 | 16.3 |
| 8 | gemini-3-pro-preview | 15.4 |
| 9 | o3-20250416 | 14.4 |
| 10 | deepseek-r1 | 14.3 |
| 11 | gpt-4o-20241120 | 14.3 |
| 12 | minimax-m2.1 | 13.8 |
| 13 | gpt-5.2-2025-12-11 | 13.4 |
| 14 | gemini-2.5-pro | 13.0 |
| 15 | claude-opus-4 | 12.6 |
Reading These Rankings
The scores above reflect performance across the benchmark families described earlier, weighted by their predictive validity for practical code generation. A few things to note:
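The aggregation can be sketched as a weighted average over whichever benchmarks a model has scores for. This is an illustrative sketch only: the benchmark names, weights, and input scores below are made-up example values, not the actual methodology or real results.

```python
# Illustrative sketch of score aggregation; weights and scores are
# example values, not the site's actual methodology.
WEIGHTS = {
    "swe_bench_verified": 0.40,
    "livecodebench": 0.25,
    "bigcodebench": 0.15,
    "aider": 0.20,
}

def aggregate(scores: dict[str, float]) -> float:
    """Weighted average, renormalized over the benchmarks present."""
    covered = {b: w for b, w in WEIGHTS.items() if b in scores}
    total = sum(covered.values())
    return sum(scores[b] * (w / total) for b, w in covered.items())

# A model evaluated on only two of the four benchmarks:
print(aggregate({"swe_bench_verified": 62.0, "livecodebench": 48.0}))
```

Renormalizing over the benchmarks actually present keeps models with partial coverage comparable, at the cost of letting a strong score on one benchmark carry more weight than it would for a fully evaluated model.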
Top-ranked models tend to be large. Code generation is one of the tasks where parameter count matters most. The cognitive load of tracking a codebase, understanding intent, generating correct syntax, and satisfying edge cases simultaneously benefits from scale in a way that simpler tasks don't.
Open-weight models are increasingly competitive. The gap between the best open-weight coding models and the best closed-source models has narrowed dramatically. For many code generation tasks — particularly function completion and test generation — the best open-weight models are now practical alternatives to proprietary APIs.
Instruction-tuned variants outperform base models significantly. Code-specific fine-tuning (on code instruction datasets, coding competitions, or RLHF from developer feedback) provides a large lift over pure pretraining. The same base model with and without coding fine-tuning can be several positions apart in these rankings.
When Model Choice Actually Matters
Not all code generation tasks are equally model-sensitive. Here's where the choice of model has the largest impact:
High impact:
- Complex algorithmic problems (sorting, graph algorithms, dynamic programming)
- Debugging non-obvious issues in unfamiliar code
- Generating code that integrates with specific APIs or libraries
- Multi-file refactoring tasks
- Writing code with non-trivial requirements or constraints
Lower impact:
- Standard boilerplate (CRUD operations, REST endpoints, simple classes)
- Well-documented patterns (React components, SQL queries for standard schemas)
- Tasks where you're providing complete context and examples
For boilerplate and standard patterns, mid-tier models are often sufficient and cheaper to run. Save the top-ranked models for the tasks where they actually make a difference.
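One way to act on this split is a simple task router that sends obviously simple work to a cheaper model. The sketch below is a toy keyword heuristic under stated assumptions; the keyword list and the model identifiers are placeholders, not recommendations, and a production router would use something more robust than substring matching.

```python
# Toy sketch of task-based model routing; model names are placeholders.
SIMPLE_KEYWORDS = ("boilerplate", "crud", "endpoint", "getter", "dto")

def choose_model(task_description: str) -> str:
    """Route obviously simple tasks to a cheaper mid-tier model."""
    text = task_description.lower()
    if any(keyword in text for keyword in SIMPLE_KEYWORDS):
        return "mid-tier-model"
    return "top-ranked-model"

print(choose_model("Generate CRUD endpoints for the User table"))  # mid-tier-model
print(choose_model("Debug a race condition in the scheduler"))     # top-ranked-model
```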
Code generation scores don't predict code security. A model that generates syntactically correct, functionally working code may still produce code with SQL injection vulnerabilities, insecure random number generation, or unsafe deserialization. Security-critical code should always be reviewed regardless of which model generated it.
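The SQL injection case is worth seeing concretely. The snippet below, a self-contained sqlite3 example, contrasts the string-concatenation pattern a model may plausibly emit with the parameterized query that fixes it; the table and input are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"

# Vulnerable pattern: string concatenation lets the crafted input
# escape the quotes and match every row.
vulnerable = conn.execute(
    "SELECT role FROM users WHERE name = '" + user_input + "'"
).fetchall()

# Safe pattern: a parameterized query treats the input as a literal value.
safe = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)
).fetchall()

print(vulnerable)  # [('admin',)] -- injection succeeded
print(safe)        # []           -- no user literally named "alice' OR '1'='1"
```

Both versions are syntactically valid and pass a naive "does it return results" check, which is exactly why functional benchmarks don't catch this class of bug.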
Open-Source vs. Proprietary
For teams evaluating open-weight models specifically:
Open-weight models in the top tier of these rankings are genuinely production-viable for most code generation tasks. The practical decision factors are:
- Latency — Self-hosted open-weight models can have significantly lower latency at scale than API calls, which matters for IDE integrations
- Privacy — Code is often proprietary; running a local model means your code never leaves your infrastructure
- Cost at scale — At high volume, self-hosted inference costs are substantially lower than API costs
- Fine-tuning — Open-weight models can be fine-tuned on your own codebase; closed models offer at best limited, provider-controlled fine-tuning APIs
For teams with standard workloads and no particular privacy or cost constraints, the top proprietary models still have a small quality edge on the hardest tasks. For most practical use cases, the best open-weight alternatives are within the margin where other factors dominate the decision.
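The cost-at-scale factor reduces to back-of-envelope arithmetic. Every figure in this sketch is an illustrative assumption (API price, GPU rate, and throughput vary widely by provider, hardware, and model), so treat it as a template for your own numbers, not a price comparison:

```python
# Back-of-envelope cost comparison; every figure below is an
# illustrative assumption, not a measured price.
api_cost_per_mtok = 10.00        # USD per million tokens via a hosted API
gpu_hour_cost = 2.50             # USD per self-hosted GPU hour
tokens_per_gpu_hour = 5_000_000  # assumed sustained throughput

monthly_tokens = 2_000_000_000   # 2B tokens/month workload

api_monthly = monthly_tokens / 1e6 * api_cost_per_mtok
self_hosted_monthly = monthly_tokens / tokens_per_gpu_hour * gpu_hour_cost

print(f"API:         ${api_monthly:,.0f}/month")
print(f"Self-hosted: ${self_hosted_monthly:,.0f}/month")
```

Under these assumed numbers the raw inference cost differs by an order of magnitude, though a fair comparison also has to price in engineering time, hardware utilization below peak, and redundancy.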
Related Use Cases
Code generation doesn't exist in isolation. Depending on your actual workflow, an adjacent use case may be a better fit:
- Debugging — If you're fixing bugs more than writing new code, check the debugging-specific use case rankings
- Code review — Reviewing code for correctness and quality is a distinct task from writing it
- Unit test generation — Generating tests from existing code is its own use case with different benchmark characteristics
- Documentation from code — Generating documentation, comments, and README files from code
Full rankings for each of these are available in the use cases browser.
Methodology
Rankings on this page are computed from live benchmark ingestion across the sources described above. Scores update as new benchmark data is ingested. Full methodology at /methodology.