Reasoning ability is the most heavily benchmarked dimension in the field — and also the most overhyped. Every major lab releases models with claims about "advanced reasoning" backed by impressive leaderboard numbers. But aggregate leaderboard performance and genuine step-by-step reasoning ability are different things, and the gap between them matters enormously for whether a model is actually useful on hard tasks.
Our IQ score tries to capture the latter: not just whether a model gets the right answer, but whether it can reliably get the right answer on tasks that require genuine deduction.
What IQ Measures in LLMs
Reasoning in language models breaks into several distinguishable capabilities:
Deductive reasoning — Given premises, can the model reliably draw correct conclusions? This sounds basic but is surprisingly hard at scale. Deductive chains longer than a few steps expose models that pattern-match plausible answers rather than actually reason.
Mathematical problem-solving — Not arithmetic (calculators solve that) but multi-step mathematical reasoning: setting up problems, choosing the right approach, carrying the chain of logic through to a correct answer. MATH and MathArena benchmark this rigorously.
Scientific and graduate-level reasoning — GPQA (Graduate-Level Google-Proof Q&A) tests whether a model can reason correctly about biology, chemistry, and physics at a level that requires genuine domain understanding. This is a strong filter against models that rely on surface-level recall.
Abstract and structured thinking — BBH (BIG-Bench Hard) targets tasks that are deliberately difficult for language models: algorithmic reasoning, formal logic, object tracking across complex scenarios. High BBH scores indicate a model that handles novel structured problems rather than pattern-matching to training data.
How We Score It
The IQ dimension weights these benchmark families according to their reliability and discriminative power:
GPQA carries significant weight because it's hard to game — the questions are at a level where memorization doesn't help and actual reasoning is required. Models that score well here are doing something genuinely different from models that score well on easier benchmarks.
MATH / MathArena captures mathematical reasoning across difficulty levels, from competition problems to graduate coursework. These are clean, objective evaluations where there's no ambiguity about correctness.
BBH provides breadth — 23 challenging task types that collectively measure the kind of general abstract reasoning that transfers across domains.
ARC-Challenge is the floor-setter — a model that can't score well here has fundamental reasoning gaps.
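The weighting scheme above can be sketched as a renormalized weighted mean. Note that the specific weight values below are illustrative assumptions, not the actual figures behind the published rankings; only the ordering (GPQA weighted highest, ARC-Challenge lowest) follows the text.

```python
# Hypothetical sketch of the IQ composite. The weights are illustrative
# assumptions -- the real values are not published in this section.
BENCHMARK_WEIGHTS = {
    "gpqa": 0.35,           # hardest to game, weighted highest
    "math": 0.30,           # MATH / MathArena
    "bbh": 0.25,            # breadth across 23 task types
    "arc_challenge": 0.10,  # floor-setter
}

def iq_score(results: dict[str, float]) -> float:
    """Weighted mean over the benchmark families a model has results for.

    `results` maps family name -> normalized score in [0, 1]. Missing
    families are dropped and the remaining weights renormalized, so a
    model is not penalized twice for sparse coverage (coverage instead
    lowers its confidence score; see Methodology & Confidence).
    """
    covered = {k: w for k, w in BENCHMARK_WEIGHTS.items() if k in results}
    if not covered:
        raise ValueError("no reasoning benchmark results available")
    total = sum(covered.values())
    return sum(results[k] * w / total for k, w in covered.items())
```

With full coverage this is an ordinary weighted average; with partial coverage the surviving weights are scaled up to sum to one.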
The gap between the top and bottom of the IQ rankings is larger than for any other dimension. Reasoning ability varies across models far more than EQ or Creativity do. This makes IQ the most important dimension to check when you're selecting a model for technically demanding tasks.
Current Rankings
[IQ Rankings table — Reasoning & problem-solving ability]
What the Data Shows
Several patterns stand out consistently:
Reasoning ability scales with model size more than most other dimensions. Unlike EQ (which is heavily training-distribution dependent), IQ benefits substantially from parameter count. The top IQ performers tend to be larger models, and the correlation holds more strongly than for other dimensions.
Chain-of-thought training has a large effect. Models explicitly trained with reasoning traces — where they're rewarded for showing their work before giving an answer — score meaningfully higher than architecturally equivalent models without this training. This is one of the clearest causal relationships in the benchmark data.
Frontier model improvements have been rapid. The distance between the top and median model on IQ benchmarks has compressed significantly over the past year as more labs invest in reasoning-focused training. What was exceptional performance 12 months ago is now competitive but not leading.
Open-weight models are closing the gap. The top open-weight models now approach frontier closed-model IQ performance, particularly on MATH tasks. The gap persists on GPQA-level scientific reasoning but is narrower than it was.
When IQ Is the Right Signal
IQ should be your primary ranking signal when:
- Complex multi-step analysis — Financial modeling, legal reasoning, research synthesis, any task where the model needs to chain many inferential steps without losing the thread.
- Mathematics and quantitative tasks — Data analysis, statistical interpretation, formula derivation, quantitative research assistance.
- Scientific and technical reasoning — Literature review, hypothesis generation, experiment design, technical problem-solving in science and engineering.
- Code generation for hard problems — Not boilerplate code (any model handles that), but algorithmic problem-solving, debugging complex logic, designing data structures. IQ predicts coding ability on hard problems better than coding-specific benchmarks.
- Agentic tasks with long reasoning chains — Multi-step workflows, planning tasks, and tool-use chains where errors compound. High-IQ models make fewer cascading errors.
High IQ doesn't guarantee high accuracy. A model can reason correctly but hallucinate the facts it's reasoning from. For tasks where factual grounding matters, check the Accuracy dimension alongside IQ.
Methodology & Confidence
IQ scores require multi-source benchmark coverage. Models with fewer than two reasoning benchmark families represented have lower confidence scores and may move significantly as coverage improves. The rankings above reflect current evidence — high-confidence entries (60%+) should be considered stable; lower-confidence entries may shift.
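The coverage rule described above can be sketched as follows. The exact confidence formula is an assumption for illustration; only the two-family minimum and the 60% stability cutoff come from the text.

```python
# Illustrative sketch of the coverage-based confidence rule. The linear
# formula is an assumption; the two-family minimum and the 60% stability
# threshold mirror the text above.
TOTAL_FAMILIES = 4  # GPQA, MATH/MathArena, BBH, ARC-Challenge

def coverage_confidence(n_families: int) -> float:
    """Fraction of reasoning benchmark families covered, capped at 1.0."""
    return min(n_families / TOTAL_FAMILIES, 1.0)

def is_stable(confidence: float) -> bool:
    """Entries at 60%+ confidence are treated as stable rankings."""
    return confidence >= 0.60
```

Under this sketch, a model covered by three of the four families sits above the stability cutoff, while a single-family model does not.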
Full methodology at /methodology.