BasedAGI
Use Case Report

Best LLMs for Contract Review

Contract review is one of the highest-value AI use cases in enterprise — and one where the failure modes are most consequential. A model that misreads a limitation of liability clause, misses a perpetual license grant buried in a definitions section, or fails to flag an unusual indemnification provision isn't just unhelpful; it's a source of legal and financial exposure. The combination of high value and high stakes makes model selection for contract review more important than for most use cases.

The models that perform well here share a specific profile: strong long-document reasoning, precise language interpretation, domain knowledge of commercial contract structures, and calibrated uncertainty — they know when to flag something as unusual rather than silently passing it.

What Contract Review Actually Requires

Clause-level precision. A contract review model isn't summarizing a document — it's making judgments about specific provisions. Does this indemnification clause cover third-party IP claims? Is this limitation of liability mutual or one-sided? Does this assignment provision restrict change of control? Getting these right requires parsing dense, carefully constructed legal language at a level of precision that separates strong models from mediocre ones.

Document structure awareness. Commercial contracts are not linear documents. Defined terms established in section 1 govern interpretation throughout. Carve-outs in one section interact with obligations in another. Exhibits modify the base agreement. Models with strong long-context reasoning can track these cross-references; models that process contracts as flat text miss them.
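One lightweight way to give a model this structural context is to extract the defined terms up front and pass them alongside each section under review. A minimal sketch, assuming defined terms follow the common drafting convention of a quoted, capitalized phrase followed by "means" or "shall mean" (real contracts vary, so treat the regex as a starting point, not a parser):

```python
import re

def extract_defined_terms(contract_text: str) -> dict[str, str]:
    """Map each defined term to the sentence that defines it.

    Assumes the common drafting convention: a quoted, capitalized
    phrase followed by "means" or "shall mean".
    """
    pattern = r'"([A-Z][A-Za-z0-9 \-]*)"\s+(?:means|shall mean)\s+([^.]+\.)'
    return {term: definition.strip()
            for term, definition in re.findall(pattern, contract_text)}

sample = (
    '"Confidential Information" means any non-public information '
    'disclosed by either party. "Deliverables" shall mean the work '
    'product described in Exhibit A.'
)
terms = extract_defined_terms(sample)
# terms now maps "Confidential Information" and "Deliverables" to their
# defining sentences, ready to prepend to section-level review prompts.
```

Prepending this glossary to each prompt lets the model interpret a section's obligations using definitions established elsewhere in the document.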

Deviation detection. The core of contract review is not reading what's there — it's identifying what deviates from your standard position. This requires the model to understand both the specific contract and the baseline expectations for a contract of that type. Models trained with legal domain knowledge perform significantly better here than general-purpose models.

Risk calibration. Good contract review doesn't flag every non-standard clause as a risk — it distinguishes between unusual language that's acceptable, unusual language that warrants discussion, and unusual language that's a genuine problem. Models that over-flag create noise that erodes the value of the review. Models that under-flag create false confidence.
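The three-way distinction can be made explicit in the output schema rather than left to the model's discretion. A sketch of one such taxonomy (the tier names and prompt wording are illustrative, not from any particular product):

```python
from enum import Enum

class RiskTier(Enum):
    ACCEPTABLE = "acceptable"  # non-standard but within tolerance
    DISCUSS = "discuss"        # warrants negotiation or escalation
    PROBLEM = "problem"        # genuine risk; do not sign as written

TIER_INSTRUCTION = (
    "For each flagged clause, assign exactly one tier: "
    + ", ".join(t.value for t in RiskTier)
    + ". Reserve 'problem' for clauses that create material exposure; "
    "do not flag standard market language at all."
)
```

Forcing a closed set of tiers makes over-flagging visible in review metrics, since every flag must claim a severity that can later be checked against counsel's judgment.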

AI contract review is not a substitute for attorney review in high-stakes transactions. The models ranked here perform well at identifying common risk patterns and extracting clause-level information — they are tools for augmenting legal teams, not replacing legal judgment. Always have material contracts reviewed by qualified counsel.

Current Rankings

Contract redline summary · legal · Top 15 · Live (limited data)

#   Model                                  Model ID                                        Score
1   gemini-2.5-pro                         external/google/gemini-2-5-pro                   32.8
2   gpt-5-2025-08-07                       external/openai/gpt-5-2025-08-07                 31.6
3   gpt-5-mini-2025-08-07                  external/openai/gpt-5-mini-2025-08-07            31.5
4   google/gemini-3.1-pro-preview          external/google/gemini-3-1-pro-preview           30.5
5   gemini-3-pro-preview                   external/google/gemini-3-pro-preview             28.1
6   anthropic/claude-sonnet-4              external/anthropic/claude-sonnet-4               25.9
7   gemini-2.5-flash                       external/google/gemini-2-5-flash                 23.2
8   gpt-4.1-20250414                       external/openai/gpt-4-1-20250414                 23.1
9   Grok-4-0709                            external/xai/grok-4-0709                         22.9
10  gemini-3-flash-preview                 external/google/gemini-3-flash-preview           22.8
11  anthropic/claude-sonnet-4.6            external/anthropic/claude-sonnet-4-6             22.2
12  google/gemini-3.1-flash-lite-preview   external/google/gemini-3-1-flash-lite-preview    21.5
13  openai/gpt-5.4-2026-03-05              external/openai/gpt-5-4-2026-03-05               21.3
14  gpt-5.2-2025-12-11                     external/openai/gpt-5-2-2025-12-11               21.0
15  claude-opus-4-5-20251101               external/anthropic/claude-opus-4-5-20251101      19.9

What the Data Shows

Accuracy dimension strongly predicts contract review performance. Models with high Accuracy scores (factual reliability and hallucination resistance) consistently perform better on contract review than their general coding or reasoning scores would suggest. The failure mode most damaging in legal contexts — confidently stating something incorrect — is exactly what Accuracy measures.

Long context handling is a stronger predictor than raw intelligence. For most real-world contracts, an average model that maintains accuracy across a 128K-token document outperforms a nominally stronger model whose performance degrades over long inputs. Test your candidate model on actual contracts before relying on benchmark scores alone.
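Testing on actual contracts can be as cheap as scoring a candidate model against a small attorney-labeled set. A minimal sketch, where `predicted_flags` would come from the model under test and `gold_flags` from counsel's review (both inputs here are hypothetical clause identifiers):

```python
def score_review(predicted_flags: set[str],
                 gold_flags: set[str]) -> tuple[float, float]:
    """Precision and recall of model-flagged clauses vs. attorney labels."""
    tp = len(predicted_flags & gold_flags)
    precision = tp / len(predicted_flags) if predicted_flags else 0.0
    recall = tp / len(gold_flags) if gold_flags else 0.0
    return precision, recall

# Example: model flagged two clauses, attorneys flagged two; one overlaps.
p, r = score_review({"s7.2-lol", "s9.1-indemnity"},
                    {"s9.1-indemnity", "s12-assignment"})
# p == 0.5, r == 0.5
```

For contract review, recall on the gold set usually matters more than precision: a missed problem clause is costlier than a spurious flag.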

Smaller specialist models sometimes beat larger generalists. Models fine-tuned specifically on legal documents and contract structures can outperform larger general-purpose models on this task. The domain is constrained enough that specialization provides real lift.

Practical Deployment Notes

Structure your prompt around specific review objectives. "Review this contract" produces worse results than "Review this contract and identify: (1) any limitation of liability provisions and whether they are mutual, (2) any non-standard indemnification language, (3) any IP assignment provisions that could affect our ownership of deliverables." Specificity dramatically improves precision.
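The objective-driven framing above is straightforward to template. A sketch (the objective list and prompt wording are examples, not a canonical prompt):

```python
def build_review_prompt(contract_text: str, objectives: list[str]) -> str:
    """Assemble a review prompt with numbered, explicit objectives."""
    numbered = "\n".join(f"({i}) {obj}" for i, obj in enumerate(objectives, 1))
    return (
        "Review the contract below and identify:\n"
        f"{numbered}\n"
        "Quote the relevant language verbatim and cite the section number "
        "for each finding. If a requested provision is absent, say so "
        "explicitly rather than omitting it.\n\n"
        f"--- CONTRACT ---\n{contract_text}"
    )

objectives = [
    "any limitation of liability provisions and whether they are mutual",
    "any non-standard indemnification language",
    "any IP assignment provisions that could affect ownership of deliverables",
]
prompt = build_review_prompt("...contract text...", objectives)
```

Asking the model to state when a provision is absent closes a common gap: silence about a missing clause is ambiguous between "not present" and "not checked."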

Use clause-level extraction before risk assessment. A two-step workflow — first extract all provisions of a given type, then assess each one — outperforms single-pass review. The extraction step is more reliable, and the assessment step works better with clean, isolated input.
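A sketch of the two-step pattern, written against a generic `llm` callable so any completion API can be plugged in (the function names and prompt wording are illustrative):

```python
from typing import Callable

def extract_clauses(llm: Callable[[str], str], contract: str,
                    clause_type: str) -> list[str]:
    """Step 1: pull every provision of one type, verbatim, one per line."""
    out = llm(
        f"List every {clause_type} provision in the contract below, "
        "quoted verbatim, one per line. Output only the quotes.\n\n" + contract
    )
    return [line.strip() for line in out.splitlines() if line.strip()]

def assess_clause(llm: Callable[[str], str], clause: str,
                  clause_type: str) -> str:
    """Step 2: assess one extracted clause in isolation."""
    return llm(
        f"Assess this {clause_type} clause. Is it standard, and if not, "
        f"what is the risk?\n\n{clause}"
    )

def review(llm: Callable[[str], str], contract: str,
           clause_types: list[str]) -> dict[str, list[str]]:
    """Run extraction then per-clause assessment for each clause type."""
    return {
        ct: [assess_clause(llm, c, ct)
             for c in extract_clauses(llm, contract, ct)]
        for ct in clause_types
    }
```

Because each assessment call sees only one clause, failures stay localized and each judgment can be traced back to the exact language it was made about.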

Build a deviation library. The most effective contract AI deployments maintain a library of "standard" and "acceptable" language for key clause types. This gives the model explicit comparison anchors rather than asking it to infer your standards.
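A deviation library can start as a simple mapping from clause type to fallback positions, spliced into the assessment prompt as an explicit anchor. A sketch with made-up standard language (a real library would hold your organization's negotiated positions):

```python
# Illustrative entries only; populate with your own standard positions.
DEVIATION_LIBRARY = {
    "limitation_of_liability": {
        "standard": "Each party's aggregate liability is capped at fees "
                    "paid in the twelve months preceding the claim.",
        "acceptable": ["Cap at 2x fees for breaches of confidentiality."],
    },
}

def build_deviation_prompt(clause_type: str, clause_text: str) -> str:
    """Frame assessment as comparison against explicit anchors."""
    entry = DEVIATION_LIBRARY[clause_type]
    acceptable = "\n".join(f"- {alt}" for alt in entry["acceptable"])
    return (
        f"Our standard {clause_type} position:\n{entry['standard']}\n\n"
        f"Pre-approved variants:\n{acceptable}\n\n"
        "Does the clause below match the standard position or a variant? "
        "If not, describe the deviation and who it favors.\n\n"
        f"{clause_text}"
    )
```

Giving the model concrete anchor text turns "is this unusual?" (which depends on the model's priors) into "does this match?" (which depends only on the documents in the prompt).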

Full methodology at /methodology.
