Legal

Regulatory summary

Summarize and compare regulatory text with conservative interpretation.

task.summarize_doc, task.claim_check_with_evidence

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong claim about which model is best.

Provisional leader

gemini-3.1-pro-preview

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 31.8%

Confidence: 36.4%

All ranked models: top 3

🥇 gemini-3.1-pro-preview: 31.8%
🥈 gemini-2.5-pro: 31.0%
🥉 gpt-5-mini-2025-08-07: 30.0%
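To make the "directional" caveat concrete, the sketch below computes the point gaps between adjacent models in the top three, using only the scores listed on this page. The helper name `score_gaps` is hypothetical and is not part of the leaderboard itself.

```python
# Top-three scores as listed on this page (percent).
top_three = [
    ("gemini-3.1-pro-preview", 31.8),
    ("gemini-2.5-pro", 31.0),
    ("gpt-5-mini-2025-08-07", 30.0),
]


def score_gaps(ranked):
    """Return (higher, lower, gap in points) for each adjacent pair."""
    return [
        (a_name, b_name, round(a - b, 1))
        for (a_name, a), (b_name, b) in zip(ranked, ranked[1:])
    ]


for higher, lower, gap in score_gaps(top_three):
    print(f"{higher} leads {lower} by {gap} points")
```

The gaps come out to about one point or less, which is small next to the confidence values reported for these models, so the ordering is best read as tentative.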

Ranked Models

Evidence Quality

84%

Evidence Points

Top Signal

SimpleQA Verified: simpleqa_verified_score_pct

All Ranked Models

30 of 30 models

Rank	Model (strongest signals)	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-3.1-pro-preview (strong on SimpleQA Verified simpleqa_verified_score_pct and FACTS Benchmark Suite facts_grounding_score_pct)	31.8%	36%	$4.50	SimpleQA Verified, FACTS Benchmark Suite
🥈	gemini-2.5-pro (strong on FACTS Benchmark Suite facts_grounding_score_pct and LEXam Leaderboard average_score_pct)	31.0%	49%	$3.44	FACTS Benchmark Suite, LEXam Leaderboard
🥉	gpt-5-mini-2025-08-07 (strong on Vals Case Law v2 overall_accuracy_pct and FACTS Benchmark Suite facts_grounding_score_pct)	30.0%	44%	n/a	Vals Case Law v2, FACTS Benchmark Suite
#4	gpt-5-2025-08-07 (strong on FACTS Benchmark Suite facts_grounding_score_pct and LEXam Leaderboard average_score_pct)	29.5%	38%	n/a	FACTS Benchmark Suite, LEXam Leaderboard
#5	claude-sonnet-4 (strong on Galileo Agent Leaderboard v2 Avg TSQ and Vals Legal Bench overall_accuracy_pct)	27.0%	38%	$6.00	Galileo Agent Leaderboard v2, Vals Legal Bench
#6	gemini-3-pro-preview (strong on SimpleQA Verified simpleqa_verified_score_pct and Vals Legal Bench overall_accuracy_pct)	26.7%	37%	$4.50	SimpleQA Verified, Vals Legal Bench
#7	gpt-4.1-20250414 (strong on MMLongBench-Doc Leaderboard acc_score_pct and Vals Case Law v2 overall_accuracy_pct)	25.6%	37%	n/a	MMLongBench-Doc Leaderboard, Vals Case Law v2
#8	gemini-2.5-flash (strong on FACTS Benchmark Suite facts_grounding_score_pct and Galileo Agent Leaderboard v2 Avg TSQ)	24.2%	35%	$0.17	FACTS Benchmark Suite, Galileo Agent Leaderboard v2
#9	Grok-4-0709 (strong on Vals Legal Bench overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct)	23.9%	34%	n/a	Vals Legal Bench, Vals Case Law v2
#10	gemini-3-flash-preview (strong on Vals Legal Bench overall_accuracy_pct and FACTS Benchmark Suite facts_grounding_score_pct)	23.8%	33%	$1.13	Vals Legal Bench, FACTS Benchmark Suite
#11	claude-sonnet-4.6 (strong on Vals Finance Agent overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct)	23.1%	30%	$6.00	Vals Finance Agent, Vals Legal Bench
#12	gemini-3.1-flash-lite-preview (strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals Legal Bench overall_accuracy_pct)	22.4%	32%	$0.56	FACTS Benchmark Suite, Vals Legal Bench
#13	gpt-5.4-2026-03-05 (strong on Vals Legal Bench overall_accuracy_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct)	22.2%	28%	n/a	Vals Legal Bench, Vectara HHEM Leaderboard
#14	gpt-5.2-2025-12-11 (strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals Legal Bench overall_accuracy_pct)	22.0%	27%	n/a	FACTS Benchmark Suite, Vals Legal Bench
#15	claude-opus-4-5-20251101 (strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals Legal Bench overall_accuracy_pct)	20.7%	30%	n/a	FACTS Benchmark Suite, Vals Legal Bench
#16	grok-4-fast-reasoning (strong on Vals CorpFin v2 overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct)	20.2%	36%	$0.28	Vals CorpFin v2, Vals Legal Bench
#17	gpt-5.1-2025-11-13 (strong on Vals Case Law v2 overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct)	19.4%	28%	n/a	Vals Case Law v2, Vals Legal Bench
#18	grok-4-1-fast-reasoning (strong on Vals Legal Bench overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct)	17.9%	26%	$0.28	Vals Legal Bench, Vals CorpFin v2
#19	o3-20250416 (strong on Vals Legal Bench overall_accuracy_pct and SimpleQA Verified simpleqa_verified_score_pct)	16.4%	26%	$3.50	Vals Legal Bench, SimpleQA Verified
#20	grok-3 (strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals Legal Bench overall_accuracy_pct)	15.6%	21%	$6.00	Vectara HHEM Leaderboard, Vals Legal Bench
#21	deepseek-r1 (strong on SYCON Bench (Table 2) sycon_unethical_tof_pct and LEXam Leaderboard average_score_pct)	15.0%	24%	$0.27	SYCON Bench (Table 2), LEXam Leaderboard
#22	claude-opus-4-6-thinking (strong on Vals Legal Bench overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct)	14.8%	17%	n/a	Vals Legal Bench, Vals CorpFin v2
#23	mistral-large-2512 (strong on Vals Legal Bench overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct)	14.4%	24%	n/a	Vals Legal Bench, Vals CorpFin v2
#24	claude-opus-4-1-20250805 (strong on Vals Legal Bench overall_accuracy_pct and FACTS Benchmark Suite facts_grounding_score_pct)	14.2%	24%	n/a	Vals Legal Bench, FACTS Benchmark Suite
#25	claude-opus-4-5-20251101-thinking (strong on Vals Legal Bench overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct)	14.0%	17%	n/a	Vals Legal Bench, Vals Finance Agent
#26	gpt-4.1 (strong on LEXam Leaderboard average_score_pct and LanguageBench translation_to:bleu)	13.4%	17%	$3.50	LEXam Leaderboard, LanguageBench
#27	claude-sonnet-4-5-20250929-thinking (strong on Vals Legal Bench overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct)	13.2%	17%	n/a	Vals Legal Bench, Vals Finance Agent
#28	grok-4-1-fast-non-reasoning (strong on Vals Legal Bench overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct)	13.2%	23%	$0.28	Vals Legal Bench, Vals Finance Agent
#31	glm-5-thinking (strong on Vals Legal Bench overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct)	12.5%	20%	n/a	Vals Legal Bench, Vals CorpFin v2
#32	deepseek-v3 (strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and SYCON Bench (Table 2) sycon_unethical_tof_pct)	12.3%	19%	n/a	Vectara HHEM Leaderboard, SYCON Bench (Table 2)
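Because the score differences in the table are small, one exploratory reading is to ask which priced model is cheapest while staying within some tolerance of the top score. The sketch below does that over the first eight rows of the table. The `Row` type, the `cheapest_within` helper, and the tolerance values are illustrative assumptions, not something the leaderboard defines.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    score_pct: float
    confidence_pct: float
    price_per_1m: Optional[float]  # None where the table lists no price


# First eight rows of the table above, copied as-is.
ROWS = [
    Row("gemini-3.1-pro-preview", 31.8, 36, 4.50),
    Row("gemini-2.5-pro", 31.0, 49, 3.44),
    Row("gpt-5-mini-2025-08-07", 30.0, 44, None),
    Row("gpt-5-2025-08-07", 29.5, 38, None),
    Row("claude-sonnet-4", 27.0, 38, 6.00),
    Row("gemini-3-pro-preview", 26.7, 37, 4.50),
    Row("gpt-4.1-20250414", 25.6, 37, None),
    Row("gemini-2.5-flash", 24.2, 35, 0.17),
]


def cheapest_within(rows, max_gap_pts):
    """Cheapest priced model whose score is within max_gap_pts of the best score.

    max_gap_pts is a hypothetical tolerance chosen by the reader.
    Models without a listed price are skipped. Returns None if nothing qualifies.
    """
    best = max(r.score_pct for r in rows)
    candidates = [
        r for r in rows
        if r.price_per_1m is not None and best - r.score_pct <= max_gap_pts
    ]
    return min(candidates, key=lambda r: r.price_per_1m) if candidates else None


print(cheapest_within(ROWS, 1.0).model)  # gemini-2.5-pro
print(cheapest_within(ROWS, 8.0).model)  # gemini-2.5-flash
```

Rows without a listed price are excluded rather than treated as free, so the result only compares models whose price appears in the table.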

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals CorpFin v2: 44 rows · 1.1% avg lift

Vals Legal Bench: 44 rows · 1.8% avg lift

Vals Finance Agent: 31 rows · 1.1% avg lift

Vals Case Law v2: 30 rows · 1.3% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.summarize_doc, task.claim_check_with_evidence

Required modes

mode.long_context, mode.citations

Domains

domain.legal_regulatory

Related in Legal

Contract Drafting & Redlining

Drafting, reviewing, and suggesting edits to legal contracts and agreements.

Contract Q&A (RAG grounded)

Answer contract questions grounded in the actual contract text.

Contract redline summary

Summarize material changes between contract versions with clause refs.

Clause playbook check

Check extracted terms against a playbook and flag deviations.