Insurance

Policy wording comparison

Compare policy wording against a standard and flag material differences.

task.compare_docs_diff · task.contract_term_extraction
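To make the task concrete, here is a minimal sketch of clause-level comparison against a standard wording, using Python's standard `difflib`. The clause IDs, sample wording, function name, and the 0.9 similarity threshold are all assumptions for illustration. This is not how any of the benchmarks below score models, and surface similarity alone does not decide whether a difference is material.

```python
import difflib

def flag_differences(standard, policy, threshold=0.9):
    """Return clauses that are added, missing, or reworded relative to the standard.

    `standard` and `policy` map a clause ID to its wording (hypothetical layout).
    """
    findings = []
    for clause_id in sorted(set(standard) | set(policy)):
        std_text = standard.get(clause_id)
        pol_text = policy.get(clause_id)
        if std_text is None:
            findings.append({"clause": clause_id, "status": "added", "similarity": 0.0})
        elif pol_text is None:
            findings.append({"clause": clause_id, "status": "missing", "similarity": 0.0})
        else:
            # Compare word sequences so small punctuation shifts matter less.
            ratio = difflib.SequenceMatcher(None, std_text.split(), pol_text.split()).ratio()
            if ratio < threshold:
                findings.append(
                    {"clause": clause_id, "status": "changed", "similarity": round(ratio, 2)}
                )
    return findings

# Made-up sample wording, not taken from any real policy.
standard = {
    "4.1": "We will pay for loss caused by fire.",
    "4.2": "Flood damage is excluded.",
}
policy = {
    "4.1": "We will pay for loss caused by fire.",
    "4.2": "Flood damage is covered up to the sublimit.",
    "4.3": "Cyber incidents are excluded.",
}

findings = flag_differences(standard, policy)
for finding in findings:
    print(finding)
```

In this sample, clause 4.2 is flagged as changed and clause 4.3 as added. A lexical diff like this only locates candidate differences; judging whether a change from "excluded" to "covered" is material is the part of the task the ranked models are being considered for.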

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong claim about which model is best.

Provisional leader

gemini-2.5-pro

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

31.9%

Best benchmark score

45.4%

Confidence

All ranked models — top 3

🥇

gemini-2.5-pro

31.9%

🥈

gemini-3.1-pro-preview

28.5%

🥉

gpt-5-2025-08-07

27.9%

Ranked Models

Evidence Quality

83%

Evidence Points

Top Signal

FACTS Benchmark Suite: facts_grounding_score_pct

All Ranked Models

30 of 30 models

Rank	Model	Strong on	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-2.5-pro	FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	31.9%	45%	$3.44	FACTS Benchmark Suite, Vectara HHEM Leaderboard
🥈	gemini-3.1-pro-preview	Vals Finance Agent overall_accuracy_pct and FACTS Benchmark Suite facts_search_score_pct	28.5%	32%	$4.50	Vals Finance Agent, FACTS Benchmark Suite
🥉	gpt-5-2025-08-07	FACTS Benchmark Suite facts_grounding_score_pct and LEXam Leaderboard average_score_pct	27.9%	36%	—	FACTS Benchmark Suite, LEXam Leaderboard
#4	gpt-5-mini-2025-08-07	LEXam Leaderboard average_score_pct and Vals Finance Agent overall_accuracy_pct	26.1%	40%	—	LEXam Leaderboard, Vals Finance Agent
#5	Grok-4-0709	Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	25.2%	37%	—	Vals CorpFin v2, Vals Finance Agent
#6	gemini-3-pro-preview	Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	24.9%	34%	$4.50	Vals Finance Agent, Vals CorpFin v2
#7	gpt-4.1-20250414	Vectara HHEM Leaderboard overall_hallucination_error_pct and Galileo Agent Leaderboard v2 Insurance AC	22.9%	36%	—	Vectara HHEM Leaderboard, Galileo Agent Leaderboard v2
#8	claude-sonnet-4	Galileo Agent Leaderboard v2 Insurance TSQ and Vectara HHEM Leaderboard overall_hallucination_error_pct	21.9%	35%	$6.00	Galileo Agent Leaderboard v2, Vectara HHEM Leaderboard
#9	gemini-3-flash-preview	Vals CorpFin v2 overall_accuracy_pct and FACTS Benchmark Suite facts_grounding_score_pct	21.8%	29%	$1.13	Vals CorpFin v2, FACTS Benchmark Suite
#10	gpt-5.2-2025-12-11	FACTS Benchmark Suite facts_grounding_score_pct and Vals CorpFin v2 overall_accuracy_pct	21.0%	26%	—	FACTS Benchmark Suite, Vals CorpFin v2
#11	gemini-3.1-flash-lite-preview	FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	20.0%	29%	$0.56	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#12	claude-sonnet-4.6	Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	19.8%	25%	$6.00	Vals Finance Agent, Vals CorpFin v2
#13	gemini-2.5-flash	FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	18.7%	32%	$0.17	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#14	gpt-5.4-2026-03-05	Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct	18.7%	23%	—	Vectara HHEM Leaderboard, Vals CorpFin v2
#15	claude-opus-4-5-20251101	FACTS Benchmark Suite facts_grounding_score_pct and Vals CorpFin v2 overall_accuracy_pct	18.2%	28%	—	FACTS Benchmark Suite, Vals CorpFin v2
#16	gpt-5.1-2025-11-13	Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	16.8%	26%	—	Vals Finance Agent, Vals CorpFin v2
#17	grok-4-fast-reasoning	Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	16.7%	32%	$0.28	Vals CorpFin v2, Vals Finance Agent
#18	o3-20250416	Vals CorpFin v2 overall_accuracy_pct and SciArena Leaderboard rating_elo	16.5%	27%	$3.50	Vals CorpFin v2, SciArena Leaderboard
#19	grok-4-1-fast-reasoning	Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	15.0%	23%	$0.28	Vals CorpFin v2, Vals Finance Agent
#20	claude-opus-4-6-thinking	Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	13.9%	15%	—	Vals CorpFin v2, Vals Finance Agent
#21	kimi-k2.5-thinking	Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	13.2%	18%	—	Vals CorpFin v2, Vals Finance Agent
#22	deepseek-r1	SYCON Bench (Table 2) sycon_unethical_tof_pct and LEXam Leaderboard average_score_pct	13.1%	19%	$0.27	SYCON Bench (Table 2), LEXam Leaderboard
#23	grok-3	Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct	12.9%	18%	$6.00	Vectara HHEM Leaderboard, Vals CorpFin v2
#24	claude-opus-4-5-20251101-thinking	Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	12.8%	15%	—	Vals Finance Agent, Vals CorpFin v2
#25	qwen-2.5-72b-instruct	Galileo Agent Leaderboard v2 Insurance AC and DuckDB NSQL Leaderboard all_execution_accuracy	12.8%	21%	—	Galileo Agent Leaderboard v2, DuckDB NSQL Leaderboard
#26	claude-sonnet-4-5-20250929-thinking	Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	12.0%	15%	—	Vals Finance Agent, Vals CorpFin v2
#27	deepseek-v3	Vectara HHEM Leaderboard overall_hallucination_error_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	11.9%	20%	—	Vectara HHEM Leaderboard
#29	grok-4.20-0309-reasoning	Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	11.7%	15%	—	Vals CorpFin v2, Vals Finance Agent
#30	glm-5-thinking	Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	11.7%	17%	—	Vals CorpFin v2, Vals Finance Agent
#31	gemini-2.5-flash-lite	Vectara HHEM Leaderboard overall_hallucination_error_pct and Galileo Agent Leaderboard v2 Insurance AC	11.4%	17%	$0.17	Vectara HHEM Leaderboard, Galileo Agent Leaderboard v2
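The table lists both a score and a price, so one exploratory reading is score per dollar. The sketch below copies five priced rows from the table and assumes the "Price / 1M" column is USD per 1M tokens, which the page does not state. Dividing score by price is an illustrative heuristic only, and given the limited evidence quality the resulting order should be treated as directional.

```python
# (model, score in %, price per 1M) copied from priced rows of the table above.
# Rows shown with a dash in the price column are omitted.
rows = [
    ("gemini-2.5-pro", 31.9, 3.44),
    ("gemini-3.1-pro-preview", 28.5, 4.50),
    ("claude-sonnet-4", 21.9, 6.00),
    ("gemini-3.1-flash-lite-preview", 20.0, 0.56),
    ("gemini-2.5-flash", 18.7, 0.17),
]

def score_per_dollar(rows):
    """Sort models by score divided by price, highest first."""
    ratios = [(model, score / price) for model, score, price in rows]
    return sorted(ratios, key=lambda item: item[1], reverse=True)

ranked = score_per_dollar(rows)
for model, ratio in ranked:
    print(f"{model}: {ratio:.1f} score points per dollar")
```

On these five rows the cheaper flash models come out ahead of the provisional leader, which shows how much the ordering depends on whether cost is part of the question.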

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals CorpFin v2

44 rows · 1.3% avg lift

Vals Tax Eval v2

33 rows · 0.3% avg lift

Vals Legal Bench

32 rows · 0.3% avg lift

Vals MedQA

32 rows · 0.3% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.compare_docs_diff · task.contract_term_extraction

Required modes

mode.long_context · mode.citations · mode.json_schema
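As a rough illustration of what the citation and JSON-schema modes could imply for a model's output, the sketch below validates one flagged difference and checks that its quoted passages really occur in the source texts. The field names, allowed values, and both helper functions are hypothetical; the page does not define an output schema.

```python
import json

# Hypothetical record layout for one flagged difference.
REQUIRED_FIELDS = {
    "clause_id": str,
    "change_type": str,
    "material": bool,
    "standard_quote": str,
    "policy_quote": str,
}
ALLOWED_CHANGE_TYPES = {"added", "removed", "changed"}

def validate_finding(raw):
    """Parse a JSON string and check it against the assumed record layout."""
    finding = json.loads(raw)
    if not isinstance(finding, dict):
        raise ValueError("finding must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(finding.get(name), expected):
            raise ValueError(f"field {name!r} is missing or not {expected.__name__}")
    if finding["change_type"] not in ALLOWED_CHANGE_TYPES:
        raise ValueError(f"unknown change_type {finding['change_type']!r}")
    return finding

def citations_are_grounded(finding, standard_text, policy_text):
    """True if both quotes appear verbatim in their source documents."""
    return (
        finding["standard_quote"] in standard_text
        and finding["policy_quote"] in policy_text
    )

# Made-up sample texts and model output.
standard_text = "4.2 Flood damage is excluded."
policy_text = "4.2 Flood damage is covered up to the sublimit."
raw = json.dumps({
    "clause_id": "4.2",
    "change_type": "changed",
    "material": True,
    "standard_quote": "Flood damage is excluded.",
    "policy_quote": "Flood damage is covered up to the sublimit.",
})

finding = validate_finding(raw)
grounded = citations_are_grounded(finding, standard_text, policy_text)
print(finding["clause_id"], "grounded:", grounded)
```

Checking quotes by exact substring match is deliberately strict: a paraphrased citation fails, which is usually the safer default when the output feeds an underwriting review.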

Domains

domain.insurance_underwriting

Related in Insurance

Litigation risk memo

Summarize a claim into litigation risk drivers and mitigation steps.

Fraud signal summary

Summarize potential fraud indicators with conservative evidence framing.

Claims summary

Summarize claim history into timeline, status, and open items.

Underwriting submission ingest

Convert messy submission docs into structured underwriting fields.