Policy wording comparison
Compare policy wording against a standard and flag material differences.
Provisional leader
gemini-2.5-pro
Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.
Best benchmark score: 31.9%
Confidence: 45.4%
All ranked models: top 3
Ranked Models: 30
Evidence Quality: 83%
Evidence Points: 30
Top Signal: FACTS Benchmark Suite (facts_grounding_score_pct)
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| 🥇 | gemini-2.5-pro | FACTS Benchmark Suite facts_grounding_score_pct; Vectara HHEM Leaderboard overall_hallucination_error_pct | 31.9% |
| 🥈 | gemini-3.1-pro-preview | Vals Finance Agent overall_accuracy_pct; FACTS Benchmark Suite facts_search_score_pct | 28.5% |
| 🥉 | gpt-5-2025-08-07 | FACTS Benchmark Suite facts_grounding_score_pct; LEXam Leaderboard average_score_pct | 27.9% |
| #4 | gpt-5-mini-2025-08-07 | LEXam Leaderboard average_score_pct; Vals Finance Agent overall_accuracy_pct | 26.1% |
| #5 | Grok-4-0709 | Vals CorpFin v2 overall_accuracy_pct; Vals Finance Agent overall_accuracy_pct | 25.2% |
| #6 | gemini-3-pro-preview | Vals Finance Agent overall_accuracy_pct; Vals CorpFin v2 overall_accuracy_pct | 24.9% |
| #7 | gpt-4.1-20250414 | Vectara HHEM Leaderboard overall_hallucination_error_pct; Galileo Agent Leaderboard v2 Insurance AC | 22.9% |
| #8 | claude-sonnet-4 | Galileo Agent Leaderboard v2 Insurance TSQ; Vectara HHEM Leaderboard overall_hallucination_error_pct | 21.9% |
| #9 | gemini-3-flash-preview | Vals CorpFin v2 overall_accuracy_pct; FACTS Benchmark Suite facts_grounding_score_pct | 21.8% |
| #10 | gpt-5.2-2025-12-11 | FACTS Benchmark Suite facts_grounding_score_pct; Vals CorpFin v2 overall_accuracy_pct | 21.0% |
| #11 | gemini-3.1-flash-lite-preview | FACTS Benchmark Suite facts_grounding_score_pct; Vectara HHEM Leaderboard overall_hallucination_error_pct | 20.0% |
| #12 | claude-sonnet-4.6 | Vals Finance Agent overall_accuracy_pct; Vals CorpFin v2 overall_accuracy_pct | 19.8% |
| #13 | gemini-2.5-flash | FACTS Benchmark Suite facts_grounding_score_pct; Vectara HHEM Leaderboard overall_hallucination_error_pct | 18.7% |
| #14 | gpt-5.4-2026-03-05 | Vectara HHEM Leaderboard overall_hallucination_error_pct; Vals CorpFin v2 overall_accuracy_pct | 18.7% |
| #15 | claude-opus-4-5-20251101 | FACTS Benchmark Suite facts_grounding_score_pct; Vals CorpFin v2 overall_accuracy_pct | 18.2% |
| #16 | gpt-5.1-2025-11-13 | Vals Finance Agent overall_accuracy_pct; Vals CorpFin v2 overall_accuracy_pct | 16.8% |
| #17 | grok-4-fast-reasoning | Vals CorpFin v2 overall_accuracy_pct; Vals Finance Agent overall_accuracy_pct | 16.7% |
| #18 | o3-20250416 | Vals CorpFin v2 overall_accuracy_pct; SciArena Leaderboard rating_elo | 16.5% |
| #19 | grok-4-1-fast-reasoning | Vals CorpFin v2 overall_accuracy_pct; Vals Finance Agent overall_accuracy_pct | 15.0% |
| #20 | claude-opus-4-6-thinking | Vals CorpFin v2 overall_accuracy_pct; Vals Finance Agent overall_accuracy_pct | 13.9% |
| #21 | kimi-k2.5-thinking | Vals CorpFin v2 overall_accuracy_pct; Vals Finance Agent overall_accuracy_pct | 13.2% |
| #22 | deepseek-r1 | SYCON Bench (Table 2) sycon_unethical_tof_pct; LEXam Leaderboard average_score_pct | 13.1% |
| #23 | grok-3 | Vectara HHEM Leaderboard overall_hallucination_error_pct; Vals CorpFin v2 overall_accuracy_pct | 12.9% |
| #24 | claude-opus-4-5-20251101-thinking | Vals Finance Agent overall_accuracy_pct; Vals CorpFin v2 overall_accuracy_pct | 12.8% |
| #25 | qwen-2.5-72b-instruct | Galileo Agent Leaderboard v2 Insurance AC; DuckDB NSQL Leaderboard all_execution_accuracy | 12.8% |
| #26 | claude-sonnet-4-5-20250929-thinking | Vals Finance Agent overall_accuracy_pct; Vals CorpFin v2 overall_accuracy_pct | 12.0% |
| #27 | deepseek-v3 | Vectara HHEM Leaderboard overall_hallucination_error_pct; Vectara HHEM Leaderboard overall_answer_rate_pct | 11.9% |
| #29 | grok-4.20-0309-reasoning | Vals CorpFin v2 overall_accuracy_pct; Vals Finance Agent overall_accuracy_pct | 11.7% |
| #30 | glm-5-thinking | Vals CorpFin v2 overall_accuracy_pct; Vals Finance Agent overall_accuracy_pct | 11.7% |
| #31 | gemini-2.5-flash-lite | Vectara HHEM Leaderboard overall_hallucination_error_pct; Galileo Agent Leaderboard v2 Insurance AC | 11.4% |
Ranking diagnostics & missing models
Source lift
Ranked: 57
Sources: 8
Quality: Low
Vals CorpFin v2
Vals Tax Eval v2
Vals Legal Bench
Vals MedQA
Missing frontier models
No obvious gaps right now.
Taxonomy & task details
Core tasks
Required modes
Domains
Related in Insurance
Litigation risk memo
Summarize a claim into litigation risk drivers and mitigation steps.
Fraud signal summary
Summarize potential fraud indicators with conservative evidence framing.
Claims summary
Summarize claim history into timeline, status, and open items.
Underwriting submission ingest
Convert messy submission docs into structured underwriting fields.