Disinformation and manipulation resistance (eval)
Measure refusal and safe handling of requests to generate deceptive content.
Provisional leader
gemini-3.1-pro-preview
Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.
Best benchmark score: 33.2%
Confidence: 38.1%
All ranked models: top 3
Ranked Models: 30
Evidence Quality: 83%
Evidence Points: 23
Top Signal: Vals Finance Agent: overall_accuracy_pct
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| 🥇 | gemini-3.1-pro-preview | Vals Finance Agent overall_accuracy_pct and FACTS Benchmark Suite facts_search_score_pct | 33.2% |
| 🥈 | gemini-2.5-pro | FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct | 28.0% |
| 🥉 | gpt-5-2025-08-07 | FACTS Benchmark Suite facts_grounding_score_pct and Vals Finance Agent overall_accuracy_pct | 27.3% |
| #4 | gemini-3-flash-preview | Vals CorpFin v2 overall_accuracy_pct and FACTS Benchmark Suite facts_grounding_score_pct | 26.2% |
| #5 | gpt-5-mini-2025-08-07 | Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct | 25.2% |
| #6 | Grok-4-0709 | Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct | 24.3% |
| #7 | claude-sonnet-4.6 | Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct | 23.9% |
| #8 | gemini-3-pro-preview | Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct | 23.3% |
| #9 | gpt-5.2-2025-12-11 | FACTS Benchmark Suite facts_grounding_score_pct and Vals CorpFin v2 overall_accuracy_pct | 22.5% |
| #10 | gemini-3.1-flash-lite-preview | FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct | 22.4% |
| #11 | gpt-5.4-2026-03-05 | Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct | 21.4% |
| #12 | gpt-4.1-20250414 | Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct | 21.0% |
| #13 | gpt-5.1-2025-11-13 | Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct | 20.5% |
| #14 | claude-opus-4-5-20251101 | FACTS Benchmark Suite facts_grounding_score_pct and Vals CorpFin v2 overall_accuracy_pct | 20.0% |
| #15 | grok-4-fast-reasoning | Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct | 19.9% |
| #16 | claude-sonnet-4 | Vectara HHEM Leaderboard overall_hallucination_error_pct and FACTS Benchmark Suite facts_grounding_score_pct | 19.8% |
| #17 | o3-20250416 | Vals CorpFin v2 overall_accuracy_pct and SciArena Leaderboard rating_elo | 19.3% |
| #18 | grok-4-1-fast-reasoning | Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct | 16.9% |
| #19 | gemini-2.5-flash | FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct | 16.3% |
| #20 | kimi-k2.5-thinking | Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct | 15.9% |
| #21 | grok-3 | Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct | 15.9% |
| #22 | claude-opus-4-6-thinking | Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct | 15.5% |
| #23 | claude-opus-4-5-20251101-thinking | Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct | 14.4% |
| #24 | claude-sonnet-4-5-20250929-thinking | Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct | 13.5% |
| #26 | grok-4.20-0309-reasoning | Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct | 13.2% |
| #27 | glm-5-thinking | Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct | 13.1% |
| #28 | gpt-4o-2024-05-13 | LLM Trustworthy Leaderboard privacy and LLM Trustworthy Leaderboard adv | 12.9% |
| #29 | o4-mini | Vals CorpFin v2 overall_accuracy_pct and Vals CorpFin v2 shared_max_context_accuracy_pct | 12.9% |
| #30 | Llama-2-7b-chat-hf | LLM Trustworthy Leaderboard fairness and LLM Trustworthy Leaderboard privacy | 12.7% |
| #31 | Kimi K2 Thinking | Vals CorpFin v2 overall_accuracy_pct and Vals CorpFin v2 shared_max_context_accuracy_pct | 12.6% |
Ranking diagnostics & missing models
Source lift
Ranked: 62
Sources: 8
Quality: Low
Vals CorpFin v2
Vals MedQA
Vals Tax Eval v2
Vals Legal Bench
Missing frontier models
No obvious gaps right now.
Taxonomy & task details
Core tasks
Required modes
Domains
Related in Risk & Eval
Crisis escalation protocol (eval)
Measure safe crisis escalation behavior under the selected policy.
Jailbreak resistance (eval)
Measure robustness to adversarial prompts that attempt to bypass policy.
Overrefusal (eval)
Measure how often benign requests are incorrectly refused.
Refusal profile (eval)
Measure refusal/overrefusal rates across predefined categories.
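For illustration only, here is a minimal Python sketch of how per-category refusal and overrefusal rates might be computed from labeled eval records. The page does not describe its scoring method, so the `Record` fields, the `refusal_profile` helper, and the sample category name are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One graded eval prompt (hypothetical schema, not taken from the page)."""
    category: str   # predefined category label, e.g. "disinformation"
    benign: bool    # True if the request should have been answered
    refused: bool   # True if the model declined to answer


def refusal_profile(records):
    """Return per-category refusal and overrefusal rates.

    refusal_rate: share of non-benign prompts that were refused.
    overrefusal_rate: share of benign prompts that were refused.
    A rate is None when the category has no prompts of that kind.
    """
    counts = defaultdict(lambda: {"harmful": 0, "harmful_refused": 0,
                                  "benign": 0, "benign_refused": 0})
    for r in records:
        c = counts[r.category]
        kind = "benign" if r.benign else "harmful"
        c[kind] += 1
        if r.refused:
            c[kind + "_refused"] += 1

    profile = {}
    for category, c in counts.items():
        profile[category] = {
            "refusal_rate": (c["harmful_refused"] / c["harmful"]
                             if c["harmful"] else None),
            "overrefusal_rate": (c["benign_refused"] / c["benign"]
                                 if c["benign"] else None),
        }
    return profile


# Made-up sample data, purely to show the call shape.
sample = [
    Record("disinformation", benign=False, refused=True),
    Record("disinformation", benign=False, refused=False),
    Record("disinformation", benign=True, refused=True),
    Record("disinformation", benign=True, refused=False),
    Record("disinformation", benign=True, refused=False),
    Record("disinformation", benign=True, refused=False),
]
profile = refusal_profile(sample)
```

Keeping the two rates separate per category reflects the split between the Refusal profile and Overrefusal evals listed above: a model can score well on one while doing poorly on the other.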