Jailbreak resistance (eval)

Measure robustness to adversarial prompts that attempt to bypass policy.

task.jailbreak_resistance

Evidence quality is currently limited for this use case. Treat the rankings below as exploratory rather than as evidence of a clear winner.

Provisional leader

Llama-2-7b-chat-hf

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 23.9%

Confidence: 29.8%

All ranked models: top 3

🥇	Llama-2-7b-chat-hf	23.9%
🥈	gpt-5-2025-08-07	20.7%
🥉	gpt-4o-mini-2024-07-18	20.4%

Ranked Models

Evidence Quality

81%

Evidence Points

Top Signal

LLM Trustworthy Leaderboard: fairness

All Ranked Models

30 of 30 models

Rank	Model	Strong on	Score	Confidence	Price / 1M	Evidence sources
🥇	Llama-2-7b-chat-hf	LLM Trustworthy Leaderboard fairness and LLM Trustworthy Leaderboard privacy	23.9%	30%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#4	gpt-5-2025-08-07	UGI Leaderboard Hazardous and Aider Polyglot Leaderboard percent_correct_pct	20.7%	28%	—	UGI Leaderboard, Aider Polyglot Leaderboard
#5	gpt-4o-mini-2024-07-18	LLM Trustworthy Leaderboard privacy and LLM Trustworthy Leaderboard adv	20.4%	39%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#6	Meta-Llama-3-8B-Instruct	LLM Trustworthy Leaderboard adv and LLM Trustworthy Leaderboard privacy	20.3%	31%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#7	gemini-2.5-pro	UGI Leaderboard Hazardous and Galileo Agent Leaderboard v2 Avg AC	20.3%	31%	$3.44	UGI Leaderboard, Galileo Agent Leaderboard v2
#8	gemini-3.1-pro-preview	UGI Leaderboard Hazardous and Vals Mortgage Tax overall_accuracy_pct	20.2%	23%	$4.50	UGI Leaderboard, Vals Mortgage Tax
#10	gemma-7b-it	LLM Trustworthy Leaderboard fairness and LLM Trustworthy Leaderboard privacy	19.4%	31%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#11	gemma-2b-it	LLM Trustworthy Leaderboard fairness and LLM Trustworthy Leaderboard privacy	19.4%	30%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#12	gpt-4o-2024-05-13	LLM Trustworthy Leaderboard privacy and LLM Trustworthy Leaderboard adv	19.4%	38%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#14	claude-sonnet-4	Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	18.9%	30%	$6.00	Galileo Agent Leaderboard v2, Galileo Agent Leaderboard v2
#15	Grok-4-0709	UGI Leaderboard Hazardous and Galileo Agent Leaderboard v2 Avg TSQ	18.6%	27%	—	UGI Leaderboard, Galileo Agent Leaderboard v2
#16	falcon-7b-instruct	LLM Trustworthy Leaderboard fairness and LLM Trustworthy Leaderboard privacy	18.2%	29%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#17	gpt-4.1-20250414	Galileo Agent Leaderboard v2 Avg AC and UGI Leaderboard Hazardous	17.9%	25%	—	Galileo Agent Leaderboard v2, UGI Leaderboard
#18	gemini-3-flash-preview	UGI Leaderboard Hazardous and Vals Legal Bench overall_accuracy_pct	17.8%	22%	$1.13	UGI Leaderboard, Vals Legal Bench
#19	o3-20250416	UGI Leaderboard Hazardous and Aider Polyglot Leaderboard percent_correct_pct	17.0%	23%	$3.50	UGI Leaderboard, Aider Polyglot Leaderboard
#20	zephyr-7b-beta	LLM Trustworthy Leaderboard fairness and LLM Trustworthy Leaderboard privacy	16.8%	30%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#22	claude-sonnet-4.6	UGI Leaderboard Hazardous and Vals Tax Eval v2 overall_accuracy_pct	16.3%	19%	$6.00	UGI Leaderboard, Vals Tax Eval v2
#23	gpt-5.1-2025-11-13	UGI Leaderboard Hazardous and Vals Case Law v2 overall_accuracy_pct	16.2%	20%	—	UGI Leaderboard, Vals Case Law v2
#24	gpt-5-mini-2025-08-07	Vals MedQA overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct	15.9%	24%	—	Vals MedQA, Vals LiveCodeBench
#25	gemini-3-pro-preview	Vals Mortgage Tax overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct	14.8%	22%	$4.50	Vals Mortgage Tax, Vals Legal Bench
#27	gpt-5.4-2026-03-05	Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals MedQA overall_accuracy_pct	14.4%	19%	—	Vectara HHEM Leaderboard, Vals MedQA
#28	gpt-5.2-2025-12-11	FACTS Benchmark Suite facts_grounding_score_pct and Vals Tax Eval v2 overall_accuracy_pct	14.1%	20%	—	FACTS Benchmark Suite, Vals Tax Eval v2
#29	kimi-k2.5-thinking	UGI Leaderboard Hazardous and Vals CorpFin v2 overall_accuracy_pct	13.4%	19%	—	UGI Leaderboard, Vals CorpFin v2
#30	grok-4-fast-reasoning	UGI Leaderboard Hazardous and Vals CorpFin v2 overall_accuracy_pct	13.2%	22%	$0.28	UGI Leaderboard, Vals CorpFin v2
#31	alpaca-native	LLM Trustworthy Leaderboard fairness and LLM Trustworthy Leaderboard adv	13.2%	29%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#32	o4-mini	UGI Leaderboard Hazardous and Aider Polyglot Leaderboard percent_correct_pct	13.2%	21%	$1.93	UGI Leaderboard, Aider Polyglot Leaderboard
#33	grok-4-1-fast-reasoning	UGI Leaderboard Hazardous and Vals CorpFin v2 overall_accuracy_pct	12.7%	19%	$0.28	UGI Leaderboard, Vals CorpFin v2
#34	Mistral-7B-OpenOrca	LLM Trustworthy Leaderboard privacy and LLM Trustworthy Leaderboard adv	12.5%	30%	—	LLM Trustworthy Leaderboard, LLM Trustworthy Leaderboard
#35	claude-opus-4-6-thinking	Vals SWE-bench overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct	12.3%	14%	—	Vals SWE-bench, Vals Mortgage Tax
#36	gemini-2.5-flash	Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC	12.2%	18%	$0.17	Galileo Agent Leaderboard v2, Galileo Agent Leaderboard v2
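
Because confidence is low across the table, it can help to read Score and Confidence together rather than Score alone. The snippet below is a minimal sketch of one way to do that: it keeps only rows whose confidence clears a chosen bar and then orders them by score. The 30% threshold, the `Row` type, and the `shortlist` function are illustrative assumptions, not part of the leaderboard's own method; the sample rows are copied from the table above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float       # Score column, in percent
    confidence: float  # Confidence column, in percent


# A few rows copied from the ranked-models table.
ROWS = [
    Row("Llama-2-7b-chat-hf", 23.9, 30.0),
    Row("gpt-5-2025-08-07", 20.7, 28.0),
    Row("gpt-4o-mini-2024-07-18", 20.4, 39.0),
    Row("Meta-Llama-3-8B-Instruct", 20.3, 31.0),
    Row("gemini-3.1-pro-preview", 20.2, 23.0),
]


def shortlist(rows, min_confidence=30.0):
    """Keep rows at or above min_confidence (an assumed cutoff),
    ordered by score, highest first."""
    kept = [r for r in rows if r.confidence >= min_confidence]
    return sorted(kept, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    for r in shortlist(ROWS):
        print(f"{r.model}\t{r.score:.1f}%\t{r.confidence:.0f}%")
```

With this cutoff, gpt-5-2025-08-07 (28% confidence) drops out of the shortlist even though it has the second-highest score among the sample rows, which illustrates why the ranking is described as directional.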

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals Legal Bench	44 rows · 0.6% avg lift
Vals Tax Eval v2	43 rows · 0.5% avg lift
Vals GPQA	43 rows · 0.5% avg lift
Vals LiveCodeBench	42 rows · 0.5% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks: task.jailbreak_resistance

Required modes: none

Domains: domain.general_business

Related in Risk & Eval

Disinformation and manipulation resistance (eval)

Measure refusal and safe handling of deceptive content generation requests.

Crisis escalation protocol (eval)

Measure safe crisis escalation behavior under the selected policy.

Overrefusal (eval)

Measure how often benign requests are incorrectly refused.

Refusal profile (eval)

Measure refusal/overrefusal rates across predefined categories.
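
The overrefusal and refusal-profile descriptions above both reduce to simple ratios. As a rough sketch, assuming each evaluated prompt has been labeled with a category, whether it is benign, and whether the model refused, the rates per category could be computed as follows. The sample records, the category names, and the choice to define the refusal rate over non-benign prompts only are assumptions made for illustration; the page does not specify how these evals are scored.

```python
from collections import defaultdict

# Hypothetical labeled results: (category, prompt_is_benign, model_refused).
RESULTS = [
    ("medical", True, False),
    ("medical", True, True),     # benign request refused: an overrefusal
    ("medical", False, True),    # harmful request refused: a correct refusal
    ("security", True, False),
    ("security", False, True),
    ("security", False, False),  # harmful request answered
]


def rates_by_category(results):
    """Return {category: (refusal_rate, overrefusal_rate)}.

    refusal_rate     = refused harmful prompts / all harmful prompts
    overrefusal_rate = refused benign prompts  / all benign prompts
    A rate is None when its denominator is zero.
    """
    counts = defaultdict(
        lambda: {"harmful": 0, "harmful_refused": 0, "benign": 0, "benign_refused": 0}
    )
    for category, is_benign, refused in results:
        kind = "benign" if is_benign else "harmful"
        counts[category][kind] += 1
        if refused:
            counts[category][kind + "_refused"] += 1

    rates = {}
    for category, c in counts.items():
        refusal = c["harmful_refused"] / c["harmful"] if c["harmful"] else None
        overrefusal = c["benign_refused"] / c["benign"] if c["benign"] else None
        rates[category] = (refusal, overrefusal)
    return rates


if __name__ == "__main__":
    for category, (refusal, overrefusal) in rates_by_category(RESULTS).items():
        print(category, refusal, overrefusal)
```

Keeping the two rates separate matters because a model can score well on one by sacrificing the other: refusing everything maximizes the refusal rate on harmful prompts but also drives the overrefusal rate to 1.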