Safest & Most Aligned LLMs (2026)

Most safety rankings are measuring the wrong thing. They test whether a model refuses harmful requests — full stop. That's one side of the equation, and it matters. But it ignores the other side: a model that refuses legitimate requests is also unsafe, just in a different direction. A medical information tool that declines to discuss drug interactions "to be safe" is not safe — it's a liability with a safety badge on it.

The Based score measures both. It penalizes over-refusal as well as under-refusal, and it rewards models that have calibrated their safety behavior to be genuinely useful without being reckless. A high Based score means the model can handle sensitive topics thoughtfully, not that it handles nothing.

The Two Failure Modes of Safety

Under-refusal is the obvious one: the model produces content it shouldn't — harmful instructions, discriminatory outputs, dangerous medical advice, privacy violations. This is what makes headlines and what most safety evaluations measure.

Over-refusal is the silent failure: the model declines things it should handle — basic medical questions, legal explanations, historical atrocities discussed in educational contexts, creative writing with conflict or moral complexity, any topic that touches a keyword in its training. This failure mode doesn't make headlines but it makes products unusable.

Both have real costs. Under-refusal creates harm. Over-refusal creates useless products, breaks user trust, and — in high-stakes contexts like medical or legal assistance — can cause harm by withholding genuinely needed information.

The models with the highest Based scores are typically not the most restricted models — they're the most calibrated ones. They have clearly internalized which topics are genuinely dangerous versus merely uncomfortable, and they can engage thoughtfully with the latter while declining the former.

How We Score It

Based is the most opinionated dimension in our scoring system because it requires taking positions on where the right calibration point is. Our approach:

Refusal calibration benchmarks measure false positive rate on legitimate requests — how often a model refuses things it should handle. A high false positive rate is penalized proportionally.

Safety evaluation suites measure true positive rate on genuinely harmful requests — how often a model correctly declines things it shouldn't do. Low true positive rate is penalized more heavily because this is the more serious failure mode.

Bias and consistency evaluations test whether the model applies its policies consistently across demographic groups, ideological positions, and topic areas. Inconsistent application of safety policies is itself a safety problem.

The Based score is intentionally weighted so that a model cannot achieve a high score purely by refusing everything. That's the key design choice — we're measuring genuine safety behavior, not risk minimization theater.

Current Rankings

Based Rankings

Safety alignment & refusal calibration

Top 25 · Live

#	Model	Based	Confidence	IQ	EQ
1	meta-llama/Meta-Llama-3-8B-Instruct	100.0	6%	—	76.8
2	external/google/gemini-3-flash-preview	100.0	7%	79.1	—
3	external/deepseek/deepseek-r1	100.0	16%	61.9	35.7
4	deepseek-ai/DeepSeek-V3.2-Speciale	100.0	7%	—	—
5	external/anthropic/claude-sonnet-4-6	94.0	7%	85.1	71.4
6	external/openai/gpt-5-1-2025-11-13	88.0	7%	80.5	—
7	zai-org/GLM-5	88.0	7%	67.2	—
8	moonshotai/Kimi-K2-Thinking	86.4	7%	—	—
9	deepseek-ai/DeepSeek-V3-0324	86.4	7%	—	—
10	moonshotai/Kimi-K2-Instruct	82.0	7%	59.1	65.4
11	external/xai-org/grok-4-1-fast-reasoning	82.0	7%	56.0	85.5
12	deepseek-ai/DeepSeek-R1-0528	80.7	7%	—	—
13	moonshotai/Kimi-K2-Instruct-0905	76.0	7%	—	—
14	external/x-ai/grok-3	76.0	7%	76.5	86.3
15	zai-org/GLM-4.5	76.0	7%	70.6	93.3
16	external/kimi/kimi-k2-thinking	76.0	7%	64.4	—
17	meta-llama/Llama-3.1-70B-Instruct	75.4	13%	62.8	—
18	external/deepseek-ai/deepseek-v3	72.1	16%	41.8	—
19	external/google/gemini-3-1-pro-preview	71.0	7%	89.7	—
20	external/openai/gpt-4-1-20250414	71.0	7%	68.6	24.5
21	meta-llama/Llama-3.3-70B-Instruct	70.2	13%	58.8	—
22	darkc0de/XORTRON.CriminalComputing.LARGE.2026.3	65.0	7%	—	—
23	inclusionAI/Ring-1T	65.0	7%	—	—
24	shloppenheimer/GLM-4.6-Derestricted-v3	65.0	7%	—	—
25	external/xai-org/grok-4-fast-reasoning	65.0	7%	58.9	—

What the Data Shows

Refusal rates have increased significantly across the industry over the past two years. The average model today refuses more requests than models of equivalent capability from 2023. Some of this is genuine safety improvement; some of it is liability-driven over-restriction that degrades utility without improving safety.

The best-calibrated models are often from providers with the clearest public policies. Models backed by well-documented usage policies and explicit refusal criteria tend to be more consistently calibrated than models from providers with vague or unpublished safety guidelines. Clear policies create trainable targets.

Fine-tuning for helpfulness sometimes corrects over-refusal, sometimes trades safety for it. The relationship between "helpful" fine-tuning and Based score is nonlinear. Aggressive helpfulness training on some models reduces over-refusal but also reduces appropriate refusal. The models with the best Based scores are those where the safety and helpfulness objectives have been jointly optimized, not treated as a tradeoff.

Open-weight models show higher variance on Based score than on other dimensions. This reflects the wide range of fine-tuning approaches applied by the community. Some community fine-tunes dramatically improve calibration; others remove safety behaviors entirely. Our scores reflect the base-model release where fine-tuning details are specified.

When Based Score Matters Most

Based score should be a filter — not just a ranking signal — for:

Enterprise and regulated deployments — Compliance requirements aren't satisfied by a model that sometimes refuses legitimate queries. You need consistent, predictable behavior.
High-stakes domains — Medical, legal, financial, and HR applications where refusal of legitimate requests has real consequences for users.
Public-facing products — User trust is hard to rebuild. A model that tells users they can't ask legitimate questions creates a bad experience that compounds.
Content moderation workflows — Ironically, the models best suited for content moderation are the ones that have the most nuanced understanding of what actually requires action versus what's just edgy. Blunt instruments create false positive problems.

The Based dimension is explicitly not about finding the model with the fewest restrictions. If you need a model with no safety restrictions for legitimate research or red-teaming purposes, that's a different selection criterion outside BasedAGI's scope. Based score measures whether a model's restrictions are well-calibrated for general deployment.

Methodology & Confidence

Based scoring is methodologically the most complex dimension because there's no objective ground truth for "correct" refusal behavior — only calibration relative to a principled position about what constitutes legitimate versus harmful requests. Our benchmark selection reflects our documented trust policy, which prioritizes benchmarks that measure calibration rather than pure restriction rate.

Full methodology and trust policy at /methodology.