Hallucination is one of the most expensive failure modes in production AI systems. Not because the obvious cases are hard to detect — a model that confidently states the wrong date for a historical event is easy to catch. The real problem is confident, plausible, nearly-correct hallucination: a citation that almost exists, a legal clause that almost says what the model claims, a drug interaction that almost matches what was studied. These failures are expensive, and they occur in proportion to how fluent and confident the model sounds.
The Accuracy score measures resistance to this. Not just "does it get facts right on average" but specifically how well it avoids generating plausible-sounding false content — which is harder to benchmark and more important to get right.
The Hallucination Problem in Depth
Several fundamentally different failure modes get lumped together under "hallucination":
Fabrication — The model invents information that doesn't exist: a paper that was never written, a person who never held a position, a statistic with no source. This is the obvious failure mode and the easiest to catch in evaluation.
Confabulation — The model accurately retrieves a real thing (a person, a study, an event) but attributes incorrect properties to it: the right author, wrong paper; the right company, wrong acquisition date. This is harder to detect because everything around the error is correct.
Conflation — The model merges two real things into one: two similar studies become one study with mixed results, two products become a hybrid product with features from both. The individual facts exist somewhere; the combination doesn't.
Grounding failure in RAG — When operating on provided documents, the model generates answers that aren't supported by the source material, or subtly overstates what the source says. This is critical for enterprise RAG deployments.
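The grounding failure mode can be checked mechanically, at least roughly. The sketch below is a minimal illustration, not how any of the benchmarks discussed here are implemented: it flags answer sentences whose content words barely overlap with the source document. Production systems use NLI or judge models rather than token overlap, and every name and threshold here is illustrative.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "was", "and", "that", "it"}

def content_words(text: str) -> set:
    """Lowercase word tokens minus a tiny stopword list."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def unsupported_sentences(answer: str, source: str, threshold: float = 0.5):
    """Return answer sentences whose content-word overlap with the source
    falls below `threshold` -- a crude proxy for grounding failure."""
    source_vocab = content_words(source)
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sent)
        if not words:
            continue
        overlap = len(words & source_vocab) / len(words)
        if overlap < threshold:
            flagged.append(sent)
    return flagged

# Illustrative example: the second sentence is invented content.
source = "The contract grants a 30-day termination notice period to either party."
answer = "The contract requires 30-day notice. It also awards liquidated damages of $5 million."
print(unsupported_sentences(answer, source))
# → ["It also awards liquidated damages of $5 million."]
```

Note what this heuristic misses: the first sentence swaps "grants" for "requires" — exactly the subtle overstatement that makes grounding failures hard to catch with surface-level checks.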
The models that score highest on factual accuracy are often not the ones with the highest general benchmark scores. There's a real accuracy-fluency tradeoff — more fluent models sometimes generate more plausible-sounding errors. Accuracy benchmarks specifically measure the failure mode that high fluency enables.
How We Score Accuracy
The Accuracy dimension aggregates across grounding and hallucination-specific benchmarks:
FACTS Grounding evaluates whether model responses are factually grounded in provided source material. It's designed to directly measure the RAG failure mode — does the model stick to what it's been given, or does it augment with invented content?
Vectara HHEM (Hughes Hallucination Evaluation Model) measures hallucination rate across summarization tasks — a particularly hallucination-prone task type where models are tempted to "fill in" details not present in the source.
SimpleQA is OpenAI's short-form factuality benchmark: direct factual questions with clear, verifiable answers. Models that score well here have genuinely strong factual recall, not just good pattern-matching.
HLE (Humanity's Last Exam) is the hardest factual benchmark currently available — expert-level questions across disciplines where surface-level pattern matching is useless. High HLE scores are a strong signal of deep factual grounding.
Current Rankings
What the Data Shows
Accuracy and IQ are correlated but not redundant. Models with higher reasoning ability tend to hallucinate less — reasoning helps catch and correct internally inconsistent statements. But the correlation is imperfect, and there are notable outliers in both directions.
Fine-tuning for helpfulness can hurt accuracy. Models trained aggressively to be helpful and to always produce a response sometimes learn to produce plausible-sounding content rather than admit uncertainty. The best-calibrated models are often the ones explicitly trained to say "I don't know" — which turns out to be a teachable behavior.
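Whether a model "knows when it doesn't know" can be measured directly: split responses into abstentions and attempts, then compute the error rate among attempts. A minimal sketch, assuming abstentions are detected by a leading uncertainty phrase (the phrase list and the sample records are illustrative):

```python
ABSTAIN_MARKERS = ("i don't know", "i'm not sure", "cannot verify")

def is_abstention(response: str) -> bool:
    """Treat a response as an abstention if it opens with an uncertainty phrase."""
    return response.lower().startswith(ABSTAIN_MARKERS)

def calibration_report(records):
    """`records` is a list of (response_text, is_correct) pairs.
    Returns the abstention rate and the error rate among attempted answers."""
    abstained = [r for r, _ in records if is_abstention(r)]
    attempted = [(r, ok) for r, ok in records if not is_abstention(r)]
    errors = sum(1 for _, ok in attempted if not ok)
    return {
        "abstention_rate": len(abstained) / len(records),
        "attempted_error_rate": errors / len(attempted) if attempted else 0.0,
    }

# Illustrative records: an abstention counts as neither right nor wrong here.
records = [
    ("The paper was published in 2019.", True),
    ("I don't know the exact publication year.", False),
    ("The author is J. Smith.", False),  # plausible but wrong
    ("The study used 412 participants.", True),
]
print(calibration_report(records))
# → {'abstention_rate': 0.25, 'attempted_error_rate': 0.333...}
```

The design choice matters: scoring abstentions as errors pushes models toward guessing, while excluding them from the denominator, as above, rewards honest uncertainty.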
RAG doesn't solve hallucination, it moves it. Models that score poorly on accuracy in free-form generation often score poorly on FACTS Grounding too. The underlying tendency to generate unsupported content persists even when source material is provided; the model just hallucinates from the document instead of from training data.
Citation behavior is a weak proxy. A model that generates citations hallucinates less on average, but only slightly. The relationship between "produces citations" and "produces accurate citations" is unreliable enough that citation presence isn't a meaningful signal.
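To see why the two behaviors diverge, it helps to separate "cites something" from "cites something that resolves". A toy sketch, with a hypothetical lookup set standing in for a real Crossref or arXiv query (the identifiers below are illustrative, and the second arXiv ID is deliberately invented):

```python
import re

# Hypothetical stand-in for a live identifier lookup (Crossref / arXiv API).
KNOWN_IDS = {"arXiv:2303.08774", "10.1038/nature14539"}

# Matches arXiv IDs and simple DOIs; a rough pattern for illustration only.
CITATION_RE = re.compile(r"(arXiv:\d{4}\.\d{4,5}|10\.\d{4,9}/[^\s,;.)]+)")

def citation_report(text: str):
    """Count citations found in `text` versus citations that resolve."""
    cited = CITATION_RE.findall(text)
    resolvable = [c for c in cited if c in KNOWN_IDS]
    return {"cited": len(cited), "resolvable": len(resolvable)}

answer = ("See arXiv:2303.08774 and the follow-up arXiv:2399.99999, "
          "plus 10.1038/nature14539.")
print(citation_report(answer))
# → {'cited': 3, 'resolvable': 2}
```

A model can score high on the first count and poorly on the second, which is exactly why citation presence alone is a weak signal.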
When Accuracy Is the Right Signal
Accuracy should be your primary dimension when:
- Legal document analysis and contract review — Factual errors in a legal context have real consequences. A model that misreads what a clause says, or invents a precedent, is worse than useless.
- Medical and clinical information — Accuracy here is safety-critical. Drug interactions, contraindications, dosing — this is the domain where hallucination is most dangerous.
- Financial analysis and filings — Earnings figures, regulatory language, compliance requirements. Wrong numbers in a financial context are expensive.
- RAG and document Q&A systems — If your use case involves querying documents for factual information, accuracy is your primary signal. A model that can't stay grounded in provided sources will erode user trust rapidly.
- Fact-checking and research verification — Any workflow where the output will be relied upon as factually correct, not just plausible.
Accuracy and Creativity are the most strongly inversely correlated dimensions. Models optimized for creative, expressive output tend to score lower on factual accuracy. Don't optimize for both simultaneously — pick the one that matches your use case.
Methodology & Confidence
Accuracy scores weight hallucination-specific benchmarks more heavily than general knowledge benchmarks. A model that performs well on general QA but poorly on grounding tasks will score lower than one with the reverse profile — grounding failure is the production failure mode we're most concerned with.
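As a sketch of this kind of weighting (the weights below are illustrative, not the published ones), the score is a weighted mean that renormalizes over whichever benchmarks a model actually has results for:

```python
# Illustrative weights: grounding/hallucination benchmarks count more than
# general factual-recall benchmarks. These are NOT the published weights.
WEIGHTS = {
    "facts_grounding": 0.35,
    "vectara_hhem": 0.30,   # hallucination rate, already inverted to "accuracy"
    "simpleqa": 0.20,
    "hle": 0.15,
}

def accuracy_score(scores: dict) -> float:
    """Weighted mean over the benchmarks present in `scores`,
    renormalizing weights so a missing benchmark doesn't drag the score down."""
    present = {k: v for k, v in scores.items() if k in WEIGHTS}
    total_w = sum(WEIGHTS[k] for k in present)
    return sum(WEIGHTS[k] * v for k, v in present.items()) / total_w

print(accuracy_score({"facts_grounding": 0.82, "vectara_hhem": 0.95,
                      "simpleqa": 0.43, "hle": 0.12}))
# → 0.676
```

With this shape, a model strong on FACTS Grounding and HHEM but mediocre on SimpleQA outranks the reverse profile, matching the stated emphasis on grounding failures.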
Full methodology at /methodology.