Transparency
How we rank models
Every score on BasedAGI is derived from public benchmark data. No vibes, no sponsors, no pay-to-rank. This page explains exactly how models are scored, ranked, and surfaced.
The Big Picture
BasedAGI maps real-world workflows (use cases) to the LLMs best suited to them. Each use case defines which benchmarks matter and how much weight each metric carries. Models are scored purely from benchmark measurements — no manual curation, no hidden adjustments.
The result: a use-case-specific score for every model, backed by traceable evidence you can inspect down to the individual benchmark row.
Scoring Pipeline
Benchmark Ingestion
We ingest structured data from public benchmark leaderboards. Each source has a trust tier, refresh SLA, and reliability weight. Snapshots are captured periodically and deduplicated.
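To make the ingestion contract concrete, here is a minimal sketch of what a source record carries; the field names are illustrative, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkSource:
    """Illustrative shape of an ingested source record (hypothetical field names)."""
    name: str                  # e.g. "SWE-bench leaderboard"
    trust_tier: str            # "primary" | "secondary" | "experimental"
    refresh_sla_days: int      # how stale a snapshot may get before re-ingestion
    reliability_weight: float  # 0.0-1.0; scales this source's influence on rankings
```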
Model Matching
Raw benchmark rows are matched to canonical model identities using fuzzy matching on name, author, and parameter count. Match rates are tracked per source.
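The matching step can be approximated with ordinary string similarity. A minimal sketch using Python's difflib, matching on name only (the real matcher also considers author and parameter count, and the 0.85 threshold here is hypothetical):

```python
from difflib import SequenceMatcher

def match_model(row_name: str, canonical_names: list[str],
                threshold: float = 0.85) -> str | None:
    """Return the best canonical match for a raw benchmark row name, or None."""
    best_name, best_score = None, 0.0
    for name in canonical_names:
        score = SequenceMatcher(None, row_name.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```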
Metric Normalization
Raw metric values are normalized to a 0–1 scale within each benchmark metric. This makes scores comparable across benchmarks with different scales.
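One standard way to do this is min-max scaling within each metric; this sketch assumes that approach (the actual normalization may differ):

```python
def normalize_metric(values: dict[str, float]) -> dict[str, float]:
    """Min-max scale one benchmark metric's raw values to 0-1 across models.

    Degenerate case (all models scored identically) maps everything to 0.5.
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {model: 0.5 for model in values}
    return {model: (v - lo) / (hi - lo) for model, v in values.items()}
```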
Weight Application
Each use case has a set of benchmark-metric weights that reflect which capabilities matter. Weights are defined in versioned presets.
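A versioned preset can be pictured as a mapping from (benchmark, metric) pairs to weights. The benchmark names below appear elsewhere on this page; the use case, metric names, and weights are made up for illustration:

```python
# Hypothetical preset for a hypothetical "code-assistant" use case.
CODE_ASSISTANT_PRESET = {
    "use_case": "code-assistant",
    "version": "2025-06-01",  # presets are versioned
    "weights": {
        ("SWE-bench", "resolved_rate"): 0.5,      # metric name is illustrative
        ("BFCL", "function_call_accuracy"): 0.3,  # metric name is illustrative
        ("MMLU-Pro", "accuracy"): 0.2,
    },
}
```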
Score Aggregation
The final use-case score is a weighted average: Σ(value_normalized × weight) / Σ(weight). Confidence is derived from evidence coverage and weight diversity.
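In code, the aggregation is a plain weighted mean over a model's evidence; a minimal sketch:

```python
def use_case_score(evidence: list[tuple[float, float]]) -> float:
    """Weighted average of (value_normalized, weight) pairs: sum(v*w) / sum(w)."""
    total_weight = sum(w for _, w in evidence)
    if total_weight == 0:
        raise ValueError("no weighted evidence for this model")
    return sum(v * w for v, w in evidence) / total_weight
```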
Ranking & Quality Check
Models are ranked by score. Each use case gets an evidence quality assessment based on ranked model count, average evidence, and coverage ratio.
Scoring Formula
score = Σ(value_normalized × weight) / Σ(weight)
value_normalized: the model's metric score scaled to 0–1 within that benchmark.
weight: how much this benchmark metric matters for this use case.
confidence: derived from evidence count and weight coverage. Higher = more data.
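The exact confidence formula isn't spelled out on this page. Purely to illustrate the shape of such a function, here is one plausible combination of the two stated ingredients (entirely hypothetical, including the saturation point):

```python
def confidence(evidence_count: int, weight_coverage: float,
               saturation: int = 10) -> float:
    """Hypothetical confidence: evidence count saturates; coverage scales it.

    weight_coverage is the fraction of the use case's total weight that has data.
    """
    evidence_term = min(evidence_count, saturation) / saturation
    return evidence_term * weight_coverage
```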
Utility Score (Global Ranking)
The Utility Score shown on the Model Rankings page is a cross-use-case aggregate: Σ(use_case_score × confidence) / Σ(confidence), taken over every use case where the model is ranked.
This confidence-weighted average rewards models that perform well on use cases where the evidence is strong, and discounts scores backed by thin evidence.
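Mechanically, this is the same weighted-mean pattern as the use-case score, with confidence playing the role of the weight; a minimal sketch:

```python
def utility_score(per_use_case: list[tuple[float, float]]) -> float:
    """Confidence-weighted average of (use_case_score, confidence) pairs."""
    total_conf = sum(c for _, c in per_use_case)
    if total_conf == 0:
        raise ValueError("no confident use-case scores for this model")
    return sum(s * c for s, c in per_use_case) / total_conf
```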
Why Confidence-Weighted Scoring
Confidence-weighted scoring is more robust than vote-based leaderboards because it preserves uncertainty instead of flattening every judgment into a binary win or loss. A vote-only system rewards marginal, noisy advantages exactly the same way it rewards decisive ones, which makes rankings more sensitive to matchup variance, sample composition, and evaluator noise. By weighting contributions by confidence, the scoring layer can treat weak evidence as weak evidence, so a model climbs because it wins clearly and consistently, not because it accumulates a thin edge across many close calls.
This matters because leaderboards are vulnerable to distortion long before the underlying model quality actually changes. The Leaderboard Illusion argues that benchmark ecosystems can be skewed by selective disclosure and private testing practices, producing rankings that overstate real differences between systems. Confidence-weighted scoring is a practical countermeasure because it reduces the leverage of brittle wins: if the evidence behind an apparent lead is sparse, inconsistent, or low-confidence, the ranking impact should be correspondingly smaller. That makes the scoring surface harder to game and less likely to mistake benchmark noise for genuine product advantage.
For a model-routing product, confidence-weighted scoring also produces better explanations. Users do not just need to know which model ranked first; they need to know whether the margin is trustworthy for their use case. A confidence-aware rank can separate high-signal leaders from provisional ones, which is materially better than a vote-based table that implies false precision. The inference we draw from The Leaderboard Illusion is that if benchmark governance problems can inflate public leaderboard positions, ranking systems should explicitly encode evidence strength, coverage, and uncertainty rather than treat every comparison as equally reliable.
Citation: The Leaderboard Illusion, arXiv:2504.20879 (Apr 2025)
The Five Intelligence Dimensions
IQ
Reasoning, math, and STEM problem-solving.
GPQA, MMLU-Pro, MathArena, GAIA, BizFinBench.
EQ
Emotional and social intelligence.
EQ-Bench, theory-of-mind benchmarks, CRMArena.
Accuracy
Factual correctness, code execution, and structured output reliability.
SWE-bench, BFCL, hallucination benchmarks, HELM ECE calibration.
Creativity
Creative and generative quality.
EQ-Bench creative writing, Judgemark, UGI.
Based
Epistemic honesty, sycophancy resistance, and safety behavior.
TruthfulQA, SYCON Bench, OR-Bench, SycEval.
Benchmark Sources
Total Sources: 195
Active: 180
Weighted: 108
Use Cases: 143+
Trust Tiers
Each benchmark source receives a reliability_weight from 0.0 to 1.0. In practice, peer-reviewed papers usually land around 0.75–0.90, official leaderboards around 0.80–0.92, and community leaderboards around 0.65–0.80.
Those weights are enforced through the governance pipeline exposed by /ops/source-governance. Sources below 0.7 still contribute, but materially less to final rankings than higher-trust sources.
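One plausible way a reliability_weight enters the math is by scaling a source's preset weight before aggregation; this sketch shows the idea, not the production mechanism:

```python
def effective_weight(preset_weight: float, reliability_weight: float) -> float:
    """Illustrative: a source's evidence weight, scaled by its reliability."""
    return preset_weight * reliability_weight

# For the same preset weight, a 0.65-reliability community leaderboard moves
# rankings noticeably less than a 0.90-reliability peer-reviewed source:
assert effective_weight(0.5, 0.65) < effective_weight(0.5, 0.90)
```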
Primary (18)
Well-established benchmarks with regular updates, high model coverage, and structured data feeds.
Secondary (159)
Good coverage but less frequent updates or narrower model selection.
Experimental (3)
New or niche benchmarks under evaluation. Lower weight until proven reliable.
Evidence Quality
Each use case receives an evidence quality grade based on three factors:
Ranked Model Count
How many models have enough weighted benchmark data to receive a score.
Average Evidence
Mean number of benchmark evidence points per ranked model. Higher = more trustworthy.
Coverage Ratio
What fraction of expected benchmark metrics actually have data for ranked models.
Use cases are graded Sufficient, Insufficient, or Unscored. Insufficient rankings still appear but display a warning banner.
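The grading thresholds aren't published here, so the cutoffs below are invented purely to show how the three factors could combine into a grade:

```python
def evidence_grade(ranked_models: int, avg_evidence: float,
                   coverage_ratio: float) -> str:
    """Hypothetical grading rule; the real cutoffs are internal to the pipeline."""
    if ranked_models == 0:
        return "Unscored"
    if ranked_models >= 5 and avg_evidence >= 3.0 and coverage_ratio >= 0.5:
        return "Sufficient"
    return "Insufficient"
```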
What We Don't Do
- No manual curation or editorial picks
- No sponsored placements or pay-to-rank
- No vibes-based or anecdotal scoring
- No hidden adjustments or secret weights
- No self-reported model performance data