BasedAGI
Creative

Lore bible generator

Create consistent lore references (timelines, factions, glossaries).

task.worldbuilding_lore_bible, task.json_schema_filling
Evidence quality is currently limited for this use case. Rankings below are useful for exploration, not a strong winner claim.

Provisional leader

gpt-4.1-20250414

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 20.1%

Confidence: 30.6%

Ranked Models: 30

Evidence Quality: 80%

Evidence Points: 21

Top Signal: MMLongBench-Doc Leaderboard: acc_score_pct

All Ranked Models

30 of 30 models
Rank | Model | Score | Strong on
🥇 | gpt-4.1-20250414 | 20.1% | MMLongBench-Doc Leaderboard acc_score_pct and UGI Leaderboard Writing ✍️
🥈 | gpt-5-2025-08-07 | 19.9% | UGI Leaderboard Writing ✍️ and ExtractBench Paper Baselines credit_pass_rate_pct
🥉 | Grok-4-0709 | 19.7% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#4 | gemini-3-pro-preview | 18.6% | BFCL Multi-turn Official Multi Turn Acc and UGI Leaderboard Writing ✍️
#5 | gemini-2.5-pro | 18.5% | UGI Leaderboard Writing ✍️ and MWS Vision Bench validation_overall_score
#6 | qwen-2.5-72b-instruct | 17.5% | EQ-Bench Leaderboard judgemark_score and DuckDB NSQL Leaderboard all_execution_accuracy
#7 | gpt-4o | 16.6% | EQ-Bench Leaderboard judgemark_score and JSONSchemaBench Leaderboard medium_schema_compliance_pct
#8 | o3-20250416 | 16.5% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#9 | claude-sonnet-4 | 16.3% | UGI Leaderboard Writing ✍️ and Galileo Agent Leaderboard v2 Avg AC
#10 | gemini-3.1-pro-preview | 16.0% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#11 | gemini-3-flash-preview | 15.4% | UGI Leaderboard Writing ✍️ and MWS Vision Bench validation_overall_score
#12 | gpt-5.2-2025-12-11 | 15.2% | BFCL Multi-turn Official Multi Turn Acc and UGI Leaderboard Writing ✍️
#13 | grok-4-1-fast-reasoning | 15.0% | BFCL Multi-turn Official Multi Turn Acc and UGI Leaderboard Writing ✍️
#15 | gpt-5.4-2026-03-05 | 13.6% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#16 | claude-sonnet-4.6 | 13.4% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#17 | claude-opus-4-5-20251101 | 13.2% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#19 | kimi-k2.5-thinking | 12.6% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#22 | o4-mini | 12.2% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#23 | Kimi-K2-Instruct | 12.1% | BFCL Multi-turn Official Multi Turn Acc and UGI Leaderboard Entertainment
#24 | gpt-5.1-2025-11-13 | 12.0% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#25 | gemini-2.5-flash | 11.8% | MWS Vision Bench validation_overall_score and Galileo Agent Leaderboard v2 Avg TSQ
#26 | gpt-5-mini-2025-08-07 | 11.7% | MWS Vision Bench validation_overall_score and Vals MedQA overall_accuracy_pct
#33 | GLM-4.6 | 11.2% | BFCL Multi-turn Official Multi Turn Acc and UGI Leaderboard Entertainment
#34 | grok-4-fast-reasoning | 11.2% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#36 | grok-4-1-fast-non-reasoning | 11.0% | BFCL Multi-turn Official Multi Turn Acc and UGI Leaderboard Writing ✍️
#39 | Kimi K2 Thinking | 10.6% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#43 | grok-3 | 10.0% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#46 | gpt-4o-20241120 | 9.7% | MMLongBench-Doc Leaderboard acc_score_pct and DuckDB NSQL Leaderboard all_execution_accuracy
#47 | claude-opus-4 | 9.5% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment
#48 | claude-opus-4-1-20250805 | 9.5% | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

Ranking diagnostics & missing models

Source lift

Ranked: 48

Sources: 8

Quality: Low

UGI Leaderboard: 36 rows · 1.4% avg lift

Vals Legal Bench: 28 rows · 0.4% avg lift

Vals MedQA: 26 rows · 0.4% avg lift

Vals Tax Eval v2: 26 rows · 0.4% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks: task.worldbuilding_lore_bible, task.json_schema_filling

Required modes: mode.long_context, mode.json_schema

Domains: domain.creative_writing

Related in Creative