BasedAGI
Creative

NPC dialogue

Low-latency in-character dialogue suitable for games.

task.dialogue_character_voice, task.persona_consistency

Best for this use case

gemini-3-pro-preview

Strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc

44.2%

Best benchmark score

53.2%

Confidence

Ranked Models

30

Evidence Quality

85%

Evidence Points

24

Top Signal

BFCL Memory Official: Memory Acc

All Ranked Models

30 of 30 models
| Rank | Model | Score | Evidence |
| --- | --- | --- | --- |
| 🥇 | gemini-3-pro-preview | 44.2% | Strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc |
| 🥈 | Grok-4-0709 | 44.1% | Strong on BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection |
| 🥉 | grok-4-1-fast-reasoning | 40.0% | Strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc |
| #4 | o3-20250416 | 37.0% | Strong on BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection |
| #5 | GLM-4.6 | 36.4% | Strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc |
| #6 | gpt-4.1-20250414 | 32.0% | Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Memory Official Memory Acc |
| #7 | Kimi-K2-Instruct | 30.7% | Strong on BFCL Multi-turn Official Multi Turn Acc and BFCL Memory Official Memory Acc |
| #8 | o4-mini | 29.8% | Strong on BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection |
| #9 | gemini-2.5-flash | 28.9% | Strong on BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection |
| #10 | grok-4-1-fast-non-reasoning | 28.7% | Strong on BFCL Multi-turn Official Multi Turn Acc and BFCL Memory Official Memory Acc |
| #11 | gpt-5.2-2025-12-11 | 28.6% | Strong on BFCL Multi-turn Official Multi Turn Acc and BFCL Relevance Detection Official Relevance Detection |
| #15 | claude-opus-4-5-20251101 | 23.3% | Strong on BFCL Relevance Detection Official Relevance Detection and UGI Leaderboard Writing ✍️ |
| #18 | gemini-2.5-pro | 21.3% | Strong on UGI Leaderboard Writing ✍️ and MWS Vision Bench validation_overall_score |
| #24 | gpt-5-2025-08-07 | 19.4% | Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment |
| #26 | claude-sonnet-4 | 18.7% | Strong on UGI Leaderboard Writing ✍️ and Galileo Agent Leaderboard v2 Avg AC |
| #27 | gemini-3.1-pro-preview | 18.4% | Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment |
| #28 | Arch-Agent-32B | 18.3% | Strong on BFCL Multi-turn Official Multi Turn Acc and BFCL Relevance Detection Official Relevance Detection |
| #30 | Llama 3.3 70B Instruct | 17.9% | Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Multi-turn Official Multi Turn Acc |
| #31 | gemini-3-flash-preview | 17.7% | Strong on UGI Leaderboard Writing ✍️ and MWS Vision Bench validation_overall_score |
| #39 | qwen-2.5-72b-instruct | 15.8% | Strong on EQ-Bench Leaderboard judgemark_score and Galileo Agent Leaderboard v2 Avg AC |
| #41 | gpt-5.4-2026-03-05 | 15.7% | Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment |
| #45 | claude-sonnet-4.6 | 15.4% | Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment |
| #52 | Llama-4-Scout-17B-16E-Instruct | 14.6% | Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Memory Official Memory Acc |
| #54 | kimi-k2.5-thinking | 14.5% | Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment |
| #56 | gemini-2.5-flash-lite | 14.4% | Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Memory Official Memory Acc |
| #58 | gpt-4o | 14.2% | Strong on EQ-Bench Leaderboard judgemark_score and MEGA-Bench overall_score |
| #59 | gpt-5.1-2025-11-13 | 13.7% | Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment |
| #63 | gpt-5-mini-2025-08-07 | 13.5% | Strong on MWS Vision Bench validation_overall_score and Vals MedQA overall_accuracy_pct |
| #72 | grok-4-fast-reasoning | 12.9% | Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment |
| #73 | Arch-Agent-3B | 12.7% | Strong on BFCL Multi-turn Official Multi Turn Acc and BFCL Relevance Detection Official Relevance Detection |

Ranking diagnostics & missing models

Source lift

Ranked

56

Sources

8

Quality

Good

UGI Leaderboard

35 rows · 1.6% avg lift

Vals Legal Bench

32 rows · 0.5% avg lift

Vals LiveCodeBench

32 rows · 0.5% avg lift

Vals MedQA

31 rows · 0.5% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.dialogue_character_voice, task.persona_consistency

Required modes

mode.low_latency, mode.persona_memory

Domains

domain.gaming_npc_dialogue
