NPC dialogue
Low-latency in-character dialogue suitable for games.
Best for this use case
gemini-3-pro-preview
Strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc
44.2%
Best benchmark score
53.2%
Confidence
All ranked models (top 3)
Ranked Models
30
Evidence Quality
85%
Evidence Points
24
Top Signal
BFCL Memory Official: Memory Acc
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| 🥇 | gemini-3-pro-preview | BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc | 44.2% |
| 🥈 | Grok-4-0709 | BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection | 44.1% |
| 🥉 | grok-4-1-fast-reasoning | BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc | 40.0% |
| #4 | o3-20250416 | BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection | 37.0% |
| #5 | GLM-4.6 | BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc | 36.4% |
| #6 | gpt-4.1-20250414 | BFCL Relevance Detection Official Relevance Detection and BFCL Memory Official Memory Acc | 32.0% |
| #7 | Kimi-K2-Instruct | BFCL Multi-turn Official Multi Turn Acc and BFCL Memory Official Memory Acc | 30.7% |
| #8 | o4-mini | BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection | 29.8% |
| #9 | gemini-2.5-flash | BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection | 28.9% |
| #10 | grok-4-1-fast-non-reasoning | BFCL Multi-turn Official Multi Turn Acc and BFCL Memory Official Memory Acc | 28.7% |
| #11 | gpt-5.2-2025-12-11 | BFCL Multi-turn Official Multi Turn Acc and BFCL Relevance Detection Official Relevance Detection | 28.6% |
| #15 | claude-opus-4-5-20251101 | BFCL Relevance Detection Official Relevance Detection and UGI Leaderboard Writing ✍️ | 23.3% |
| #18 | gemini-2.5-pro | UGI Leaderboard Writing ✍️ and MWS Vision Bench validation_overall_score | 21.3% |
| #24 | gpt-5-2025-08-07 | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment | 19.4% |
| #26 | claude-sonnet-4 | UGI Leaderboard Writing ✍️ and Galileo Agent Leaderboard v2 Avg AC | 18.7% |
| #27 | gemini-3.1-pro-preview | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment | 18.4% |
| #28 | Arch-Agent-32B | BFCL Multi-turn Official Multi Turn Acc and BFCL Relevance Detection Official Relevance Detection | 18.3% |
| #30 | Llama 3.3 70B Instruct | BFCL Relevance Detection Official Relevance Detection and BFCL Multi-turn Official Multi Turn Acc | 17.9% |
| #31 | gemini-3-flash-preview | UGI Leaderboard Writing ✍️ and MWS Vision Bench validation_overall_score | 17.7% |
| #39 | qwen-2.5-72b-instruct | EQ-Bench Leaderboard judgemark_score and Galileo Agent Leaderboard v2 Avg AC | 15.8% |
| #41 | gpt-5.4-2026-03-05 | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment | 15.7% |
| #45 | claude-sonnet-4.6 | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment | 15.4% |
| #52 | Llama-4-Scout-17B-16E-Instruct | BFCL Relevance Detection Official Relevance Detection and BFCL Memory Official Memory Acc | 14.6% |
| #54 | kimi-k2.5-thinking | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment | 14.5% |
| #56 | gemini-2.5-flash-lite | BFCL Relevance Detection Official Relevance Detection and BFCL Memory Official Memory Acc | 14.4% |
| #58 | gpt-4o | EQ-Bench Leaderboard judgemark_score and MEGA-Bench overall_score | 14.2% |
| #59 | gpt-5.1-2025-11-13 | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment | 13.7% |
| #63 | gpt-5-mini-2025-08-07 | MWS Vision Bench validation_overall_score and Vals MedQA overall_accuracy_pct | 13.5% |
| #72 | grok-4-fast-reasoning | UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment | 12.9% |
| #73 | Arch-Agent-3B | BFCL Multi-turn Official Multi Turn Acc and BFCL Relevance Detection Official Relevance Detection | 12.7% |
Ranking diagnostics & missing models
Source lift
Ranked
56
Sources
8
Quality
Good
UGI Leaderboard
Vals Legal Bench
Vals LiveCodeBench
Vals MedQA
Missing frontier models
No obvious gaps right now.
Taxonomy & task details
Core tasks
Required modes
Domains
Related in Creative
Interactive fiction / DM
Run interactive fiction with state tracking and user agency.
SFW roleplay and simulation
Roleplay/simulations for learning or entertainment with state tracking.
Poetry and lyrics
Generate poems and lyrics with style control and variation.
Screenplay scene writing
Write screenplay scenes with formatting, pacing, and strong dialogue.