BasedAGI
Adult

Adult ERP roleplay (explicit)

Explicit adult roleplay with boundary adherence and persona memory.

task.adult_erotica_explicit
task.persona_consistency

Best for this use case

Grok-4-0709

Strong on BFCL Memory Official: Memory Acc and BFCL Relevance Detection Official: Relevance Detection

49.5%

Best benchmark score

63.2%

Confidence

Ranked Models

30

Evidence Quality

85%

Evidence Points

25

Top Signal

BFCL Memory Official: Memory Acc

All Ranked Models

30 of 30 models
Rank | Model | Score | Strong on
🥇 | Grok-4-0709 | 49.5% | BFCL Memory Official: Memory Acc; BFCL Relevance Detection Official: Relevance Detection
🥈 | gemini-3-pro-preview | 49.3% | BFCL Memory Official: Memory Acc; BFCL Multi-turn Official: Multi Turn Acc
🥉 | grok-4-1-fast-reasoning | 44.7% | BFCL Memory Official: Memory Acc; BFCL Multi-turn Official: Multi Turn Acc
#4 | o3-20250416 | 41.3% | BFCL Memory Official: Memory Acc; BFCL Relevance Detection Official: Relevance Detection
#5 | GLM-4.6 | 41.3% | BFCL Memory Official: Memory Acc; BFCL Multi-turn Official: Multi Turn Acc
#6 | gpt-4.1-20250414 | 35.6% | BFCL Relevance Detection Official: Relevance Detection; BFCL Memory Official: Memory Acc
#7 | Kimi-K2-Instruct | 34.6% | BFCL Memory Official: Memory Acc; BFCL Multi-turn Official: Multi Turn Acc
#8 | o4-mini | 33.0% | BFCL Memory Official: Memory Acc; BFCL Relevance Detection Official: Relevance Detection
#9 | grok-4-1-fast-non-reasoning | 31.9% | BFCL Memory Official: Memory Acc; BFCL Multi-turn Official: Multi Turn Acc
#10 | gpt-5.2-2025-12-11 | 31.3% | BFCL Multi-turn Official: Multi Turn Acc; BFCL Relevance Detection Official: Relevance Detection
#11 | gemini-2.5-flash | 28.8% | BFCL Memory Official: Memory Acc; BFCL Relevance Detection Official: Relevance Detection
#15 | claude-opus-4-5-20251101 | 25.8% | UGI Leaderboard: Writing ✍️; BFCL Relevance Detection Official: Relevance Detection
#22 | gpt-5-2025-08-07 | 21.8% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#24 | gemini-2.5-pro | 21.2% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#25 | claude-sonnet-4 | 20.8% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#26 | gemini-3.1-pro-preview | 20.7% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#28 | Llama 3.3 70B Instruct | 19.7% | BFCL Relevance Detection Official: Relevance Detection; BFCL Multi-turn Official: Multi Turn Acc
#29 | Arch-Agent-32B | 19.6% | BFCL Multi-turn Official: Multi Turn Acc; BFCL Relevance Detection Official: Relevance Detection
#35 | gpt-5.4-2026-03-05 | 18.0% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#36 | claude-sonnet-4.6 | 17.7% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#38 | gemini-3-flash-preview | 17.4% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#41 | kimi-k2.5-thinking | 16.9% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#53 | gpt-5.1-2025-11-13 | 15.5% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#54 | Llama-4-Scout-17B-16E-Instruct | 15.4% | BFCL Relevance Detection Official: Relevance Detection; BFCL Memory Official: Memory Acc
#55 | gemini-2.5-flash-lite | 15.3% | BFCL Memory Official: Memory Acc; BFCL Relevance Detection Official: Relevance Detection
#56 | grok-4-fast-reasoning | 14.8% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#61 | Kimi K2 Thinking | 14.3% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#68 | grok-3 | 13.6% | UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment
#69 | Arch-Agent-3B | 13.5% | BFCL Multi-turn Official: Multi Turn Acc; BFCL Relevance Detection Official: Relevance Detection
#70 | qwen-2.5-72b-instruct | 13.3% | EQ-Bench Leaderboard: judgemark_score; Galileo Agent Leaderboard v2: Avg AC

Ranking diagnostics & missing models

Source lift

Ranked

67

Sources

8

Quality

Good

UGI Leaderboard

44 rows · 2.2% avg lift

Vals Legal Bench

36 rows · 0.5% avg lift

Vals MedQA

35 rows · 0.5% avg lift

Vals LiveCodeBench

35 rows · 0.5% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.adult_erotica_explicit
task.persona_consistency

Required modes

mode.persona_memory

Domains

domain.creative_writing

Related in Adult