Adult

Adult erotica (long-form, explicit)

Long-form explicit erotica with controllable style and strict boundaries.

task.adult_erotica_explicit, task.creative_story_longform

Best for this use case

gemini-3-pro-preview

Strong on BFCL Memory Official: Memory Acc and BFCL Multi-turn Official: Multi Turn Acc

Best benchmark score: 43.6%

Confidence: 52.4%

All ranked models: top 3

gemini-3-pro-preview: 43.6%

Grok-4-0709: 42.9%

grok-4-1-fast-reasoning: 39.6%

Ranked Models: 30

Evidence Quality: 87%

Evidence Points: 24

Top Signal: BFCL Memory Official: Memory Acc

All Ranked Models

30 of 30 models

Rank	Model	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-3-pro-preview (strong on BFCL Memory Official: Memory Acc and BFCL Multi-turn Official: Multi Turn Acc)	43.6%	52%	$4.50	BFCL Memory Official, BFCL Multi-turn Official
🥈	Grok-4-0709 (strong on BFCL Memory Official: Memory Acc and BFCL Multi-turn Official: Multi Turn Acc)	42.9%	56%	—	BFCL Memory Official, BFCL Multi-turn Official
🥉	grok-4-1-fast-reasoning (strong on BFCL Memory Official: Memory Acc and BFCL Multi-turn Official: Multi Turn Acc)	39.6%	51%	$0.28	BFCL Memory Official, BFCL Multi-turn Official
#4	GLM-4.6 (strong on BFCL Memory Official: Memory Acc and BFCL Multi-turn Official: Multi Turn Acc)	37.0%	43%	—	BFCL Memory Official, BFCL Multi-turn Official
#5	o3-20250416 (strong on BFCL Memory Official: Memory Acc and BFCL Relevance Detection Official: Relevance Detection)	35.4%	53%	$3.50	BFCL Memory Official, BFCL Relevance Detection Official
#6	gpt-4.1-20250414 (strong on BFCL Relevance Detection Official: Relevance Detection and BFCL Memory Official: Memory Acc)	32.4%	57%	—	BFCL Relevance Detection Official, BFCL Memory Official
#7	Kimi-K2-Instruct (strong on BFCL Multi-turn Official: Multi Turn Acc and BFCL Memory Official: Memory Acc)	30.8%	45%	—	BFCL Multi-turn Official, BFCL Memory Official
#8	o4-mini (strong on BFCL Memory Official: Memory Acc and BFCL Relevance Detection Official: Relevance Detection)	28.4%	52%	$1.93	BFCL Memory Official, BFCL Relevance Detection Official
#9	grok-4-1-fast-non-reasoning (strong on BFCL Multi-turn Official: Multi Turn Acc and BFCL Memory Official: Memory Acc)	28.4%	49%	$0.28	BFCL Multi-turn Official, BFCL Memory Official
#10	gpt-5.2-2025-12-11 (strong on BFCL Multi-turn Official: Multi Turn Acc and BFCL Relevance Detection Official: Relevance Detection)	27.8%	51%	—	BFCL Multi-turn Official, BFCL Relevance Detection Official
#11	gemini-2.5-flash (strong on BFCL Memory Official: Memory Acc and BFCL Relevance Detection Official: Relevance Detection)	24.9%	44%	$0.17	BFCL Memory Official, BFCL Relevance Detection Official
#16	claude-opus-4-5-20251101 (strong on UGI Leaderboard: Writing ✍️ and BFCL Relevance Detection Official: Relevance Detection)	22.4%	51%	—	UGI Leaderboard, BFCL Relevance Detection Official
#24	gpt-5-2025-08-07 (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	18.4%	23%	—	UGI Leaderboard
#25	Arch-Agent-32B (strong on BFCL Multi-turn Official: Multi Turn Acc and BFCL Relevance Detection Official: Relevance Detection)	18.2%	33%	—	BFCL Multi-turn Official, BFCL Relevance Detection Official
#27	gemini-2.5-pro (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	17.9%	25%	$3.44	UGI Leaderboard
#28	claude-sonnet-4 (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	17.6%	25%	$6.00	UGI Leaderboard
#29	gemini-3.1-pro-preview (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	17.5%	20%	$4.50	UGI Leaderboard
#30	Llama 3.3 70B Instruct (strong on BFCL Relevance Detection Official: Relevance Detection and BFCL Multi-turn Official: Multi Turn Acc)	17.3%	49%	—	BFCL Relevance Detection Official, BFCL Multi-turn Official
#38	gpt-5.4-2026-03-05 (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	15.2%	18%	—	UGI Leaderboard
#40	claude-sonnet-4.6 (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	15.0%	18%	$6.00	UGI Leaderboard
#43	gemini-3-flash-preview (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	14.7%	19%	$1.13	UGI Leaderboard
#45	kimi-k2.5-thinking (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	14.3%	18%	—	UGI Leaderboard
#53	Llama-4-Scout-17B-16E-Instruct (strong on BFCL Relevance Detection Official: Relevance Detection and BFCL Memory Official: Memory Acc)	13.3%	40%	—	BFCL Relevance Detection Official, BFCL Memory Official
#55	gemini-2.5-flash-lite (strong on BFCL Memory Official: Memory Acc and BFCL Relevance Detection Official: Relevance Detection)	13.1%	40%	$0.17	BFCL Memory Official, BFCL Relevance Detection Official
#56	gpt-5.1-2025-11-13 (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	13.1%	18%	—	UGI Leaderboard
#58	qwen-2.5-72b-instruct (strong on EQ-Bench Leaderboard: judgemark_score and Galileo Agent Leaderboard v2: Avg AC)	13.0%	26%	—	EQ-Bench Leaderboard, Galileo Agent Leaderboard v2
#63	grok-4-fast-reasoning (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	12.5%	20%	$0.28	UGI Leaderboard
#66	Arch-Agent-3B (strong on BFCL Multi-turn Official: Multi Turn Acc and BFCL Relevance Detection Official: Relevance Detection)	12.5%	33%	—	BFCL Multi-turn Official, BFCL Relevance Detection Official
#69	Kimi K2 Thinking (strong on UGI Leaderboard: Writing ✍️ and UGI Leaderboard: Entertainment)	12.1%	16%	$1.07	UGI Leaderboard
#71	Arch-Agent-1.5B (strong on BFCL Relevance Detection Official: Relevance Detection and BFCL Multi-turn Official: Multi Turn Acc)	11.8%	33%	—	BFCL Relevance Detection Official, BFCL Multi-turn Official

Ranking diagnostics & missing models

Source lift

Ranked: 47

Sources: 8

Quality: Good

UGI Leaderboard: 36 rows · 1.9% avg lift

Vals LiveCodeBench: 25 rows · 0.4% avg lift

Vals Legal Bench: 24 rows · 0.4% avg lift

Vals Tax Eval v2: 24 rows · 0.4% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.adult_erotica_explicit, task.creative_story_longform

Required modes

mode.persona_memory, mode.long_context

Domains

domain.creative_writing

Related in Adult

Adult ERP roleplay (explicit)

Explicit adult roleplay with boundary adherence and persona memory.

🥇 Grok-4-0709