BasedAGI
Companion

Empathetic support chat

Supportive conversation with strong boundaries and safe escalation.

task.empathy_support_dialogue
task.session_memory_consistency

Best for this use case

gemini-3-pro-preview

Strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc

48.7%

Best benchmark score

58.8%

Confidence

Ranked Models

30

Evidence Quality

88%

Evidence Points

24

Top Signal

BFCL Memory Official: Memory Acc

All Ranked Models

30 of 30 models
Rank · Model · Score
🥇 gemini-3-pro-preview

Strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc

48.7%
🥈 Grok-4-0709

Strong on BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection

48.3%
🥉 grok-4-1-fast-reasoning

Strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc

45.0%
#4 GLM-4.6

Strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc

42.0%
#5 o3-20250416

Strong on BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection

40.7%
#6 gpt-4.1-20250414

Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Memory Official Memory Acc

35.6%
#7 Kimi-K2-Instruct

Strong on BFCL Multi-turn Official Multi Turn Acc and BFCL Memory Official Memory Acc

35.3%
#8 o4-mini

Strong on BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection

33.6%
#9 grok-4-1-fast-non-reasoning

Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Multi-turn Official Multi Turn Acc

33.0%
#10 gpt-5.2-2025-12-11

Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Multi-turn Official Multi Turn Acc

31.9%
#12 gemini-2.5-flash

Strong on BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection

30.2%
#18 claude-opus-4-5-20251101

Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Relevance Detection Official Irrelevance Detection

26.0%
#24 Arch-Agent-32B

Strong on BFCL Multi-turn Official Multi Turn Acc and BFCL Relevance Detection Official Relevance Detection

23.1%
#28 Llama 3.3 70B Instruct

Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Multi-turn Official Multi Turn Acc

21.7%
#44 gpt-5-2025-08-07

Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

18.0%
#46 Llama-4-Scout-17B-16E-Instruct

Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Memory Official Memory Acc

18.0%
#50 gemini-2.5-pro

Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

17.5%
#52 claude-sonnet-4

Strong on UGI Leaderboard Writing ✍️ and Galileo Agent Leaderboard v2 Avg AC

17.3%
#54 gemini-2.5-flash-lite

Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Relevance Detection Official Irrelevance Detection

17.3%
#55 gemini-3.1-pro-preview

Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

17.0%
#61 Arch-Agent-3B

Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Multi-turn Official Multi Turn Acc

16.5%
#62 Arch-Agent-1.5B

Strong on BFCL Relevance Detection Official Relevance Detection and BFCL Multi-turn Official Multi Turn Acc

16.2%
#66 gpt-5.4-2026-03-05

Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

14.6%
#68 claude-sonnet-4.6

Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

14.4%
#71 gemini-3-flash-preview

Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

14.2%
#75 kimi-k2.5-thinking

Strong on UGI Leaderboard Entertainment and UGI Leaderboard Writing ✍️

13.7%
#83 gpt-5.1-2025-11-13

Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

12.7%
#89 grok-4-fast-reasoning

Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

12.0%
#93 Kimi K2 Thinking

Strong on UGI Leaderboard Writing ✍️ and UGI Leaderboard Entertainment

11.5%
#100 grok-3

Strong on UGI Leaderboard Entertainment and UGI Leaderboard Writing ✍️

10.8%

Ranking diagnostics & missing models

Source lift

Ranked

45

Sources

8

Quality

Good

UGI Leaderboard

34 rows · 1.7% avg lift

Vals Legal Bench

25 rows · 0.5% avg lift

Vals LiveCodeBench

25 rows · 0.4% avg lift

Vals MedQA

24 rows · 0.5% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.empathy_support_dialogue
task.session_memory_consistency

Required modes

mode.persona_memory

Domains

domain.general_business
