Multilingual Customer Support

Handling customer queries in multiple languages with cultural awareness.

task.customer_support_dialogue, task.translate_general

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong winner claim.

Provisional leader: claude-sonnet-4

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 25.5%

Confidence: 35.9%

All ranked models: top 3

🥇	claude-sonnet-4	25.5%
🥈	gemini-2.5-flash	25.0%
🥉	gemini-3.1-pro-preview	24.2%

Ranked Models

Evidence Quality

81%

Evidence Points

Top Signal: LanguageBench overall:mean

All Ranked Models

30 of 30 models

Rank	Model	Strengths	Score	Confidence	Price / 1M	Evidence sources
🥇	claude-sonnet-4	Strong on LanguageBench overall:mean and LanguageBench Translation Official (Split) translation_to:bleu	25.5%	36%	$6.00	LanguageBench, LanguageBench Translation Official (Split)
🥈	gemini-2.5-flash	Strong on LanguageBench overall:mean and FACTS Benchmark Suite facts_grounding_score_pct	25.0%	36%	$0.17	LanguageBench, FACTS Benchmark Suite
🥉	gemini-3.1-pro-preview	Strong on SimpleQA Verified simpleqa_verified_score_pct and Vals Finance Agent overall_accuracy_pct	24.2%	28%	$4.50	SimpleQA Verified, Vals Finance Agent
#4	gemini-2.5-pro	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	22.7%	45%	$3.44	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#5	gpt-4.1-20250414	Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Galileo Agent Leaderboard v2 Avg AC	21.0%	31%	n/a	Vectara HHEM Leaderboard, Galileo Agent Leaderboard v2
#6	gpt-5-2025-08-07	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals Finance Agent overall_accuracy_pct	20.4%	28%	n/a	FACTS Benchmark Suite, Vals Finance Agent
#7	gpt-5-mini-2025-08-07	Strong on Vals Finance Agent overall_accuracy_pct and Vals Finance Agent complex_retrieval_accuracy_pct	20.3%	32%	n/a	Vals Finance Agent
#8	Claude-3.5-Sonnet	Strong on LanguageBench overall:mean and LanguageBench Translation Official (Split) translation_to:bleu	19.8%	26%	$6.00	LanguageBench, LanguageBench Translation Official (Split)
#9	Grok-4-0709	Strong on Vals Finance Agent overall_accuracy_pct and SimpleQA Verified simpleqa_verified_score_pct	17.7%	26%	n/a	Vals Finance Agent, SimpleQA Verified
#10	gemini-3-flash-preview	Strong on Vals Finance Agent overall_accuracy_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	17.4%	25%	$1.13	Vals Finance Agent, Vectara HHEM Leaderboard
#11	gemini-3-pro-preview	Strong on SimpleQA Verified simpleqa_verified_score_pct and Vals Finance Agent overall_accuracy_pct	17.3%	25%	$4.50	SimpleQA Verified, Vals Finance Agent
#13	claude-sonnet-4.6	Strong on Vals Finance Agent overall_accuracy_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	16.5%	22%	$6.00	Vals Finance Agent, Vectara HHEM Leaderboard
#14	gemini-3.1-flash-lite-preview	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	16.5%	24%	$0.56	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#15	gpt-5.2-2025-12-11	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals Finance Agent overall_accuracy_pct	16.1%	19%	n/a	FACTS Benchmark Suite, Vals Finance Agent
#16	gpt-5.4-2026-03-05	Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals Finance Agent overall_accuracy_pct	15.3%	20%	n/a	Vectara HHEM Leaderboard, Vals Finance Agent
#17	gpt-4.1	Strong on LanguageBench overall:mean and LanguageBench Translation Official (Split) translation_to:bleu	14.2%	17%	$3.50	LanguageBench, LanguageBench Translation Official (Split)
#18	claude-opus-4-5-20251101	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	14.1%	21%	n/a	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#20	gpt-5.1-2025-11-13	Strong on Vals Finance Agent overall_accuracy_pct and Vals Finance Agent complex_retrieval_accuracy_pct	13.7%	22%	n/a	Vals Finance Agent
#22	grok-4-fast-reasoning	Strong on Vectara HHEM Leaderboard overall_answer_rate_pct and Vals Finance Agent overall_accuracy_pct	13.6%	27%	$0.28	Vectara HHEM Leaderboard, Vals Finance Agent
#23	gemini-2.0-flash-001	Strong on LanguageBench overall:mean and LanguageBench Translation Official (Split) translation_to:bleu	13.4%	18%	n/a	LanguageBench, LanguageBench Translation Official (Split)
#27	Qwen3-Embedding-4B	Strong on MTEB STS & Summarization Proxy Official sts_score_pct and MTEB Retrieval and Rerank (Official) retrieval_score_pct	13.0%	15%	n/a	MTEB STS & Summarization Proxy Official, MTEB Retrieval and Rerank (Official)
#30	o3-20250416	Strong on SciArena Leaderboard rating_elo and SimpleQA Verified simpleqa_verified_score_pct	12.6%	20%	$3.50	SciArena Leaderboard, SimpleQA Verified
#35	gpt-4.1-mini-20250414	Strong on OpenVLM MTVQA Official mtvqa_score_pct and Galileo Agent Leaderboard v2 Avg AC	12.0%	16%	n/a	OpenVLM MTVQA Official, Galileo Agent Leaderboard v2
#57	grok-4-1-fast-reasoning	Strong on Vals Finance Agent overall_accuracy_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	10.8%	17%	$0.28	Vals Finance Agent, Vectara HHEM Leaderboard
#64	phi-4	Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and LanguageBench overall:mean	10.4%	18%	n/a	Vectara HHEM Leaderboard, LanguageBench
#65	grok-3	Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	10.4%	15%	$6.00	Vectara HHEM Leaderboard
#67	claude-opus-4-6-thinking	Strong on Vals Finance Agent overall_accuracy_pct and Vals Finance Agent complex_retrieval_accuracy_pct	10.3%	12%	n/a	Vals Finance Agent
#88	kimi-k2.5-thinking	Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	9.7%	15%	n/a	Vals Finance Agent, Vals CorpFin v2
#93	claude-opus-4-5-20251101-thinking	Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	9.5%	12%	n/a	Vals Finance Agent, Vals CorpFin v2
#97	Llama-3.3-70B-Instruct	Strong on LanguageBench overall:mean and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct	9.3%	13%	n/a	LanguageBench, LanguageBench Grammar/Clarity Official (Split)
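
Because the top scores sit close together and confidence is low across the board, one way to explore the table is to filter by a confidence floor and an optional price cap before comparing scores. The following is a minimal sketch, not part of the leaderboard itself: the `Row` type, the `shortlist` and `score_gap` helpers, and the 30% threshold are all hypothetical choices made for illustration, and the data is limited to the first five rows of the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    model: str
    score: float            # Score column, in percent
    confidence: float       # Confidence column, in percent
    price: Optional[float]  # Price / 1M column in USD; None where the table lists no price


# First five rows copied from the table above (hypothetical in-memory representation).
ROWS = [
    Row("claude-sonnet-4", 25.5, 36.0, 6.00),
    Row("gemini-2.5-flash", 25.0, 36.0, 0.17),
    Row("gemini-3.1-pro-preview", 24.2, 28.0, 4.50),
    Row("gemini-2.5-pro", 22.7, 45.0, 3.44),
    Row("gpt-4.1-20250414", 21.0, 31.0, None),
]


def shortlist(rows, min_confidence, max_price=None):
    """Keep rows at or above a confidence floor (and under a price cap, if given), best score first."""
    kept = [
        r for r in rows
        if r.confidence >= min_confidence
        and (max_price is None or (r.price is not None and r.price <= max_price))
    ]
    return sorted(kept, key=lambda r: r.score, reverse=True)


def score_gap(rows):
    """Gap in percentage points between the best and second-best score."""
    ordered = sorted(rows, key=lambda r: r.score, reverse=True)
    return round(ordered[0].score - ordered[1].score, 1)


if __name__ == "__main__":
    for r in shortlist(ROWS, min_confidence=30.0):
        print(r.model, r.score, r.confidence, r.price)
    print("gap between #1 and #2 (points):", score_gap(ROWS))
```

With a 30% confidence floor, gemini-3.1-pro-preview (28% confidence) drops out of the shortlist even though it ranks third by score, and the gap between the first two models is only half a point, which is consistent with treating the ranking as directional.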

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality: Low

Vals CorpFin v2: 42 rows · 0.9% avg lift

Vals Tax Eval v2: 32 rows · 0.2% avg lift

Vals Finance Agent: 31 rows · 1.4% avg lift

Vals Legal Bench: 31 rows · 0.2% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.customer_support_dialogue, task.translate_general

Required modes

none

Domains

domain.customer_support

Related in CX

Agent-assist reply suggestions

Draft replies for human agents with tone and policy constraints.

Support dialogue agent

Multi-turn support conversations with escalation and policy awareness.

Support bot (RAG grounded)

Support chatbot grounded in docs with optional citations and escalation.

Customer feedback theme mining

Extract themes and trends from reviews, tickets, and surveys.