Ticket thread summary

Summarize long ticket threads into status, attempts, and next steps.

task.summarize_ticket_thread, task.timeline_extraction

Evidence quality is currently limited for this use case. Treat the rankings below as exploratory: they do not support a strong claim about which model wins.

Provisional leader

gemini-3.1-pro-preview

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 22.6%

Confidence: 25.9%

All ranked models: top 3

🥇 gemini-3.1-pro-preview: 22.6%

🥈 gemini-2.5-pro: 21.8%

🥉 gpt-4.1-20250414: 19.8%

Ranked Models

Evidence Quality

80%

Evidence Points

Top Signal: SimpleQA Verified (simpleqa_verified_score_pct)

All Ranked Models

30 of 30 models

Rank	Model	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-3.1-pro-preview: Strong on SimpleQA Verified simpleqa_verified_score_pct and Vals Finance Agent overall_accuracy_pct	22.6%	26%	$4.50	SimpleQA Verified, Vals Finance Agent
🥈	gemini-2.5-pro: Strong on Galileo Agent Leaderboard v2 Avg AC and FACTS Benchmark Suite facts_grounding_score_pct	21.8%	34%	$3.44	Galileo Agent Leaderboard v2, FACTS Benchmark Suite
🥉	gpt-4.1-20250414: Strong on Galileo Agent Leaderboard v2 Avg AC and Vectara HHEM Leaderboard overall_hallucination_error_pct	19.8%	29%	—	Galileo Agent Leaderboard v2, Vectara HHEM Leaderboard
#4	gpt-5-2025-08-07: Strong on SciArena Leaderboard rating_elo and FACTS Benchmark Suite facts_grounding_score_pct	19.5%	26%	—	SciArena Leaderboard, FACTS Benchmark Suite
#5	Grok-4-0709: Strong on Galileo Agent Leaderboard v2 Avg AC and Vals Finance Agent overall_accuracy_pct	19.3%	30%	—	Galileo Agent Leaderboard v2, Vals Finance Agent
#6	gpt-5-mini-2025-08-07: Strong on Vals Finance Agent overall_accuracy_pct and SciArena Leaderboard rating_elo	18.8%	29%	—	Vals Finance Agent, SciArena Leaderboard
#7	gemini-3-pro-preview: Strong on SimpleQA Verified simpleqa_verified_score_pct and Vals Finance Agent overall_accuracy_pct	18.2%	26%	$4.50	SimpleQA Verified, Vals Finance Agent
#8	claude-sonnet-4: Strong on Galileo Agent Leaderboard v2 Avg AC and Vectara HHEM Leaderboard overall_hallucination_error_pct	18.2%	27%	$6.00	Galileo Agent Leaderboard v2, Vectara HHEM Leaderboard
#9	gemini-2.5-flash: Strong on Galileo Agent Leaderboard v2 Avg AC and Vectara HHEM Leaderboard overall_hallucination_error_pct	16.1%	29%	$0.17	Galileo Agent Leaderboard v2, Vectara HHEM Leaderboard
#10	gpt-5.2-2025-12-11: Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals Finance Agent overall_accuracy_pct	16.0%	20%	—	FACTS Benchmark Suite, Vals Finance Agent
#11	gemini-3-flash-preview: Strong on Vectara HHEM Leaderboard overall_answer_rate_pct and Vals Finance Agent overall_accuracy_pct	15.6%	23%	$1.13	Vectara HHEM Leaderboard, Vals Finance Agent
#12	claude-sonnet-4.6: Strong on Vals Finance Agent overall_accuracy_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	15.5%	20%	$6.00	Vals Finance Agent, Vectara HHEM Leaderboard
#13	gemini-3.1-flash-lite-preview: Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and FACTS Benchmark Suite facts_grounding_score_pct	14.7%	22%	$0.56	Vectara HHEM Leaderboard, FACTS Benchmark Suite
#14	gpt-5.4-2026-03-05: Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals Finance Agent overall_accuracy_pct	14.7%	19%	—	Vectara HHEM Leaderboard, Vals Finance Agent
#15	gpt-5.1-2025-11-13: Strong on Vals Finance Agent overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	13.6%	21%	—	Vals Finance Agent, Vals Case Law v2
#16	o3-20250416: Strong on SciArena Leaderboard rating_elo and SimpleQA Verified simpleqa_verified_score_pct	13.5%	21%	$3.50	SciArena Leaderboard, SimpleQA Verified
#17	claude-opus-4-5-20251101: Strong on Vectara HHEM Leaderboard overall_answer_rate_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	13.0%	20%	—	Vectara HHEM Leaderboard, Vectara HHEM Leaderboard
#18	grok-4-fast-reasoning: Strong on Vectara HHEM Leaderboard overall_answer_rate_pct and Vals Finance Agent overall_accuracy_pct	12.4%	25%	$0.28	Vectara HHEM Leaderboard, Vals Finance Agent
#19	qwen-2.5-72b-instruct: Strong on Galileo Agent Leaderboard v2 Avg AC and DuckDB NSQL Leaderboard all_execution_accuracy	12.3%	18%	—	Galileo Agent Leaderboard v2, DuckDB NSQL Leaderboard
#20	gpt-4o: Strong on CRMArena Function Calling overall_score_pct and JSONSchemaBench Leaderboard medium_schema_compliance_pct	11.9%	16%	$0.26	CRMArena Function Calling, JSONSchemaBench Leaderboard
#24	grok-4-1-fast-reasoning: Strong on Vals Finance Agent overall_accuracy_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	11.2%	18%	$0.28	Vals Finance Agent, Vectara HHEM Leaderboard
#25	Qwen3-Embedding-4B: Strong on MTEB Retrieval and Rerank (Official) retrieval_score_pct and MTEB Classification Official classification_score_pct	11.2%	13%	—	MTEB Retrieval and Rerank (Official), MTEB Classification Official
#33	Claude-3.5-Sonnet: Strong on CRMArena Function Calling overall_score_pct and DuckDB NSQL Leaderboard all_execution_accuracy	10.5%	15%	$6.00	CRMArena Function Calling, DuckDB NSQL Leaderboard
#36	claude-opus-4-6-thinking: Strong on Vals Finance Agent overall_accuracy_pct and Vals Finance Agent complex_retrieval_accuracy_pct	10.3%	12%	—	Vals Finance Agent, Vals Finance Agent
#43	gemini-2.5-flash-lite: Strong on Galileo Agent Leaderboard v2 Avg AC and Vectara HHEM Leaderboard overall_hallucination_error_pct	9.7%	15%	$0.17	Galileo Agent Leaderboard v2, Vectara HHEM Leaderboard
#48	claude-opus-4-5-20251101-thinking: Strong on Vals Finance Agent overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	9.6%	12%	—	Vals Finance Agent, Vals Case Law v2
#55	deepseek-v3: Strong on Galileo Agent Leaderboard v2 Avg AC and Vectara HHEM Leaderboard overall_hallucination_error_pct	9.3%	16%	—	Galileo Agent Leaderboard v2, Vectara HHEM Leaderboard
#57	kimi-k2.5-thinking: Strong on Vals Finance Agent overall_accuracy_pct and Vals Finance Agent complex_retrieval_accuracy_pct	9.3%	15%	—	Vals Finance Agent, Vals Finance Agent
#59	grok-3: Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	9.2%	14%	$6.00	Vectara HHEM Leaderboard, Vectara HHEM Leaderboard
#60	claude-opus-4-1-20250805: Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	9.2%	18%	—	Vectara HHEM Leaderboard, Vectara HHEM Leaderboard

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals Legal Bench: 36 rows · 0.3% avg lift

Vals Tax Eval v2: 35 rows · 0.3% avg lift

Vals MedQA: 34 rows · 0.3% avg lift

Vals LiveCodeBench: 34 rows · 0.3% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.summarize_ticket_thread, task.timeline_extraction

Required modes

mode.long_context

Domains

domain.customer_support

Related in CX

Agent-assist reply suggestions

Draft replies for human agents with tone and policy constraints.

Support dialogue agent

Multi-turn support conversations with escalation and policy awareness.

Support bot (RAG grounded)

Support chatbot grounded in docs with optional citations and escalation.

Customer feedback theme mining

Extract themes and trends from reviews, tickets, and surveys.