Supply Chain

Route alternatives planning

Propose alternative routes and tradeoffs under disruptions.

task.planning_task_breakdown, task.tradeoff_analysis

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong winner claim.

Provisional leader

claude-sonnet-4

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 24.2%

Confidence: 32.0%

All ranked models — top 3

🥇 claude-sonnet-4: 24.2%

🥈 gpt-5-2025-08-07: 22.0%

🥉 gemini-2.5-pro: 21.8%

Ranked Models

Evidence Quality

80%

Evidence Points

Top Signal

Galileo Agent Leaderboard v2: Avg TSQ

All Ranked Models

30 of 30 models

Rank	Model	Strong on	Score	Confidence	Price / 1M	Evidence sources
🥇	claude-sonnet-4	Galileo Agent Leaderboard v2 Avg TSQ and LanguageBench overall:mean	24.2%	32%	$6.00	Galileo Agent Leaderboard v2, LanguageBench
🥈	gpt-5-2025-08-07	SciArena Leaderboard rating_elo and Aider Polyglot Leaderboard percent_correct_pct	22.0%	30%	n/a	SciArena Leaderboard, Aider Polyglot Leaderboard
🥉	gemini-2.5-pro	Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC	21.8%	36%	$3.44	Galileo Agent Leaderboard v2, Galileo Agent Leaderboard v2
#4	gemini-2.5-flash	Galileo Agent Leaderboard v2 Avg TSQ and LanguageBench overall:mean	19.2%	26%	$0.17	Galileo Agent Leaderboard v2, LanguageBench
#5	Grok-4-0709	Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC	18.9%	26%	n/a	Galileo Agent Leaderboard v2, Galileo Agent Leaderboard v2
#6	gpt-5-mini-2025-08-07	SciArena Leaderboard rating_elo and Vals MedQA overall_accuracy_pct	18.1%	29%	n/a	SciArena Leaderboard, Vals MedQA
#7	gemini-3.1-pro-preview	Vals Mortgage Tax overall_accuracy_pct and Vals GPQA overall_accuracy_pct	17.3%	19%	$4.50	Vals Mortgage Tax, Vals GPQA
#8	gpt-4.1-20250414	Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC	17.3%	24%	n/a	Galileo Agent Leaderboard v2, Galileo Agent Leaderboard v2
#9	o3-20250416	SciArena Leaderboard rating_elo and Aider Polyglot Leaderboard percent_correct_pct	17.2%	22%	$3.50	SciArena Leaderboard, Aider Polyglot Leaderboard
#10	gemini-3-pro-preview	SciArena Leaderboard rating_elo and Vals Mortgage Tax overall_accuracy_pct	15.7%	21%	$4.50	SciArena Leaderboard, Vals Mortgage Tax
#11	claude-sonnet-4.6	HalluHard Leaderboard overall_hallucination_error_pct and Vals Tax Eval v2 overall_accuracy_pct	14.1%	19%	$6.00	HalluHard Leaderboard, Vals Tax Eval v2
#12	gemini-3-flash-preview	Vals Legal Bench overall_accuracy_pct and Vals MedQA overall_accuracy_pct	14.0%	18%	$1.13	Vals Legal Bench, Vals MedQA
#13	gpt-5.2-2025-12-11	FACTS Benchmark Suite facts_grounding_score_pct and Vals Tax Eval v2 overall_accuracy_pct	13.8%	16%	n/a	FACTS Benchmark Suite, Vals Tax Eval v2
#14	Claude-3.5-Sonnet	LanguageBench overall:mean and LLM-AggreFact Leaderboard average_score_pct	13.2%	16%	$6.00	LanguageBench, LLM-AggreFact Leaderboard
#15	gpt-5.4-2026-03-05	Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals MedQA overall_accuracy_pct	13.2%	15%	n/a	Vectara HHEM Leaderboard, Vals MedQA
#16	gpt-5.1-2025-11-13	Vals Case Law v2 overall_accuracy_pct and Vals MedScribe overall_accuracy_pct	12.9%	16%	n/a	Vals Case Law v2, Vals MedScribe
#17	qwen-2.5-72b-instruct	Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC	12.7%	21%	n/a	Galileo Agent Leaderboard v2, Galileo Agent Leaderboard v2
#18	o4-mini	SciArena Leaderboard rating_elo and Aider Polyglot Leaderboard percent_correct_pct	12.3%	20%	$1.93	SciArena Leaderboard, Aider Polyglot Leaderboard
#19	deepseek-r1	ContractEval Leaderboard contract_adherence_csr_pct and SciArena Leaderboard rating_elo	12.2%	23%	$0.27	ContractEval Leaderboard, SciArena Leaderboard
#20	claude-opus-4-6-thinking	Vals SWE-bench overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct	12.1%	13%	n/a	Vals SWE-bench, Vals Mortgage Tax
#21	gpt-4.1-mini-20250414	Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC	11.9%	16%	n/a	Galileo Agent Leaderboard v2, Galileo Agent Leaderboard v2
#22	claude-opus-4-5-20251101-thinking	Vals MedQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct	11.7%	13%	n/a	Vals MedQA, Vals Mortgage Tax
#23	claude-opus-4-5-20251101	Vals Mortgage Tax overall_accuracy_pct and Vals MedQA overall_accuracy_pct	11.7%	15%	n/a	Vals Mortgage Tax, Vals MedQA
#24	gemini-3.1-flash-lite-preview	Vals Mortgage Tax overall_accuracy_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	11.7%	17%	$0.56	Vals Mortgage Tax, Vectara HHEM Leaderboard
#25	kimi-k2.5-thinking	Vals CorpFin v2 overall_accuracy_pct and Vals MedQA overall_accuracy_pct	11.6%	19%	n/a	Vals CorpFin v2, Vals MedQA
#26	gpt-4o	MEGA-Bench app:Planning and MEGA-Bench overall_score	11.1%	15%	$0.26	MEGA-Bench, MEGA-Bench
#27	gpt-4.1	LanguageBench overall:mean and SciArena Leaderboard rating_elo	11.1%	15%	$3.50	LanguageBench, SciArena Leaderboard
#28	grok-4-fast-reasoning	Vals CorpFin v2 overall_accuracy_pct and Vals Tax Eval v2 overall_accuracy_pct	10.7%	18%	$0.28	Vals CorpFin v2, Vals Tax Eval v2
#29	claude-sonnet-4-5-20250929-thinking	Vals MedQA overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct	10.6%	13%	n/a	Vals MedQA, Vals Legal Bench
#32	grok-4-1-fast-reasoning	Vals CorpFin v2 overall_accuracy_pct and Vals MedQA overall_accuracy_pct	9.6%	15%	$0.28	Vals CorpFin v2, Vals MedQA

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals Legal Bench

43 rows · 0.6% avg lift

Vals Tax Eval v2

43 rows · 0.6% avg lift

Vals GPQA

43 rows · 0.5% avg lift

Vals LiveCodeBench

42 rows · 0.6% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.planning_task_breakdown, task.tradeoff_analysis

Required modes

none

Domains

domain.supply_chain_logistics

Related in Supply Chain

Disruption monitoring brief

Summarize disruptions into risk, options, and recommendations.

Supplier risk monitoring

Track supplier risk signals from multi-source text and summarize actions.

Vendor contract summary (procurement)

Summarize vendor contracts into key terms, risks, and deviations.

Tail spend categorization

Categorize tail spend purchases into taxonomy buckets for sourcing.