Campaign brief
Draft a campaign brief with positioning, audience, and channels.
Provisional leader
claude-sonnet-4
Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.
Best benchmark score: 34.3%
Confidence: 44.7%
All ranked models - top 3
Ranked Models
30
Evidence Quality
82%
Evidence Points
23
Top Signal
Galileo Agent Leaderboard v2: Avg TSQ
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| 🥇 | claude-sonnet-4 | Galileo Agent Leaderboard v2 Avg TSQ and EQ-Bench Leaderboard eq_bench_score | 34.3% |
| 🥈 | gemini-2.5-pro | EQ-Bench Leaderboard eq_bench_score and Galileo Agent Leaderboard v2 Avg TSQ | 33.2% |
| 🥉 | Grok-4-0709 | Galileo Agent Leaderboard v2 Avg TSQ and EQ-Bench Leaderboard eq_bench_score | 31.4% |
| #4 | gpt-5-2025-08-07 | EQ-Bench Leaderboard eq_bench_score and UGI Leaderboard Writing ✍️ | 31.0% |
| #5 | o3-20250416 | EQ-Bench Leaderboard eq_bench_score and UGI Leaderboard Writing ✍️ | 26.1% |
| #6 | gemini-3.1-pro-preview | UGI Leaderboard Writing ✍️ and Vals Mortgage Tax overall_accuracy_pct | 24.7% |
| #7 | gpt-4.1-20250414 | Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC | 24.7% |
| #8 | gpt-4o | CRMArena Function Calling overall_score_pct and EQ-Bench Leaderboard eq_bench_score | 20.8% |
| #9 | gemini-3-flash-preview | UGI Leaderboard Writing ✍️ and Vals Legal Bench overall_accuracy_pct | 20.6% |
| #10 | gemini-3-pro-preview | UGI Leaderboard Writing ✍️ and Vals Mortgage Tax overall_accuracy_pct | 19.9% |
| #11 | o4-mini | EQ-Bench Leaderboard eq_bench_score and UGI Leaderboard Writing ✍️ | 19.7% |
| #12 | gpt-5.4-2026-03-05 | UGI Leaderboard Writing ✍️ and Vectara HHEM Leaderboard overall_hallucination_error_pct | 19.7% |
| #13 | gpt-5.2-2025-12-11 | UGI Leaderboard Writing ✍️ and FACTS Benchmark Suite facts_grounding_score_pct | 19.4% |
| #14 | gpt-5-mini-2025-08-07 | Vals MedQA overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct | 19.2% |
| #15 | gemini-2.5-flash | Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC | 19.1% |
| #16 | claude-sonnet-4.6 | UGI Leaderboard Writing ✍️ and Vals Tax Eval v2 overall_accuracy_pct | 18.8% |
| #17 | Kimi-K2-Instruct | Galileo Agent Leaderboard v2 Avg TSQ and EQ-Bench Leaderboard eq_bench_score | 18.6% |
| #18 | gpt-5.1-2025-11-13 | UGI Leaderboard Writing ✍️ and Vals Case Law v2 overall_accuracy_pct | 18.6% |
| #19 | claude-opus-4 | EQ-Bench Leaderboard eq_bench_score and UGI Leaderboard Writing ✍️ | 18.0% |
| #20 | claude-opus-4-5-20251101 | UGI Leaderboard Writing ✍️ and Vals Mortgage Tax overall_accuracy_pct | 17.9% |
| #21 | qwen-2.5-72b-instruct | EQ-Bench Leaderboard eq_bench_score and Galileo Agent Leaderboard v2 Avg TSQ | 17.8% |
| #22 | grok-3 | EQ-Bench Leaderboard eq_bench_score and UGI Leaderboard Writing ✍️ | 17.0% |
| #23 | kimi-k2.5-thinking | UGI Leaderboard Writing ✍️ and Vals CorpFin v2 overall_accuracy_pct | 16.7% |
| #24 | Claude-3.5-Sonnet | EQ-Bench Leaderboard eq_bench_score and CRMArena Function Calling overall_score_pct | 16.4% |
| #25 | grok-4-fast-reasoning | UGI Leaderboard Writing ✍️ and Vals CorpFin v2 overall_accuracy_pct | 15.8% |
| #26 | deepseek-r1 | EQ-Bench Leaderboard eq_bench_score and DuckDB NSQL Leaderboard all_execution_accuracy | 15.0% |
| #27 | claude-opus-4-6-thinking | Vals SWE-bench overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct | 14.9% |
| #29 | gpt-4.1-mini-20250414 | Galileo Agent Leaderboard v2 Avg TSQ and Galileo Agent Leaderboard v2 Avg AC | 14.8% |
| #30 | claude-opus-4-5-20251101-thinking | Vals MedQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct | 14.5% |
| #31 | gemini-3.1-flash-lite-preview | Vals Mortgage Tax overall_accuracy_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct | 14.5% |
Ranking diagnostics & missing models
Source lift
Ranked
71
Sources
8
Quality
Low
Vals Legal Bench
Vals Tax Eval v2
Vals LiveCodeBench
Vals GPQA
Missing frontier models
No obvious gaps right now.
Taxonomy & task details
Core tasks
Required modes
Domains
Related in Marketing
Social listening brief
Summarize social chatter into themes, risks, and recommendations.
Product positioning and messaging
Develop positioning, value props, and message pillars with tradeoffs.
Social post generation
Generate short channel-specific social posts and variations.
Landing page copy
Draft landing pages with clear positioning and structure.