Exercise generator
Generate practice problems with solutions and hints by difficulty level.
Provisional leader
gpt-4.1-20250414
Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.
Best benchmark score: 19.5%
Confidence: 30.1%
All ranked models - top 3
Ranked Models: 28
Evidence Quality: 80%
Evidence Points: 22
Top Signal: OpenVLM TextVQA Official: textvqa_score_pct
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| 🥇 | gpt-4.1-20250414 | OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRBench Official ocrbench_score_pct | 19.5% |
| 🥈 | gpt-5-2025-08-07 | OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct | 17.3% |
| #6 | claude-sonnet-4 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 16.8% |
| #8 | gpt-4.1-mini-20250414 | OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM TextVQA Official textvqa_score_pct | 16.3% |
| #9 | Claude-3.5-Sonnet | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and OpenVLM OCRBench Official ocrbench_score_pct | 16.2% |
| #12 | gemini-2.5-flash | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 15.1% |
| #27 | gemini-2.5-pro | OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct | 12.9% |
| #28 | gemini-2.0-flash-001 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 12.7% |
| #30 | gpt-5-mini-2025-08-07 | OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct | 12.6% |
| #32 | gpt-4o | OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct | 12.2% |
| #33 | gpt-4.1 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 12.2% |
| #42 | gemini-3.1-pro-preview | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 11.3% |
| #64 | deepseek-r1 | SYCON Bench (Table 2) sycon_unethical_tof_pct and DuckDB NSQL Leaderboard all_execution_accuracy | 9.8% |
| #66 | Qwen-VL-Chat | OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRVQA Education & Teaching Official ocrvqa_education_teaching_score_pct | 9.7% |
| #68 | Llama-3.1-70B-Instruct | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 9.6% |
| #77 | gemini-3-pro-preview | Humanity's Last Exam Leaderboard hle_accuracy_pct and Vals GPQA overall_accuracy_pct | 9.0% |
| #78 | kimi-k2.5-thinking | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 8.9% |
| #82 | o3-20250416 | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 8.8% |
| #87 | Llama-3.3-70B-Instruct | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 8.6% |
| #90 | gpt-5.2-2025-12-11 | Vals GPQA overall_accuracy_pct and Humanity's Last Exam Leaderboard hle_accuracy_pct | 8.6% |
| #92 | qwen-2.5-72b-instruct | Multilingual MMLU Benchmark mmlu and DuckDB NSQL Leaderboard all_execution_accuracy | 8.4% |
| #98 | Grok-4-0709 | Vals GPQA overall_accuracy_pct and Galileo Agent Leaderboard v2 Avg TSQ | 8.0% |
| #103 | GPT-4.1-nano-2025-04-14 | OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct | 7.6% |
| #117 | o4-mini | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 7.1% |
| #133 | phi-4 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and DuckDB NSQL Leaderboard all_execution_accuracy | 6.4% |
| #134 | gemini-3.1-flash-lite-preview | Vals GPQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct | 6.3% |
| #150 | Meta-Llama-3-8B-Instruct | Multilingual MMLU Benchmark mmlu and LLM Trustworthy Leaderboard fairness | 4.0% |
| #155 | Qwen3-30B-A3B | DuckDB NSQL Leaderboard all_execution_accuracy and EQ-Bench Leaderboard eq_bench_score | 1.9% |
Ranking diagnostics & missing models
Source lift
Ranked: 28
Sources: 8
Quality: Low
Vals GPQA
Vals Mortgage Tax
Vals MedQA
EQ-Bench Leaderboard
Missing frontier models
claude-opus-4-5-20251101
Thin evidence after weighting, Rank #10
18.6%
claude-sonnet-4.6
Thin evidence after weighting, Rank #11
20.0%
grok-4-1-fast-reasoning
Thin evidence after weighting, Rank #12
19.5%
grok-4-1-fast-non-reasoning
Thin evidence after weighting, Rank #15
14.9%
Taxonomy & task details
Core tasks
Required modes
Domains
Related in Education
Language conversation partner
Conversational practice with gentle corrections and explanations.
Grammar and writing coach
Correct grammar and explain fixes at the learner's level.
Grading and feedback assistant
Provide rubric-tagged feedback drafts for educator review.
Lesson plan generator
Generate lesson plans with objectives, activities, and assessments.