Grading and feedback assistant
Provide rubric-tagged feedback drafts for educator review.
Provisional leader
gpt-4.1-20250414
Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.
20.8%
Best benchmark score
33.3%
Confidence
All ranked models: top 3
Ranked Models: 30
Evidence Quality: 80%
Evidence Points: 23
Top Signal: OpenVLM TextVQA Official: textvqa_score_pct
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| 🥇 | gpt-4.1-20250414 | OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRBench Official ocrbench_score_pct | 20.8% |
| #6 | claude-sonnet-4 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 17.7% |
| #8 | gpt-4.1-mini-20250414 | OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM TextVQA Official textvqa_score_pct | 17.2% |
| #9 | Claude-3.5-Sonnet | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and OpenVLM OCRBench Official ocrbench_score_pct | 17.1% |
| #11 | gemini-2.5-flash | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 16.3% |
| #12 | gpt-5-2025-08-07 | OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct | 16.3% |
| #18 | gpt-4o | OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct | 14.5% |
| #28 | gemini-2.5-pro | OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct | 13.6% |
| #29 | gemini-2.0-flash-001 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 13.4% |
| #31 | gpt-5-mini-2025-08-07 | OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct | 13.3% |
| #33 | gpt-4.1 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 12.9% |
| #42 | gemini-3.1-pro-preview | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 11.9% |
| #58 | gemini-3-pro-preview | Humanity's Last Exam Leaderboard hle_accuracy_pct and Vals GPQA overall_accuracy_pct | 10.7% |
| #65 | deepseek-r1 | SYCON Bench (Table 2) sycon_unethical_tof_pct and DuckDB NSQL Leaderboard all_execution_accuracy | 10.4% |
| #67 | Qwen-VL-Chat | OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRVQA Education & Teaching Official ocrvqa_education_teaching_score_pct | 10.2% |
| #69 | Llama-3.1-70B-Instruct | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 10.1% |
| #70 | qwen-2.5-72b-instruct | Multilingual MMLU Benchmark mmlu and DuckDB NSQL Leaderboard all_execution_accuracy | 10.1% |
| #72 | gpt-5.2-2025-12-11 | Vals GPQA overall_accuracy_pct and Humanity's Last Exam Leaderboard hle_accuracy_pct | 9.9% |
| #76 | o3-20250416 | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 9.6% |
| #81 | kimi-k2.5-thinking | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 9.4% |
| #89 | Grok-4-0709 | Vals GPQA overall_accuracy_pct and Galileo Agent Leaderboard v2 Avg TSQ | 9.1% |
| #90 | Llama-3.3-70B-Instruct | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 9.1% |
| #106 | GPT-4.1-nano-2025-04-14 | OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct | 8.0% |
| #114 | o4-mini | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 7.8% |
| #135 | Kimi K2 Thinking | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 7.0% |
| #139 | phi-4 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and DuckDB NSQL Leaderboard all_execution_accuracy | 6.8% |
| #141 | gemini-3.1-flash-lite-preview | Vals GPQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct | 6.7% |
| #157 | gpt-4o-mini-2024-07-18 | DuckDB NSQL Leaderboard all_execution_accuracy and LLM Trustworthy Leaderboard privacy | 4.9% |
| #160 | Meta-Llama-3-8B-Instruct | Multilingual MMLU Benchmark mmlu and LLM Trustworthy Leaderboard fairness | 4.3% |
| #163 | Phi-4-multimodal-instruct | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench mmlu:accuracy | 3.0% |
Ranking diagnostics & missing models
Source lift
Ranked: 31
Sources: 8
Quality: Low
Vals GPQA
Vals Mortgage Tax
Vals MedQA
Vals Legal Bench
Missing frontier models
claude-opus-4-5-20251101
Thin evidence after weighting (Rank #10)
18.6%
claude-sonnet-4.6
Thin evidence after weighting (Rank #11)
20.0%
grok-4-1-fast-reasoning
Thin evidence after weighting (Rank #12)
19.5%
grok-4-1-fast-non-reasoning
Thin evidence after weighting (Rank #15)
14.9%
Taxonomy & task details
Core tasks
Required modes
Domains
Related in Education
Language conversation partner
Conversational practice with gentle corrections and explanations.
Grammar and writing coach
Correct grammar and explain fixes at the learner's level.
Exercise generator
Generate practice problems with solutions and hints by difficulty level.
Lesson plan generator
Generate lesson plans with objectives, activities, and assessments.