Lesson plan generator
Generate lesson plans with objectives, activities, and assessments.
Provisional leader
gpt-4.1-20250414
Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.
Best benchmark score
22.1%
Confidence
35.3%
All ranked models: top 3
Ranked Models
30
Evidence Quality
80%
Evidence Points
23
Top Signal
OpenVLM TextVQA Official: textvqa_score_pct
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| #1 | gpt-4.1-20250414 | OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRBench Official ocrbench_score_pct | 22.1% |
| #5 | claude-sonnet-4 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 18.8% |
| #7 | gpt-4.1-mini-20250414 | OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM TextVQA Official textvqa_score_pct | 18.3% |
| #9 | gemini-2.5-flash | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 17.3% |
| #10 | gpt-5-2025-08-07 | OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct | 17.3% |
| #12 | Claude-3.5-Sonnet | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and OpenVLM OCRBench Official ocrbench_score_pct | 17.0% |
| #27 | gemini-2.5-pro | OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct | 14.4% |
| #29 | gpt-5-mini-2025-08-07 | OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct | 14.1% |
| #34 | gemini-2.0-flash-001 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 13.6% |
| #40 | gemini-3.1-pro-preview | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 12.7% |
| #45 | gpt-4.1 | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 12.4% |
| #57 | gemini-3-pro-preview | Humanity's Last Exam Leaderboard hle_accuracy_pct and Vals GPQA overall_accuracy_pct | 11.4% |
| #65 | Qwen-VL-Chat | OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRVQA Education & Teaching Official ocrvqa_education_teaching_score_pct | 10.8% |
| #68 | gpt-5.2-2025-12-11 | Vals GPQA overall_accuracy_pct and Humanity's Last Exam Leaderboard hle_accuracy_pct | 10.5% |
| #71 | Llama-3.1-70B-Instruct | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 10.2% |
| #72 | o3-20250416 | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 10.2% |
| #73 | gpt-4o | OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct | 10.1% |
| #77 | kimi-k2.5-thinking | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 10.0% |
| #82 | deepseek-r1 | SYCON Bench (Table 2) sycon_unethical_tof_pct and LanguageBench Translation Official (Split) translation_to:bleu | 9.8% |
| #85 | Grok-4-0709 | Vals GPQA overall_accuracy_pct and Galileo Agent Leaderboard v2 Avg TSQ | 9.7% |
| #103 | Llama-3.3-70B-Instruct | LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu | 8.5% |
| #104 | GPT-4.1-nano-2025-04-14 | OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct | 8.5% |
| #110 | claude-opus-4-5-20251101-thinking | Vals GPQA overall_accuracy_pct and Humanity's Last Exam Leaderboard hle_accuracy_pct | 8.3% |
| #113 | o4-mini | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 8.3% |
| #118 | gemini-3-flash-preview | Vals GPQA overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct | 8.0% |
| #135 | Kimi K2 Thinking | MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct | 7.4% |
| #138 | grok-4-1-fast-reasoning | Vals GPQA overall_accuracy_pct and BFCL Multi-turn Official Multi Turn Acc | 7.1% |
| #140 | gemini-3.1-flash-lite-preview | Vals GPQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct | 7.1% |
| #142 | claude-opus-4-5-20251101 | Vals GPQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct | 7.1% |
| #144 | claude-sonnet-4-5-20250929-thinking | Vals GPQA overall_accuracy_pct and Humanity's Last Exam Leaderboard hle_accuracy_pct | 6.9% |
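Because the ranking is described as directional, it can help to look at how far each model sits behind the leader rather than at rank alone. The following is a minimal sketch, not part of the leaderboard itself: the three scores are copied from the first rows of the table above, and the helper name `gaps_to_leader` is hypothetical.

```python
# Scores (percent) copied from the first three rows of the table above.
scores = {
    "gpt-4.1-20250414": 22.1,
    "claude-sonnet-4": 18.8,
    "gpt-4.1-mini-20250414": 18.3,
}


def gaps_to_leader(scores):
    """Return the top-scoring model and each other model's gap to it.

    Hypothetical helper for illustration; gaps are in percentage points,
    rounded to one decimal to match the precision of the table.
    """
    leader = max(scores, key=scores.get)
    top = scores[leader]
    gaps = {
        model: round(top - score, 1)
        for model, score in scores.items()
        if model != leader
    }
    return leader, gaps


leader, gaps = gaps_to_leader(scores)
print(leader)
for model, gap in gaps.items():
    print(f"{model}: {gap} points behind")
```

Whether a gap of a few points is meaningful depends on the evidence behind each score, which this sketch does not model.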
Ranking diagnostics & missing models
Source lift
Ranked
38
Sources
8
Quality
Low
Vals GPQA
Vals Mortgage Tax
Vals MedQA
Vals Legal Bench
Missing frontier models
claude-sonnet-4.6
Thin evidence after weighting
Rank #11
20.0%
grok-4-1-fast-non-reasoning
Thin evidence after weighting
Rank #15
14.9%
Taxonomy & task details
Core tasks
Required modes
Domains
Related in Education
Language conversation partner
Conversational practice with gentle corrections and explanations.
Grammar and writing coach
Correct grammar and explain fixes at the learner's level.
Grading and feedback assistant
Provide rubric-tagged feedback drafts for educator review.
Exercise generator
Generate practice problems with solutions and hints by difficulty level.