Education

Lesson plan generator

Generate lesson plans with objectives, activities, and assessments.

task.outline_generation, task.tutoring_socratic

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong claim about which model is best.

Provisional leader

gpt-4.1-20250414

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

22.1%

Best benchmark score

35.3%

Confidence

All ranked models — top 3

🥇

gpt-4.1-20250414

22.1%

🥈

claude-sonnet-4

18.8%

🥉

gpt-4.1-mini-20250414

18.3%

Ranked Models

Evidence Quality

80%

Evidence Points

Top Signal

OpenVLM TextVQA Official: textvqa_score_pct

All Ranked Models

30 of 30 models

Rank	Model	Score	Confidence	Price / 1M	Evidence sources
🥇	gpt-4.1-20250414 Strong on OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRBench Official ocrbench_score_pct	22.1%	35%	—	OpenVLM TextVQA Official; OpenVLM OCRBench Official
#5	claude-sonnet-4 Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	18.8%	27%	$6.00	LanguageBench Grammar/Clarity Official (Split); LanguageBench Translation Official (Split)
#7	gpt-4.1-mini-20250414 Strong on OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM TextVQA Official textvqa_score_pct	18.3%	28%	—	OpenVLM OCRBench Official; OpenVLM TextVQA Official
#9	gemini-2.5-flash Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	17.3%	24%	$0.17	LanguageBench Grammar/Clarity Official (Split); LanguageBench Translation Official (Split)
#10	gpt-5-2025-08-07 Strong on OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct	17.3%	23%	—	OpenVLM OCRBench Official; MathArena Models
#12	Claude-3.5-Sonnet Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and OpenVLM OCRBench Official ocrbench_score_pct	17.0%	28%	$6.00	LanguageBench Grammar/Clarity Official (Split); OpenVLM OCRBench Official
#27	gemini-2.5-pro Strong on OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct	14.4%	30%	$3.44	OpenVLM OCRBench Official; MathArena Models
#29	gpt-5-mini-2025-08-07 Strong on OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct	14.1%	20%	—	OpenVLM OCRBench Official; MathArena Models
#34	gemini-2.0-flash-001 Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	13.6%	17%	—	LanguageBench Grammar/Clarity Official (Split); LanguageBench Translation Official (Split)
#40	gemini-3.1-pro-preview Strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	12.7%	14%	$4.50	MathArena Models; Vals GPQA
#45	gpt-4.1 Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	12.4%	18%	$3.50	LanguageBench Grammar/Clarity Official (Split); LanguageBench Translation Official (Split)
#57	gemini-3-pro-preview Strong on Humanity's Last Exam Leaderboard hle_accuracy_pct and Vals GPQA overall_accuracy_pct	11.4%	14%	$4.50	Humanity's Last Exam Leaderboard; Vals GPQA
#65	Qwen-VL-Chat Strong on OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRVQA Education & Teaching Official ocrvqa_education_teaching_score_pct	10.8%	20%	—	OpenVLM TextVQA Official; OpenVLM OCRVQA Education & Teaching Official
#68	gpt-5.2-2025-12-11 Strong on Vals GPQA overall_accuracy_pct and Humanity's Last Exam Leaderboard hle_accuracy_pct	10.5%	13%	—	Vals GPQA; Humanity's Last Exam Leaderboard
#71	Llama-3.1-70B-Instruct Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	10.2%	18%	—	LanguageBench Grammar/Clarity Official (Split); LanguageBench Translation Official (Split)
#72	o3-20250416 Strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	10.2%	15%	$3.50	MathArena Models; Vals GPQA
#73	gpt-4o Strong on OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct	10.1%	18%	$0.26	OpenVLM OCRBench Official; OpenVLM MTVQA Official
#77	kimi-k2.5-thinking Strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	10.0%	15%	—	MathArena Models; Vals GPQA
#82	deepseek-r1 Strong on SYCON Bench (Table 2) sycon_unethical_tof_pct and LanguageBench Translation Official (Split) translation_to:bleu	9.8%	22%	$0.27	SYCON Bench (Table 2); LanguageBench Translation Official (Split)
#85	Grok-4-0709 Strong on Vals GPQA overall_accuracy_pct and Galileo Agent Leaderboard v2 Avg TSQ	9.7%	14%	—	Vals GPQA; Galileo Agent Leaderboard v2
#103	Llama-3.3-70B-Instruct Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	8.5%	12%	—	LanguageBench Grammar/Clarity Official (Split); LanguageBench Translation Official (Split)
#104	GPT-4.1-nano-2025-04-14 Strong on OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct	8.5%	15%	—	OpenVLM OCRBench Official; OpenVLM MTVQA Official
#110	claude-opus-4-5-20251101-thinking Strong on Vals GPQA overall_accuracy_pct and Humanity's Last Exam Leaderboard hle_accuracy_pct	8.3%	11%	—	Vals GPQA; Humanity's Last Exam Leaderboard
#113	o4-mini Strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	8.3%	14%	$1.93	MathArena Models; Vals GPQA
#118	gemini-3-flash-preview Strong on Vals GPQA overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct	8.0%	10%	$1.13	Vals GPQA; Vals Legal Bench
#135	Kimi K2 Thinking Strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	7.4%	11%	$1.07	MathArena Models; Vals GPQA
#138	grok-4-1-fast-reasoning Strong on Vals GPQA overall_accuracy_pct and BFCL Multi-turn Official Multi Turn Acc	7.1%	11%	$0.28	Vals GPQA; BFCL Multi-turn Official
#140	gemini-3.1-flash-lite-preview Strong on Vals GPQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct	7.1%	12%	$0.56	Vals GPQA; Vals Mortgage Tax
#142	claude-opus-4-5-20251101 Strong on Vals GPQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct	7.1%	10%	—	Vals GPQA; Vals Mortgage Tax
#144	claude-sonnet-4-5-20250929-thinking Strong on Vals GPQA overall_accuracy_pct and Humanity's Last Exam Leaderboard hle_accuracy_pct	6.9%	11%	—	Vals GPQA; Humanity's Last Exam Leaderboard

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals GPQA

23 rows · 1.3% avg lift

Vals Mortgage Tax

21 rows · 0.3% avg lift

Vals MedQA

21 rows · 0.4% avg lift

Vals Legal Bench

21 rows · 0.3% avg lift

Missing frontier models

claude-sonnet-4.6

Thin evidence after weighting

Rank #11

20.0%

grok-4-1-fast-non-reasoning

Thin evidence after weighting

Rank #15

14.9%

Taxonomy & task details

Core tasks

task.outline_generation, task.tutoring_socratic

Required modes

none

Domains

domain.education_tutoring

Related in Education

Language conversation partner

Conversational practice with gentle corrections and explanations.

Grammar and writing coach

Correct grammar and explain fixes at the learner's level.

Grading and feedback assistant

Provide rubric-tagged feedback drafts for educator review.

Exercise generator

Generate practice problems with solutions and hints by difficulty level.