Education

Exercise generator

Generate practice problems with solutions and hints by difficulty level.

task.tutoring_socratic, task.json_schema_filling

Evidence quality is currently limited for this use case. Treat the rankings below as a starting point for exploration, not as evidence of a clear winner.

Provisional leader

gpt-4.1-20250414

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 19.5%

Confidence: 30.1%

All ranked models — top 3

🥇

gpt-4.1-20250414

19.5%

🥈

gpt-5-2025-08-07

17.3%

🥉

claude-sonnet-4

16.8%

Ranked Models

Evidence Quality

80%

Evidence Points

Top Signal

OpenVLM TextVQA Official: textvqa_score_pct

All Ranked Models

28 of 28 models

Rank	Model	Score	Confidence	Price / 1M	Evidence sources
🥇	gpt-4.1-20250414 Strong on OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRBench Official ocrbench_score_pct	19.5%	30%	—	OpenVLM TextVQA Official, OpenVLM OCRBench Official
🥉	gpt-5-2025-08-07 Strong on OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct	17.3%	25%	—	OpenVLM OCRBench Official, MathArena Models
#6	claude-sonnet-4 Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	16.8%	24%	$6.00	LanguageBench Grammar/Clarity Official (Split), LanguageBench Translation Official (Split)
#8	gpt-4.1-mini-20250414 Strong on OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM TextVQA Official textvqa_score_pct	16.3%	25%	—	OpenVLM OCRBench Official, OpenVLM TextVQA Official
#9	Claude-3.5-Sonnet Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and OpenVLM OCRBench Official ocrbench_score_pct	16.2%	26%	$6.00	LanguageBench Grammar/Clarity Official (Split), OpenVLM OCRBench Official
#12	gemini-2.5-flash Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	15.1%	20%	$0.17	LanguageBench Grammar/Clarity Official (Split), LanguageBench Translation Official (Split)
#27	gemini-2.5-pro Strong on OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct	12.9%	27%	$3.44	OpenVLM OCRBench Official, MathArena Models
#28	gemini-2.0-flash-001 Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	12.7%	16%	—	LanguageBench Grammar/Clarity Official (Split), LanguageBench Translation Official (Split)
#30	gpt-5-mini-2025-08-07 Strong on OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct	12.6%	18%	—	OpenVLM OCRBench Official, MathArena Models
#32	gpt-4o Strong on OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct	12.2%	19%	$0.26	OpenVLM OCRBench Official, OpenVLM MTVQA Official
#33	gpt-4.1 Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	12.2%	17%	$3.50	LanguageBench Grammar/Clarity Official (Split), LanguageBench Translation Official (Split)
#42	gemini-3.1-pro-preview Strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	11.3%	12%	$4.50	MathArena Models, Vals GPQA
#64	deepseek-r1 Strong on SYCON Bench (Table 2) sycon_unethical_tof_pct and DuckDB NSQL Leaderboard all_execution_accuracy	9.8%	21%	$0.27	SYCON Bench (Table 2), DuckDB NSQL Leaderboard
#66	Qwen-VL-Chat Strong on OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRVQA Education & Teaching Official ocrvqa_education_teaching_score_pct	9.7%	18%	—	OpenVLM TextVQA Official, OpenVLM OCRVQA Education & Teaching Official
#68	Llama-3.1-70B-Instruct Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	9.6%	17%	—	LanguageBench Grammar/Clarity Official (Split), LanguageBench Translation Official (Split)
#77	gemini-3-pro-preview Strong on Humanity's Last Exam Leaderboard hle_accuracy_pct and Vals GPQA overall_accuracy_pct	9.0%	11%	$4.50	Humanity's Last Exam Leaderboard, Vals GPQA
#78	kimi-k2.5-thinking Strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	8.9%	13%	—	MathArena Models, Vals GPQA
#82	o3-20250416 Strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	8.8%	12%	$3.50	MathArena Models, Vals GPQA
#87	Llama-3.3-70B-Instruct Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	8.6%	11%	—	LanguageBench Grammar/Clarity Official (Split), LanguageBench Translation Official (Split)
#90	gpt-5.2-2025-12-11 Strong on Vals GPQA overall_accuracy_pct and Humanity's Last Exam Leaderboard hle_accuracy_pct	8.6%	11%	—	Vals GPQA, Humanity's Last Exam Leaderboard
#92	qwen-2.5-72b-instruct Strong on Multilingual MMLU Benchmark mmlu and DuckDB NSQL Leaderboard all_execution_accuracy	8.4%	16%	—	Multilingual MMLU Benchmark, DuckDB NSQL Leaderboard
#98	Grok-4-0709 Strong on Vals GPQA overall_accuracy_pct and Galileo Agent Leaderboard v2 Avg TSQ	8.0%	11%	—	Vals GPQA, Galileo Agent Leaderboard v2
#103	GPT-4.1-nano-2025-04-14 Strong on OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct	7.6%	14%	—	OpenVLM OCRBench Official, OpenVLM MTVQA Official
#117	o4-mini Strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	7.1%	11%	$1.93	MathArena Models, Vals GPQA
#133	phi-4 Strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and DuckDB NSQL Leaderboard all_execution_accuracy	6.4%	12%	—	LanguageBench Grammar/Clarity Official (Split), DuckDB NSQL Leaderboard
#134	gemini-3.1-flash-lite-preview Strong on Vals GPQA overall_accuracy_pct and Vals Mortgage Tax overall_accuracy_pct	6.3%	11%	$0.56	Vals GPQA, Vals Mortgage Tax
#150	Meta-Llama-3-8B-Instruct Strong on Multilingual MMLU Benchmark mmlu and LLM Trustworthy Leaderboard fairness	4.0%	11%	—	Multilingual MMLU Benchmark, LLM Trustworthy Leaderboard
#155	Qwen3-30B-A3B Strong on DuckDB NSQL Leaderboard all_execution_accuracy and EQ-Bench Leaderboard eq_bench_score	1.9%	12%	—	DuckDB NSQL Leaderboard, EQ-Bench Leaderboard

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals GPQA

15 rows · 1.2% avg lift

Vals Mortgage Tax

15 rows · 0.3% avg lift

Vals MedQA

13 rows · 0.3% avg lift

EQ-Bench Leaderboard

13 rows · 0.3% avg lift

Missing frontier models

claude-opus-4-5-20251101

Thin evidence after weighting

Rank #10

18.6%

claude-sonnet-4.6

Thin evidence after weighting

Rank #11

20.0%

grok-4-1-fast-reasoning

Thin evidence after weighting

Rank #12

19.5%

grok-4-1-fast-non-reasoning

Thin evidence after weighting

Rank #15

14.9%

Taxonomy & task details

Core tasks

task.tutoring_socratic, task.json_schema_filling

Required modes

mode.json_schema

Domains

domain.education_tutoring
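
To make the mode.json_schema requirement concrete, here is a minimal sketch of what a schema-constrained exercise might look like and how a reply could be checked. The field names (difficulty, problem, hints, solution), the three difficulty levels, and the validate_exercise helper are all illustrative assumptions; the page does not publish the actual schema used for this task.

```python
import json

# Hypothetical schema: field names and difficulty levels are assumptions
# made for illustration, not taken from the task definition.
EXERCISE_SCHEMA = {
    "type": "object",
    "required": ["difficulty", "problem", "hints", "solution"],
    "properties": {
        "difficulty": {"type": "string", "enum": ["easy", "medium", "hard"]},
        "problem": {"type": "string"},
        "hints": {"type": "array", "items": {"type": "string"}},
        "solution": {"type": "string"},
    },
    "additionalProperties": False,
}


def validate_exercise(raw: str) -> list:
    """Return a list of problems found in a model's JSON reply (empty if valid).

    Hand-rolled check covering only the constructs used in EXERCISE_SCHEMA;
    it is not a general JSON Schema validator.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    props = EXERCISE_SCHEMA["properties"]
    errors = [f"missing field: {k}" for k in EXERCISE_SCHEMA["required"] if k not in data]
    errors += [f"unexpected field: {k}" for k in data if k not in props]

    for key, rule in props.items():
        if key not in data:
            continue
        value = data[key]
        if rule["type"] == "string":
            if not isinstance(value, str):
                errors.append(f"{key} must be a string")
            elif "enum" in rule and value not in rule["enum"]:
                errors.append(f"{key} must be one of {rule['enum']}")
        elif rule["type"] == "array":
            # Every array in this schema holds strings (the hints).
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                errors.append(f"{key} must be a list of strings")
    return errors
```

A reply passes when validate_exercise returns an empty list; anything else is a list of human-readable problems that could be fed back to the model or logged. In practice a full JSON Schema validator would likely replace the hand-rolled checks.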

Related in Education

Language conversation partner

Conversational practice with gentle corrections and explanations.

Grammar and writing coach

Correct grammar and explain fixes at the learner's level.

Grading and feedback assistant

Provide rubric-tagged feedback drafts for educator review.

Lesson plan generator

Generate lesson plans with objectives, activities, and assessments.