Education

Grammar and writing coach

Correct grammar and explain fixes at the learner's level.

task.rewrite_clarity

task.translate_general

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong claim about which model is best.

Provisional leader

gemini-2.5-flash

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 26.3%

Confidence: 35.0%

All ranked models — top 3

🥇

gemini-2.5-flash

26.3%

🥈

gpt-4.1-20250414

23.2%

🥉

claude-sonnet-4

22.1%

Ranked Models

Evidence Quality

82%

Evidence Points

Top Signal

LanguageBench Translation Official (Split): translation_to:bleu

All Ranked Models

30 of 30 models

Rank	Model	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-2.5-flash: strong on LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct	26.3%	35%	$0.17	LanguageBench Translation Official (Split), LanguageBench Grammar/Clarity Official (Split)
🥈	gpt-4.1-20250414: strong on OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRBench Official ocrbench_score_pct	23.2%	38%	—	OpenVLM TextVQA Official, OpenVLM OCRBench Official
#4	claude-sonnet-4: strong on LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct	22.1%	30%	$6.00	LanguageBench Translation Official (Split), LanguageBench Grammar/Clarity Official (Split)
#6	Claude-3.5-Sonnet: strong on LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct	20.6%	33%	$6.00	LanguageBench Translation Official (Split), LanguageBench Grammar/Clarity Official (Split)
#7	gemini-2.0-flash-001: strong on LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct	17.7%	21%	—	LanguageBench Translation Official (Split), LanguageBench Grammar/Clarity Official (Split)
#8	gemini-3-pro-preview: strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc	17.1%	21%	$4.50	BFCL Memory Official, BFCL Multi-turn Official
#12	gpt-4.1: strong on LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct	16.6%	22%	$3.50	LanguageBench Translation Official (Split), LanguageBench Grammar/Clarity Official (Split)
#13	gpt-4.1-mini-20250414: strong on OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM TextVQA Official textvqa_score_pct	16.5%	25%	—	OpenVLM OCRBench Official, OpenVLM TextVQA Official
#15	gpt-5-2025-08-07: strong on OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct	15.6%	21%	—	OpenVLM OCRBench Official, MathArena Models
#16	Grok-4-0709: strong on BFCL Memory Official Memory Acc and BFCL Relevance Detection Official Relevance Detection	15.4%	21%	—	BFCL Memory Official, BFCL Relevance Detection Official
#17	o3-20250416: strong on BFCL Memory Official Memory Acc and MathArena Models average_score_pct	15.1%	22%	$3.50	BFCL Memory Official, MathArena Models
#32	gemini-2.5-pro: strong on OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct	13.4%	33%	$3.44	OpenVLM OCRBench Official, MathArena Models
#35	grok-4-1-fast-reasoning: strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc	13.3%	17%	$0.28	BFCL Memory Official, BFCL Multi-turn Official
#37	Llama-3.1-70B-Instruct: strong on LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct	13.1%	24%	—	LanguageBench Translation Official (Split), LanguageBench Grammar/Clarity Official (Split)
#39	gpt-5.2-2025-12-11: strong on Vals GPQA overall_accuracy_pct and BFCL Relevance Detection Official Relevance Detection	13.0%	20%	—	Vals GPQA, BFCL Relevance Detection Official
#41	gpt-5-mini-2025-08-07: strong on OpenVLM OCRBench Official ocrbench_score_pct and MathArena Models average_score_pct	12.7%	18%	—	OpenVLM OCRBench Official, MathArena Models
#42	o4-mini: strong on BFCL Memory Official Memory Acc and MathArena Models average_score_pct	12.6%	21%	$1.93	BFCL Memory Official, MathArena Models
#55	Llama-3.3-70B-Instruct: strong on LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct	11.6%	16%	—	LanguageBench Translation Official (Split), LanguageBench Grammar/Clarity Official (Split)
#57	gemini-3.1-pro-preview: strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	11.4%	13%	$4.50	MathArena Models, Vals GPQA
#60	deepseek-r1: strong on LanguageBench Translation Official (Split) translation_to:bleu and SYCON Bench (Table 2) sycon_unethical_tof_pct	11.3%	26%	$0.27	LanguageBench Translation Official (Split), SYCON Bench (Table 2)
#65	GLM-4.6: strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc	11.0%	14%	—	BFCL Memory Official, BFCL Multi-turn Official
#85	Qwen-VL-Chat: strong on OpenVLM TextVQA Official textvqa_score_pct and OpenVLM OCRVQA Education & Teaching Official ocrvqa_education_teaching_score_pct	9.8%	18%	—	OpenVLM TextVQA Official, OpenVLM OCRVQA Education & Teaching Official
#91	grok-4-1-fast-non-reasoning: strong on BFCL Relevance Detection Official Relevance Detection and BFCL Memory Official Memory Acc	9.3%	17%	$0.28	BFCL Relevance Detection Official, BFCL Memory Official
#92	claude-opus-4-5-20251101: strong on Vals GPQA overall_accuracy_pct and BFCL Relevance Detection Official Relevance Detection	9.2%	17%	—	Vals GPQA, BFCL Relevance Detection Official
#93	gpt-4o: strong on OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct	9.1%	16%	$0.26	OpenVLM OCRBench Official, OpenVLM MTVQA Official
#98	kimi-k2.5-thinking: strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	9.0%	13%	—	MathArena Models, Vals GPQA
#108	Kimi-K2-Instruct: strong on BFCL Memory Official Memory Acc and BFCL Multi-turn Official Multi Turn Acc	8.6%	12%	—	BFCL Memory Official, BFCL Multi-turn Official
#118	phi-4: strong on LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct and LanguageBench Translation Official (Split) translation_to:bleu	8.0%	17%	—	LanguageBench Grammar/Clarity Official (Split), LanguageBench Translation Official (Split)
#121	GPT-4.1-nano-2025-04-14: strong on OpenVLM OCRBench Official ocrbench_score_pct and OpenVLM MTVQA Official mtvqa_score_pct	7.7%	14%	—	OpenVLM OCRBench Official, OpenVLM MTVQA Official
#145	Kimi K2 Thinking: strong on MathArena Models average_score_pct and Vals GPQA overall_accuracy_pct	6.7%	10%	$1.07	MathArena Models, Vals GPQA

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals GPQA

20 rows · 1.2% avg lift

Vals MedQA

17 rows · 0.3% avg lift

Vals Legal Bench

16 rows · 0.3% avg lift

Vals Tax Eval v2

16 rows · 0.3% avg lift

Missing frontier models

claude-sonnet-4.6

Thin evidence after weighting

Rank #11

20.0%

Taxonomy & task details

Core tasks

task.rewrite_clarity

task.translate_general

Required modes

mode.multilingual

Domains

domain.language_learning

Related in Education

Language conversation partner

Conversational practice with gentle corrections and explanations.

Grading and feedback assistant

Provide rubric-tagged feedback drafts for educator review.

Exercise generator

Generate practice problems with solutions and hints by difficulty level.

Lesson plan generator

Generate lesson plans with objectives, activities, and assessments.