Healthcare

Clinical note drafting

Summarize encounters into structured notes for clinician review.

task.summarize_meeting_transcript, task.json_schema_filling

Evidence quality is currently limited for this use case. Treat the rankings below as a starting point for exploration, not as proof of a clear winner.

Provisional leader

claude-sonnet-4

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

20.9%

Best benchmark score

28.2%

Confidence

All ranked models — top 3

🥇

claude-sonnet-4

20.9%

🥈

gpt-5-2025-08-07

20.5%

🥉

gemini-2.5-flash

20.3%

Ranked Models

Evidence Quality

81%

Evidence Points

Top Signal

Galileo Agent Leaderboard v2: Healthcare AC

All Ranked Models

30 of 30 models

Rank	Model	Score	Confidence	Price / 1M tokens	Evidence sources
🥇	claude-sonnet-4 (strong on Galileo Agent Leaderboard v2 Healthcare AC and Vals MedQA overall_accuracy_pct)	20.9%	28%	$6.00	Galileo Agent Leaderboard v2, Vals MedQA
🥈	gpt-5-2025-08-07 (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	20.5%	37%	—	Vals MedQA, Vals MedScribe
🥉	gemini-2.5-flash (strong on BRIDGE Medical Leaderboard average_performance_pct and Vals MedScribe overall_accuracy_pct)	20.3%	29%	$0.17	BRIDGE Medical Leaderboard, Vals MedScribe
#4	gpt-4.1-20250414 (strong on Galileo Agent Leaderboard v2 Healthcare AC and MMLongBench-Doc Leaderboard acc_score_pct)	20.3%	27%	—	Galileo Agent Leaderboard v2, MMLongBench-Doc Leaderboard
#5	gemini-2.5-pro (strong on Vectara HHEM Leaderboard medicine_hallucination_error_pct and Galileo Agent Leaderboard v2 Healthcare TSQ)	20.2%	37%	$3.44	Vectara HHEM Leaderboard, Galileo Agent Leaderboard v2
#6	gpt-5-mini-2025-08-07 (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	19.4%	34%	—	Vals MedQA, Vals MedScribe
#7	gemini-3.1-pro-preview (strong on Vals MedQA overall_accuracy_pct and Vectara HHEM Leaderboard medicine_hallucination_error_pct)	18.8%	22%	$4.50	Vals MedQA, Vectara HHEM Leaderboard
#8	Grok-4-0709 (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	17.5%	27%	—	Vals MedQA, Vals MedScribe
#9	qwen-2.5-72b-instruct (strong on BRIDGE Medical Leaderboard average_performance_pct and Galileo Agent Leaderboard v2 Healthcare AC)	16.6%	22%	—	BRIDGE Medical Leaderboard, Galileo Agent Leaderboard v2
#10	gpt-4o (strong on MedHELM average_score_pct and MedHELM clinical_note_generation_win_rate_pct)	16.5%	20%	$0.26	MedHELM, MedHELM
#11	gemini-3-pro-preview (strong on Vals MedQA overall_accuracy_pct and Vectara HHEM Leaderboard medicine_hallucination_error_pct)	16.4%	21%	$4.50	Vals MedQA, Vectara HHEM Leaderboard
#12	claude-opus-4-5-20251101 (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	15.2%	21%	—	Vals MedQA, Vals MedScribe
#13	gemini-3-flash-preview (strong on Vals MedQA overall_accuracy_pct and Vectara HHEM Leaderboard medicine_hallucination_error_pct)	14.8%	19%	$1.13	Vals MedQA, Vectara HHEM Leaderboard
#14	gpt-5.4-2026-03-05 (strong on Vals MedQA overall_accuracy_pct and Vectara HHEM Leaderboard medicine_hallucination_error_pct)	14.2%	19%	—	Vals MedQA, Vectara HHEM Leaderboard
#15	gpt-5.2-2025-12-11 (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	14.1%	17%	—	Vals MedQA, Vals MedScribe
#16	gpt-5.1-2025-11-13 (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	13.3%	17%	—	Vals MedQA, Vals MedScribe
#17	o3-20250416 (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	13.3%	20%	$3.50	Vals MedQA, Vals MedScribe
#18	grok-4-fast-reasoning (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	12.2%	21%	$0.28	Vals MedQA, Vals MedScribe
#19	gemini-2.5-flash-lite (strong on Galileo Agent Leaderboard v2 Healthcare AC and Vectara HHEM Leaderboard medicine_hallucination_error_pct)	12.1%	19%	$0.17	Galileo Agent Leaderboard v2, Vectara HHEM Leaderboard
#20	Claude-3.5-Sonnet (strong on MedHELM average_score_pct and DuckDB NSQL Leaderboard all_execution_accuracy)	11.8%	18%	$6.00	MedHELM, DuckDB NSQL Leaderboard
#21	claude-opus-4-1-20250805 (strong on Vals MedQA overall_accuracy_pct and Vectara HHEM Leaderboard medicine_hallucination_error_pct)	11.7%	19%	—	Vals MedQA, Vectara HHEM Leaderboard
#22	grok-4-1-fast-reasoning (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	11.6%	19%	$0.28	Vals MedQA, Vals MedScribe
#23	claude-opus-4-6-thinking (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	11.4%	13%	—	Vals MedQA, Vals MedScribe
#24	gpt-4.1-mini-20250414 (strong on Galileo Agent Leaderboard v2 Healthcare AC and Vals MedQA overall_accuracy_pct)	11.3%	14%	—	Galileo Agent Leaderboard v2, Vals MedQA
#25	claude-opus-4-5-20251101-thinking (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	11.2%	13%	—	Vals MedQA, Vals MedScribe
#26	claude-sonnet-4.6 (strong on Vals MedQA overall_accuracy_pct and Vectara HHEM Leaderboard medicine_hallucination_error_pct)	11.1%	15%	$6.00	Vals MedQA, Vectara HHEM Leaderboard
#27	gemini-3.1-flash-lite-preview (strong on Vectara HHEM Leaderboard medicine_hallucination_error_pct and FACTS Benchmark Suite facts_grounding_score_pct)	10.6%	16%	$0.56	Vectara HHEM Leaderboard, FACTS Benchmark Suite
#28	claude-sonnet-4-5-20250929-thinking (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	10.3%	13%	—	Vals MedQA, Vals MedScribe
#30	deepseek-r1 (strong on BRIDGE Medical Leaderboard average_performance_pct and DuckDB NSQL Leaderboard all_execution_accuracy)	10.1%	23%	$0.27	BRIDGE Medical Leaderboard, DuckDB NSQL Leaderboard
#31	kimi-k2.5-thinking (strong on Vals MedQA overall_accuracy_pct and Vals MedScribe overall_accuracy_pct)	9.8%	15%	—	Vals MedQA, Vals MedScribe
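
Because the top scores in the table sit very close together, one way to read the ranking as directional is to group models that fall within a small margin of the leader and then compare them on price. The sketch below does this for the first five rows. The scores, confidence values, and prices are copied from the table; the helper names and the margin values are illustrative assumptions, not part of the leaderboard's own methodology.

```python
# First five rows of the table: (model, score %, confidence %, price per 1M or None).
# A price of None means the table lists no price for that model.
rows = [
    ("claude-sonnet-4", 20.9, 28, 6.00),
    ("gpt-5-2025-08-07", 20.5, 37, None),
    ("gemini-2.5-flash", 20.3, 29, 0.17),
    ("gpt-4.1-20250414", 20.3, 27, None),
    ("gemini-2.5-pro", 20.2, 37, 3.44),
]


def within_margin(rows, margin):
    """Return the models whose score is within `margin` points of the top score.

    `margin` is a hypothetical tolerance chosen by the reader, not a
    threshold defined by the leaderboard.
    """
    top = max(score for _, score, _, _ in rows)
    # Small epsilon guards against floating-point noise in the subtraction.
    return [name for name, score, _, _ in rows if top - score <= margin + 1e-9]


def cheapest_priced(rows, names):
    """Return the lowest-priced model among `names`, ignoring unpriced models."""
    priced = [
        (price, name)
        for name, _, _, price in rows
        if name in names and price is not None
    ]
    return min(priced)[1] if priced else None


near_tie = within_margin(rows, 1.0)
print(near_tie)
print(cheapest_priced(rows, near_tie))
```

With a one-point margin, all five models count as a near-tie, so price and confidence may matter more than rank order when choosing among them.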

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals MedQA

39 rows · 2.3% avg lift

Vals LiveCodeBench

36 rows · 0.3% avg lift

Vals Legal Bench

35 rows · 0.3% avg lift

Vals Tax Eval v2

35 rows · 0.3% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.summarize_meeting_transcript, task.json_schema_filling

Required modes

mode.long_context, mode.json_schema

Domains

domain.healthcare_clinical
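
To make the JSON schema filling task concrete, here is a minimal sketch of how a model-produced note might be checked before it reaches a clinician. It assumes a hypothetical four-field note (subjective, objective, assessment, plan); these field names, `REQUIRED_FIELDS`, and `validate_note` are illustrative and do not come from the leaderboard or any of its benchmarks.

```python
import json

# Hypothetical note schema: field names and types are illustrative assumptions.
REQUIRED_FIELDS = {
    "subjective": str,
    "objective": str,
    "assessment": str,
    "plan": list,
}


def validate_note(raw: str) -> list[str]:
    """Return a list of problems in a model-produced JSON note (empty if none)."""
    try:
        note = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(note, dict):
        return ["top-level value must be an object"]

    problems = []
    # Check that each required field is present and has the expected type.
    for field, expected in REQUIRED_FIELDS.items():
        if field not in note:
            problems.append(f"missing field: {field}")
        elif not isinstance(note[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    # Flag fields the schema does not define, so they can be reviewed.
    extra = sorted(set(note) - set(REQUIRED_FIELDS))
    problems.extend(f"unexpected field: {name}" for name in extra)
    return problems


draft = '{"subjective": "Cough for 3 days", "objective": "Temp 37.2 C", "assessment": "Viral URI", "plan": ["Rest", "Fluids"]}'
print(validate_note(draft))
```

A check like this only confirms the shape of the note. It says nothing about clinical accuracy, which is why the use case keeps a clinician in the review loop.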

Related in Healthcare

Patient education bot (RAG grounded)

Answer patient FAQ using trusted sources with cautious wording.

Medical coding support (suggestions)

Extract coding-relevant facts and suggest codes for human review.

Patient-friendly explanations

Rewrite technical notes into clear, accessible patient language.

Medical chart summary

Summarize a patient's chart into timeline, problems, and meds for review.