Best LLM for Function Calling

Compare models for reliable tool use, function selection, and multi-step API orchestration.
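As a rough illustration of what "function selection" means in practice, the sketch below validates a model-emitted tool call and dispatches it. Everything in it is hypothetical: the `get_weather` tool, the `TOOLS` registry, and the `{"name": ..., "arguments": ...}` call shape are assumptions made for the example, not the format of any particular provider or benchmark listed on this page.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call an external API."""
    return f"Sunny in {city}"


# Hypothetical registry: tool name -> (callable, required argument names).
TOOLS = {
    "get_weather": (get_weather, {"city"}),
}


def dispatch(tool_call_json: str) -> str:
    """Validate a model-emitted tool call and run it, or return an error string."""
    try:
        call = json.loads(tool_call_json)
    except json.JSONDecodeError as exc:
        return f"error: invalid JSON ({exc.msg})"
    if not isinstance(call, dict):
        return "error: tool call must be a JSON object"

    # Function selection: the model must name a tool that actually exists.
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return f"error: unknown tool {name!r}"

    # Argument check: the supplied keys must match the required set exactly.
    func, required = TOOLS[name]
    args = call.get("arguments", {})
    if not isinstance(args, dict) or set(args) != required:
        return f"error: expected arguments {sorted(required)}"
    return func(**args)


if __name__ == "__main__":
    print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
    print(dispatch('{"name": "get_wether", "arguments": {"city": "Oslo"}}'))
```

A model that is weaker at tool use tends to fail at one of these three checks: malformed JSON, a misnamed function, or wrong arguments.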

Benchmark evidence for this use case is still limited. Treat the leader below as provisional.

Provisional leader

claude-opus-4-6 (external/anthropic/claude-opus-4-6)

Best current option based on the available benchmark evidence, but not yet a clear winner.

Score: 21.8%

Confidence: 24.0%

Price: $10.00 per 1M tokens

Runners-up: #2 gpt-5-2025-08-07 (19.7%), #3 anthropic/claude-sonnet-4 (16.8%), #4 anthropic/claude-sonnet-4.6 (15.9%)

Ranked Models

Evidence Quality: 80%

Top Signal: OpenHands Index (average_score_pct)

All Ranked Models

30 of 30 models

Rank	Model	Score	Confidence	Price / 1M	Evidence sources
🥇	claude-opus-4-6 Strong on OpenHands Index average_score_pct and OpenHands Issue Resolution issue_resolution_score_pct	21.8%	24%	$10.00	OpenHands Index, OpenHands Issue Resolution
🥈	gpt-5-2025-08-07 Strong on GAIA Results Public score and Aider Polyglot Leaderboard percent_correct_pct	19.7%	23%	n/a	GAIA Results Public, Aider Polyglot Leaderboard
#8	claude-sonnet-4 Strong on Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct	16.8%	23%	$6.00	Galileo Agent Leaderboard v2, SWE-bench Verified Leaderboard
#9	claude-sonnet-4.6 Strong on OpenHands Issue Resolution issue_resolution_score_pct and Vals SWE-bench overall_accuracy_pct	15.9%	29%	$6.00	OpenHands Issue Resolution, Vals SWE-bench
#10	gemini-3-pro-preview Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and BFCL Multi-turn Official Multi Turn Acc	15.0%	19%	$4.50	SWE-bench Verified Leaderboard, BFCL Multi-turn Official
#11	kimi-k2.5-thinking Strong on Vals LiveCodeBench overall_accuracy_pct and Vals SWE-bench overall_accuracy_pct	14.3%	29%	n/a	Vals LiveCodeBench, Vals SWE-bench
#14	GLM-5 Strong on OpenHands Issue Resolution issue_resolution_score_pct and Sonar Java Quality Leaderboard functional_skill_pct	13.4%	24%	n/a	OpenHands Issue Resolution, Sonar Java Quality Leaderboard
#15	qwen-2.5-72b-instruct Strong on Galileo Agent Leaderboard v2 Avg AC and Aider Code Editing Leaderboard percent_correct_pct	13.0%	17%	n/a	Galileo Agent Leaderboard v2, Aider Code Editing Leaderboard
#16	gpt-4o Strong on τ-bench Airline (Official README) tau_airline_pass1_pct and JSONSchemaBench Leaderboard medium_schema_compliance_pct	13.0%	20%	$0.26	τ-bench Airline (Official README), JSONSchemaBench Leaderboard
#17	Grok-4-0709 Strong on Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	12.9%	20%	n/a	Galileo Agent Leaderboard v2
#19	gpt-4.1-20250414 Strong on Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	12.3%	20%	n/a	Galileo Agent Leaderboard v2
#20	Kimi K2 Thinking Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and Sonar Java Quality Leaderboard functional_skill_pct	12.2%	33%	$1.07	SWE-bench Verified Leaderboard, Sonar Java Quality Leaderboard
#21	o3-20250416 Strong on Aider Polyglot Leaderboard percent_correct_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct	11.9%	17%	$3.50	Aider Polyglot Leaderboard, SWE-bench Verified Leaderboard
#22	gemini-2.5-pro Strong on Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct	11.8%	18%	$3.44	Galileo Agent Leaderboard v2, SWE-bench Verified Leaderboard
#23	gpt-5.2-2025-12-11 Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals SWE-bench overall_accuracy_pct	11.5%	15%	n/a	SWE-bench Verified Leaderboard, Vals SWE-bench
#24	minimax-m2.1 Strong on Vals SWE-bench overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct	10.8%	31%	$0.53	Vals SWE-bench, Vals LiveCodeBench
#28	gpt-4o-20241120 Strong on Aider Code Editing Leaderboard percent_correct_pct and DuckDB NSQL Leaderboard all_execution_accuracy	10.4%	17%	n/a	Aider Code Editing Leaderboard, DuckDB NSQL Leaderboard
#29	gpt-4o-2024-05-13 Strong on Aider Code Editing Leaderboard percent_correct_pct and RepoQA Official Results overall_average_pass_at_1_pct	10.2%	14%	n/a	Aider Code Editing Leaderboard, RepoQA Official Results
#30	gemini-3.1-pro-preview Strong on Vals SWE-bench overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct	9.8%	10%	$4.50	Vals SWE-bench, Vals LiveCodeBench
#31	o4-mini Strong on Aider Polyglot Leaderboard percent_correct_pct and Vals LiveCodeBench overall_accuracy_pct	9.4%	15%	$1.93	Aider Polyglot Leaderboard, Vals LiveCodeBench
#32	gpt-5-mini-2025-08-07 Strong on Vals LiveCodeBench overall_accuracy_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct	9.3%	14%	n/a	Vals LiveCodeBench, SWE-bench Verified Leaderboard
#34	gpt-4o-2024-08-06 Strong on Aider Code Editing Leaderboard percent_correct_pct and GAIA Results Public score	8.9%	16%	n/a	Aider Code Editing Leaderboard, GAIA Results Public
#35	Kimi-K2-Instruct Strong on Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	8.9%	12%	n/a	Galileo Agent Leaderboard v2
#36	claude-opus-4-5-20251101 Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals Terminal-Bench 2 overall_accuracy_pct	8.7%	13%	n/a	SWE-bench Verified Leaderboard, Vals Terminal-Bench 2
#37	deepseek-r1 Strong on Aider Polyglot Leaderboard percent_correct_pct and Sonar Java Quality Leaderboard functional_skill_pct	8.5%	12%	$0.27	Aider Polyglot Leaderboard, Sonar Java Quality Leaderboard
#40	grok-4-1-fast-reasoning Strong on BFCL Multi-turn Official Multi Turn Acc and BFCL Memory Official Memory Acc	8.2%	13%	$0.28	BFCL Multi-turn Official, BFCL Memory Official
#42	gemini-2.5-flash Strong on Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	8.0%	14%	$0.17	Galileo Agent Leaderboard v2
#43	gpt-4.1-mini-20250414 Strong on Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	8.0%	12%	n/a	Galileo Agent Leaderboard v2
#44	glm-4.7 Strong on Vals LiveCodeBench overall_accuracy_pct and Vals SWE-bench overall_accuracy_pct	7.8%	11%	n/a	Vals LiveCodeBench, Vals SWE-bench
#46	deepseek-v3 Strong on Galileo Agent Leaderboard v2 Avg AC and BigCodeBench Official bigcodebench_complete_pct	7.5%	11%	n/a	Galileo Agent Leaderboard v2, BigCodeBench Official
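
The Score and Price / 1M columns can be combined into a rough value measure. The sketch below divides score by price for the rows that list a price. The `score_per_dollar` helper and the ratio itself are illustrative assumptions, not part of this page's scoring method, and with confidence values this low the ordering should be read loosely.

```python
# (model, score_pct, price_usd_per_1m_tokens), copied from the table above.
# Rows with no listed price are left out rather than guessed.
ROWS = [
    ("claude-opus-4-6", 21.8, 10.00),
    ("claude-sonnet-4", 16.8, 6.00),
    ("claude-sonnet-4.6", 15.9, 6.00),
    ("gemini-3-pro-preview", 15.0, 4.50),
    ("gpt-4o", 13.0, 0.26),
    ("Kimi K2 Thinking", 12.2, 1.07),
    ("o3-20250416", 11.9, 3.50),
    ("gemini-2.5-pro", 11.8, 3.44),
    ("minimax-m2.1", 10.8, 0.53),
    ("gemini-3.1-pro-preview", 9.8, 4.50),
    ("o4-mini", 9.4, 1.93),
    ("deepseek-r1", 8.5, 0.27),
    ("grok-4-1-fast-reasoning", 8.2, 0.28),
    ("gemini-2.5-flash", 8.0, 0.17),
]


def score_per_dollar(rows):
    """Return (model, score points per USD per 1M tokens), best value first."""
    ranked = [(name, score / price) for name, score, price in rows]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in score_per_dollar(ROWS):
        print(f"{name:28s} {ratio:6.2f}")
```

On this measure the cheapest models rise to the top and the provisional leader falls near the bottom, which is worth weighing when per-token cost matters more than a few points of score.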

Head-to-Head: #1 vs #2

Top Pick: claude-opus-4-6

Strong on OpenHands Index average_score_pct and OpenHands Issue Resolution issue_resolution_score_pct

Score: 21.8%

Confidence: 24.0%

#2: gpt-5-2025-08-07

Strong on GAIA Results Public score and Aider Polyglot Leaderboard percent_correct_pct

Score: 19.7%

Confidence: 23.3%
