BasedAGI

developer_tools

anthropic/claude-sonnet-4 vs gpt-5-2025-08-07

For Refactoring assistant

Benchmark coverage is still limited for this use case, so this comparison is directional rather than definitive.

Model A leads so far by +3.9%

Model A

Current leader

anthropic/claude-sonnet-4

external/anthropic/claude-sonnet-4

31.2%

Rank #1

Confidence

43.8%

Evidence

27 pts

SWE-bench Verified Leaderboard: swe_verified_resolved_pct

Value 81.7% · Conf 100.0% · Weight 3.0%

swebench_verified_official.swe_verified_resolved_pct (Apr 1, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 84.8% · Conf 100.0% · Weight 1.9%

galileo_agent_v2.avg_ac (Apr 1, 2026)

Sonar Java Quality Leaderboard: functional_skill_pct

Value 79.5% · Conf 100.0% · Weight 1.8%

sonar_java_quality.functional_skill_pct (Apr 1, 2026)

Aider Polyglot Leaderboard: percent_correct_pct

Value 67.9% · Conf 100.0% · Weight 1.3%

aider_polyglot.percent_correct_pct (Apr 1, 2026)

Sonar Java Quality Leaderboard: issue_density_error_per_kloc

Value 58.5% · Conf 100.0% · Weight 1.1%

sonar_java_quality.issue_density_error_per_kloc (Apr 1, 2026)

Model B

gpt-5-2025-08-07

external/openai/gpt-5-2025-08-07

27.3%

Rank #2

Confidence

32.9%

Evidence

27 pts

SWE-bench Verified Leaderboard: swe_verified_resolved_pct

Value 93.8% · Conf 100.0% · Weight 3.5%

swebench_verified_official.swe_verified_resolved_pct (Apr 1, 2026)

Aider Polyglot Leaderboard: percent_correct_pct

Value 100.0% · Conf 100.0% · Weight 1.9%

aider_polyglot.percent_correct_pct (Apr 1, 2026)

Vals LiveCodeBench: overall_accuracy_pct

Value 96.5% · Conf 100.0% · Weight 1.3%

vals_lcb.overall_accuracy_pct (Mar 31, 2026)

Vals SWE-bench: overall_accuracy_pct

Value 78.3% · Conf 100.0% · Weight 1.2%

vals_swebench.overall_accuracy_pct (Mar 31, 2026)

Vals Terminal-Bench 2: overall_accuracy_pct

Value 53.4% · Conf 100.0% · Weight 0.7%

vals_terminal_bench_2.overall_accuracy_pct (Mar 31, 2026)
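
The +3.9% lead shown at the top appears to be the gap between the two headline scores (31.2% for Model A, 27.3% for Model B), measured in percentage points. The page does not state how the lead is computed, so the sketch below is an assumption, and the function name lead_in_points is hypothetical.

```python
def lead_in_points(score_a: float, score_b: float) -> float:
    """Return score_a minus score_b in percentage points, rounded to one decimal.

    Assumption: the page's "lead" is a plain difference of headline scores.
    """
    return round(score_a - score_b, 1)


model_a_score = 31.2  # anthropic/claude-sonnet-4 headline score from the page
model_b_score = 27.3  # gpt-5-2025-08-07 headline score from the page

lead = lead_in_points(model_a_score, model_b_score)
print(f"Model A leads by {lead:+.1f} points")
```

Under that assumption the result matches the +3.9 shown on the page. The headline scores are not a direct average of the listed benchmark values, so this check says nothing about how the scores themselves are derived.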