CAD scripting helper
Generate and debug CAD automation scripts and parametric geometry code.
Provisional leader
gpt-5-2025-08-07
Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.
28.0%
Best benchmark score
32.5%
Confidence
All ranked models: top 3
Ranked Models
30
Evidence Quality
82%
Evidence Points
31
Top Signal
SWE-bench Verified Leaderboard: swe_verified_resolved_pct
All Ranked Models
| Rank | Model | Strong on | Score |
|---|---|---|---|
| 🥇 | gpt-5-2025-08-07 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 28.0% |
| 🥈 | claude-opus-4-6 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and OpenHands Issue Resolution issue_resolution_score_pct | 27.5% |
| #4 | claude-sonnet-4 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 24.1% |
| #7 | Kimi K2 Thinking | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Sonar Java Quality Leaderboard functional_skill_pct | 19.9% |
| #8 | claude-sonnet-4.6 | OpenHands Issue Resolution issue_resolution_score_pct and OpenHands Index issue_resolution_score_pct | 19.0% |
| #11 | GLM-5 | Sonar Java Quality Leaderboard functional_skill_pct and OpenHands Issue Resolution issue_resolution_score_pct | 18.2% |
| #12 | gemini-3-pro-preview | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals SWE-bench overall_accuracy_pct | 17.4% |
| #13 | o3-20250416 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 16.7% |
| #15 | gpt-5.2-2025-12-11 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals SWE-bench overall_accuracy_pct | 16.3% |
| #16 | gemini-2.5-pro | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Galileo Agent Leaderboard v2 Avg AC | 15.4% |
| #18 | gpt-4o-20241120 | BigCodeBench Official bigcodebench_complete_pct and BigCodeBench Official bigcodebench_instruct_pct | 15.2% |
| #19 | gpt-4o-2024-05-13 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and BigCodeBench Official bigcodebench_complete_pct | 14.6% |
| #20 | deepseek-r1 | Aider Polyglot Leaderboard percent_correct_pct and Sonar Java Quality Leaderboard functional_skill_pct | 14.5% |
| #21 | claude-opus-4-5-20251101 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals Terminal-Bench 2 overall_accuracy_pct | 14.4% |
| #22 | claude-opus-4 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 14.4% |
| #24 | gpt-4.1-20250414 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Galileo Agent Leaderboard v2 Avg AC | 14.1% |
| #25 | minimax-m2.1 | Sonar Java Quality Leaderboard functional_skill_pct and Vals SWE-bench overall_accuracy_pct | 13.9% |
| #26 | gpt-4.1 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 13.9% |
| #27 | gpt-5-mini-2025-08-07 | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals LiveCodeBench overall_accuracy_pct | 13.9% |
| #28 | o4-mini | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct | 13.5% |
| #30 | kimi-k2.5-thinking | Vals LiveCodeBench overall_accuracy_pct and Vals SWE-bench overall_accuracy_pct | 12.9% |
| #32 | glm-4.7 | Sonar Java Quality Leaderboard functional_skill_pct and Vals LiveCodeBench overall_accuracy_pct | 12.1% |
| #36 | gemini-3.1-pro-preview | Vals SWE-bench overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct | 10.9% |
| #37 | qwen-2.5-72b-instruct | BigCodeBench Official bigcodebench_complete_pct and BigCodeBench Official bigcodebench_instruct_pct | 10.7% |
| #42 | gpt-4o | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Sonar Java Quality Leaderboard functional_skill_pct | 10.2% |
| #44 | Grok-4-0709 | Vals LiveCodeBench overall_accuracy_pct and Galileo Agent Leaderboard v2 Avg AC | 9.8% |
| #45 | deepseek-v3 | BigCodeBench Official bigcodebench_complete_pct and BigCodeBench Official bigcodebench_instruct_pct | 9.8% |
| #46 | gpt-4o-mini-2024-07-18 | BigCodeBench Official bigcodebench_complete_pct and BigCodeBench Official bigcodebench_instruct_pct | 9.2% |
| #47 | gemini-3-flash-preview | Vals SWE-bench overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct | 9.2% |
| #48 | gpt-5.4-2026-03-05 | Vals SWE-bench overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct | 9.2% |
Ranking diagnostics & missing models
Source lift
Ranked
40
Sources
8
Quality
Low
Vals LiveCodeBench
Vals MedQA
Vals Legal Bench
SWE-bench Verified Leaderboard
Missing frontier models
grok-4-1-fast-non-reasoning
Thin evidence after weighting
Rank #15
14.9%
Taxonomy & task details
Core tasks
Required modes
Domains
Related in Engineering
Component selection assistant
Recommend components under constraints with evidence and tradeoffs.
Simulation setup assistant
Turn design requirements into simulation setup checklists and boundary notes.
Verilog/VHDL generation
Generate RTL code and testbenches from functional specs.