Model Profile

openai/gpt-4.1

Name: openai/gpt-4.1
Rating: 2.6 (178 reviews)
Author: openai

External Benchmark Shadowexternal_benchmark_shadowpublic

4,096 ctx

Use this page to decide where this model is a strong fit. Rankings below are benchmark-backed by use case, with explicit confidence and contributor metrics.

Identity

ID: external/openai/gpt-4-1

Author: openai

Origin: external_benchmark_shadow

Arch: unknown

Benchmark Coverage

Scored use cases: 12

Avg confidence: 23.5%

Evidence points: 178

Raw rows: 87

Weighted rows: 25

Catalog Metadata

Parameters: unknown

Context window: 4096

Downloads: 0

Price / 1M tokens: $3.50 (blended 3:1)

Intelligence Profile

Dimension Breakdown

IQ4 benchmarks

57.2%*

EQ1 benchmark

91.8%*

Accuracy2 benchmarks

65.8%*

Creativity0 benchmarks

No creativity benchmarks found

Insufficient data

Based0 benchmarks

No based benchmarks found

Insufficient data

* Low confidence — limited benchmark evidence for this dimension

3/5 dimensions scored · Last updated Apr 21, 2026

Benchmark Signals

Click through to the benchmark source behind this model profile.

DuckDB NSQL Leaderboard

all_execution_accuracy

10.1%

Normalized value 96.2% · confidence 100.0%

Strongest impact in Metric definition workshop

duckdb_nsql_leaderboard.all_execution_accuracy · Apr 1, 2026

LanguageBench Translation Official (Split)

translation_to:bleu

6.2%

Normalized value 82.7% · confidence 100.0%

Strongest impact in Archaic and historical translation

languagebench_translation_official.translation_to_bleu · Apr 1, 2026

LanguageBench

overall:mean

5.1%

Normalized value 98.3% · confidence 100.0%

Strongest impact in Archaic and historical translation

languagebench.overall_mean · Apr 1, 2026

SWE-bench Verified Leaderboard

swe_verified_resolved_pct

5.1%

Normalized value 94.1% · confidence 100.0%

Strongest impact in Verilog/VHDL generation

swebench_verified_official.swe_verified_resolved_pct · Apr 1, 2026

LanguageBench Grammar/Clarity Official (Split)

grammar_clarity_score_pct

3.5%

Normalized value 95.8% · confidence 100.0%

Strongest impact in Translation and localization

languagebench_grammar_clarity_official.grammar_clarity_score_pct · Apr 1, 2026

Aider Polyglot Leaderboard

percent_correct_pct

2.9%

Normalized value 88.2% · confidence 100.0%

Strongest impact in Verilog/VHDL generation

aider_polyglot.percent_correct_pct · Apr 1, 2026

Some fit rows have limited benchmark evidence.

9 of 12 scored use cases have low confidence or thin contributor coverage.

Coverage Diagnostics

actively scored

Use-Case Scores

101

Total Measurements

Weighted Measurements

Weighted Sources

Raw Source Coverage

duckdb_nsql_leaderboard 12mws_vision_bench 12languagebench 10baxbench_leaderboard 9sciarena_leaderboard 7browsecomp_plus_leaderboard 4

Weighted Source Coverage

languagebench 3languagebench_translation_official 3lexam_leaderboard 3vader_leaderboard 3aider_polyglot 2duckdb_nsql_leaderboard 2

Best Use Cases for This Model

Use Case	Vertical	Score	Confidence	Evidence	Top Contributor
Archaic and historical translation use_case.history.archaic_translation	history_linguistics	26.2%	33.1%	17	LanguageBench Translation Official (Split): translation_to:bleu
Legal translation use_case.legal.legal_translation	legal	24.7%	29.4%	17	LanguageBench Translation Official (Split): translation_to:bleu
Brand voice localization use_case.mkt.brand_voice_localization	marketing_sales	20.9%	24.5%	14	LanguageBench Translation Official (Split): translation_to:bleu
Historical document summarization use_case.history.historical_doc_summarization	history_linguistics	20.6%	26.3%	16	LanguageBench: overall:mean
Verilog/VHDL generation use_case.eda.verilog_generation	engineering	19.6%	22.3%	13	SWE-bench Verified Leaderboard: swe_verified_resolved_pct
Metric definition workshop use_case.data.metric_definition_workshop	data_analytics	17.8%	24.1%	13	DuckDB NSQL Leaderboard: all_execution_accuracy
Integration test generation use_case.dev.integration_tests	developer_tools	16.6%	19.0%	13	SWE-bench Verified Leaderboard: swe_verified_resolved_pct
Grammar and writing coach use_case.lang.grammar_coach	education	16.6%	21.7%	16	LanguageBench Translation Official (Split): translation_to:bleu
Data quality assistant use_case.data.data_quality_assistant	data_analytics	16.3%	22.0%	13	DuckDB NSQL Leaderboard: all_execution_accuracy
Translation and localization use_case.business.translation_localization	business_productivity	16.0%	18.9%	14	LanguageBench Grammar/Clarity Official (Split): grammar_clarity_score_pct
Contract term extraction use_case.legal.contract_term_extraction	legal	15.9%	20.1%	16	LEXam Leaderboard: average_score_pct
Clause playbook check use_case.legal.playbook_clause_check	legal	15.9%	20.1%	16	LEXam Leaderboard: average_score_pct