Model Profile

gpt-oss-120b

4,096 ctx · Open weights

Use this page to decide where this model is a strong fit. The rankings below are grouped by use case and backed by benchmarks; each row reports its confidence and its top contributing benchmark.

Identity

ID: openai/gpt-oss-120b

Author: openai

Origin: huggingface_catalog

Arch: unknown

Benchmark Coverage

Scored use cases: 4

Avg confidence: 10.6%

Evidence points: 24

Raw rows: 25

Weighted rows: 8

Catalog Metadata

Parameters: unknown

Context window: 4096

Downloads: 3,477,982

Intelligence Profile

Dimension Breakdown

IQ: 1 benchmark

63.6%*

EQ: 0 benchmarks

No EQ benchmarks found

Insufficient data

Accuracy: 0 benchmarks

No accuracy benchmarks found

Insufficient data

Creativity: 0 benchmarks

No creativity benchmarks found

Insufficient data

Based: 0 benchmarks

No based benchmarks found

Insufficient data

* Low confidence — limited benchmark evidence for this dimension

1/5 dimensions scored · Last updated Apr 21, 2026

Benchmark Signals

LEXam Leaderboard

average_score_pct

Normalized value 63.6% · confidence 100.0%

Strongest impact in Contract Drafting & Redlining

lexam_leaderboard.average_score_pct · Mar 31, 2026

LEXam Leaderboard

open_question_judge_score_pct

Normalized value 66.6% · confidence 100.0%

Strongest impact in Contract redline summary

lexam_leaderboard.open_question_judge_score_pct · Mar 31, 2026

SciArena Leaderboard

rating_elo

Normalized value 62.6% · confidence 100.0%

Strongest impact in Contract redline summary

sciarena_leaderboard.rating_elo · Apr 1, 2026

LEXam Leaderboard

mcq_accuracy_pct

Normalized value 62.0% · confidence 100.0%

Strongest impact in Contract redline summary

lexam_leaderboard.mcq_accuracy_pct · Mar 31, 2026

Aider Polyglot Leaderboard

percent_correct_pct

Normalized value 44.4% · confidence 100.0%

Strongest impact in Contract redline summary

aider_polyglot.percent_correct_pct · Apr 1, 2026

BRIDGE Medical Leaderboard

average_performance_pct

Normalized value 68.6% · confidence 100.0%

Strongest impact in Contract redline summary

bridge_medical_leaderboard.average_performance_pct · Apr 1, 2026

Some fit rows have limited benchmark evidence.

4 of 4 scored use cases have low confidence or thin contributor coverage.

Coverage Diagnostics

actively scored

Use-Case Scores: 4

Total Measurements: 25

Weighted Measurements: 8

Weighted Sources: 4

Raw Source Coverage

bridge_medical_leaderboard 9 · sciarena_leaderboard 7 · aider_polyglot 3 · lexam_leaderboard 3 · openrouter_models 3

Weighted Source Coverage

lexam_leaderboard 3 · aider_polyglot 2 · bridge_medical_leaderboard 2 · sciarena_leaderboard 1
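
As a quick consistency check, the per-source counts above can be re-added and compared with the totals reported on this page. This is only a sketch: the variable names are illustrative and not part of any catalog API, and the numbers are copied by hand from the two coverage lines.

```python
# Per-source row counts, copied from "Raw Source Coverage".
raw_rows = {
    "bridge_medical_leaderboard": 9,
    "sciarena_leaderboard": 7,
    "aider_polyglot": 3,
    "lexam_leaderboard": 3,
    "openrouter_models": 3,
}

# Per-source row counts, copied from "Weighted Source Coverage".
weighted_rows = {
    "lexam_leaderboard": 3,
    "aider_polyglot": 2,
    "bridge_medical_leaderboard": 2,
    "sciarena_leaderboard": 1,
}

raw_total = sum(raw_rows.values())            # compare with "Raw rows"
weighted_total = sum(weighted_rows.values())  # compare with "Weighted rows"
weighted_sources = len(weighted_rows)         # compare with "Weighted Sources"

# Sources that appear in the raw rows but carry no weighted rows.
unweighted_sources = sorted(set(raw_rows) - set(weighted_rows))

print(raw_total, weighted_total, weighted_sources, unweighted_sources)
```

With the figures shown on this page, the sums agree with the reported 25 raw rows, 8 weighted rows, and 4 weighted sources, and openrouter_models is the only source that contributes raw rows without any weighted rows.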

Best Use Cases for This Model

Use Case	ID	Vertical	Score	Confidence	Evidence	Top Contributor
Contract Drafting & Redlining	use_case.legal.contract_drafting	legal	7.2%	11.4%	6	LEXam Leaderboard: average_score_pct
Contract redline summary	use_case.legal.contract_redline_summary	legal	6.7%	10.6%	6	LEXam Leaderboard: average_score_pct
Contract term extraction	use_case.legal.contract_term_extraction	legal	6.5%	10.3%	6	LEXam Leaderboard: average_score_pct
Clause playbook check	use_case.legal.playbook_clause_check	legal	6.5%	10.3%	6	LEXam Leaderboard: average_score_pct
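
The summary figures under Benchmark Coverage can be roughly reproduced from the four rows of this table. The sketch below assumes the average confidence is a plain unweighted mean of the per-row confidences, which the page does not state; the tuple layout and names are illustrative only.

```python
# (use-case ID, score %, confidence %, evidence count), copied from the table.
rows = [
    ("use_case.legal.contract_drafting", 7.2, 11.4, 6),
    ("use_case.legal.contract_redline_summary", 6.7, 10.6, 6),
    ("use_case.legal.contract_term_extraction", 6.5, 10.3, 6),
    ("use_case.legal.playbook_clause_check", 6.5, 10.3, 6),
]

scored_use_cases = len(rows)

# Assumption: "Avg confidence" is an unweighted mean of the row confidences.
mean_confidence = sum(conf for _, _, conf, _ in rows) / scored_use_cases

# Total evidence across rows, to compare with "Evidence points".
total_evidence = sum(evidence for _, _, _, evidence in rows)

# Highest-scoring use case by the Score column.
best_use_case = max(rows, key=lambda row: row[1])[0]

print(scored_use_cases, mean_confidence, total_evidence, best_use_case)
```

Under that assumption the mean confidence comes out at about 10.65%, in line with the reported 10.6% (the page may be averaging unrounded values), and the evidence counts add up to the reported 24 evidence points.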