BasedAGI
Intelligence Dimension · Live data

Most Creative LLMs (2026)

Creativity is the hardest dimension to benchmark — and the one where most evaluations get it wrong. The usual failure mode is measuring fluency instead of originality: a model that produces grammatically perfect, coherent prose scores well even if everything it writes is formulaic, predictable, and safe. Fluency is not creativity. In fact, high fluency combined with low originality is the exact failure mode that makes a model useless for creative work — it produces content that reads well but says nothing new.

Our Creativity score is designed to penalize exactly this. The models that score highest here don't just produce clean output; they produce output that surprises.

What Creativity Actually Means in This Context

Creativity in language models is multidimensional, and the dimensions don't always correlate:

Originality — Does the model generate content that diverges from the obvious? Given a prompt, does it explore the interesting corner of the space, or does it produce the median answer? High-originality models write the unexpected character motivation, the non-obvious plot turn, the metaphor you haven't seen before.

Coherent novelty — Randomness is not creativity. A model that produces incoherent outputs has no creativity; it has noise. True creative output is novel and internally consistent — surprising in a way that makes sense in retrospect.

Voice and register — The best creative writers don't sound like everyone else. High-creativity models develop consistent stylistic choices rather than averaging across all the writing they've seen. This shows up in sentence rhythm, vocabulary selection, and the specific texture of the prose.

Generative range — Can the model produce genuinely different outputs on different passes of the same prompt? A creative model has range; a low-creativity model converges to the same response with minor variation.
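Generative range can be sketched, at least at the surface level, with a simple diversity metric: sample the same prompt several times and measure how much the outputs overlap. The sketch below uses character-trigram Jaccard overlap as a crude, tokenizer-free fingerprint; the `generative_range` helper and the threshold of "good" diversity are illustrative assumptions, not our actual scoring metric.

```python
from itertools import combinations

def ngrams(text, n=3):
    """Character n-grams as a crude, tokenizer-free fingerprint of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def generative_range(samples, n=3):
    """1 minus mean pairwise Jaccard similarity across outputs for one prompt.

    0.0 means every sample is identical (the model converged);
    values near 1.0 mean the samples share almost no surface overlap.
    """
    if len(samples) < 2:
        return 0.0
    sims = []
    for a, b in combinations(samples, 2):
        ga, gb = ngrams(a, n), ngrams(b, n)
        sims.append(len(ga & gb) / len(ga | gb) if ga | gb else 1.0)
    return 1.0 - sum(sims) / len(sims)

# A model that converges to the same answer scores 0:
low = ["The dragon guarded its gold."] * 4
# A model with range scores much higher:
high = [
    "The dragon guarded its gold.",
    "A lighthouse keeper taught gulls to sing.",
    "Rust bloomed across the orbital ring.",
    "She sold maps to countries that never existed.",
]
print(generative_range(low))   # -> 0.0
print(generative_range(high))  # high, close to 1.0
```

Surface overlap only catches verbatim convergence; two paraphrases of the same plot would still score as "diverse" here, which is why real evaluations also use semantic similarity.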

Creativity and Accuracy are the most inversely correlated dimensions we track. The qualities that make a model a great creative writer — willingness to speculate, to fill in gaps imaginatively, to make confident leaps — are exactly the qualities that produce hallucination in factual contexts. This isn't a flaw; it's a feature. Know which use case you're in.
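The inverse relationship described above is an ordinary rank correlation over per-model dimension scores. The self-contained Spearman sketch below shows the pattern; the score lists are made-up illustrative numbers, not values from our leaderboard.

```python
def rankdata(xs):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-model scores; a negative coefficient is the
# "inverse correlation" pattern described above.
creativity = [97.1, 95.5, 92.1, 86.4, 80.1, 74.3]
accuracy   = [71.0, 74.5, 78.0, 83.2, 88.9, 92.4]
print(round(spearman(creativity, accuracy), 2))  # -> -1.0
```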

How We Score It

Creativity is harder to benchmark objectively than IQ or Accuracy, which is why our confidence scores for this dimension tend to be lower. The benchmark families we use:

MT-Bench creative writing tasks are human-evaluated, open-ended generation tasks that assess the quality of creative output rather than factual correctness. They test story continuation, character writing, and imaginative scenario generation.

Open-ended generation benchmarks evaluate output diversity and originality at scale — whether a model consistently produces varied, surprising content or converges to safe defaults.

Dialogue quality evaluations — particularly relevant for NPC and character writing — assess whether a model can sustain a consistent, interesting voice across a conversational exchange rather than drifting toward generic assistant-speak.
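One plausible way to combine benchmark families like these into a single dimension score is a confidence-weighted mean. The family names, weights, and helper below are hypothetical, shown only to make the idea concrete; the actual aggregation is documented at /methodology.

```python
def dimension_score(family_scores):
    """Confidence-weighted mean over benchmark families.

    family_scores: {family_name: (score_0_to_100, confidence_0_to_1)}
    Returns (aggregate_score, overall_confidence). Families with higher
    confidence pull the aggregate toward their score; overall confidence
    is the mean of the per-family confidences.
    """
    total_w = sum(c for _, c in family_scores.values())
    score = sum(s * c for s, c in family_scores.values()) / total_w
    confidence = total_w / len(family_scores)
    return round(score, 1), round(confidence, 2)

# Hypothetical per-family results for one model:
scores = {
    "mt_bench_creative": (88.0, 0.7),    # human-evaluated, higher confidence
    "open_ended_diversity": (82.0, 0.5),
    "dialogue_quality": (79.0, 0.4),
}
print(dimension_score(scores))  # -> (83.9, 0.53)
```

The low overall confidence (0.53) reflects the point made above: creativity benchmarks carry less weight individually, so the dimension-level confidence stays modest.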

Current Rankings

Creativity Rankings

Creative expression & generative quality

Top 25 · Live
 #   Model                                         Creativity
 1   external/xai/grok-4-0709                            97.1
 2   external/anthropic/claude-opus-4-6                  95.5
 3   external/kimi/kimi-k2-5-thinking                    92.1
 4   external/openai/gpt-4o-2024-05-13                   89.0
 5   zai-org/GLM-5                                       86.9
 6   external/anthropic/claude-sonnet-4-6                86.4
 7   deepseek-ai/DeepSeek-V3.2-Speciale                  86.1
 8   external/openai/gpt-5-4-2026-03-05                  84.9
 9   external/openai/gpt-4-1-20250414                    84.6
10   zai-org/GLM-4.5                                     84.3
11   moonshotai/Kimi-K2-Thinking                         82.9
12   external/google/gemini-2-5-pro                      82.8
13   external/x-ai/grok-3                                82.1
14   external/openai/o3-20250416                         80.8
15   external/kimi/kimi-k2-thinking                      80.6
16   moonshotai/Kimi-K2-Instruct                         80.4
17   external/openai/gpt-5-2025-08-07                    80.1
18   external/google/gemini-3-pro-preview                79.5
19   external/google/gemini-3-1-pro-preview              79.4
20   zai-org/GLM-4.7                                     78.2
21   moonshotai/Kimi-K2-Instruct-0905                    78.0
22   external/anthropic/claude-opus-4-5-20251101         77.9
23   zai-org/GLM-4.6                                     77.1
24   external/openai/gpt-4o                              74.3
25   external/anthropic/claude-sonnet-4                  72.8

What the Data Shows

Larger models are not consistently more creative. Unlike IQ — where parameter count has a meaningful positive effect — creativity doesn't scale cleanly with model size. Some of the most creative models are mid-sized; some of the largest models produce the safest, most formulaic output. Training distribution and fine-tuning objectives matter more than scale for creativity.

Over-alignment suppresses creativity. This is one of the most consistent findings in the data. Models that have been heavily fine-tuned for safety and helpfulness tend to produce more conservative, formulaic creative output. They've been trained to stay close to the center of human preferences — which is exactly the wrong objective for creative work, where the interesting output is at the edges.

Instruction-following and creativity trade off at the margin. Models that rigidly follow every instruction constraint produce tighter creative output but less interesting creative output. The best creative models are those that understand the spirit of a prompt rather than the letter — they know when to follow the brief and when to surprise you within it.

Roleplay and persona consistency are strong creativity predictors. Models that score high on creativity can maintain a consistent character voice over a long exchange. This turns out to be a good proxy because it requires both originality (the voice needs to be interesting) and coherent novelty (the voice needs to stay consistent).

When Creativity Is the Right Signal

Creativity should be your primary dimension when:

  • Fiction and longform creative writing — Novels, short stories, genre fiction. The quality ceiling here is set entirely by creativity.

  • Screenwriting and dialogue — Script dialogue specifically requires both character voice (creativity) and interpersonal dynamics (EQ). Check both dimensions for screenwriting use cases.

  • Game writing and NPC dialogue — Game characters need to feel like distinct people across potentially hundreds of interactions. High-creativity models do this; low-creativity models make every NPC sound the same.

  • Marketing and brand voice — Copy that stands out requires originality. A model that produces "compelling" but generic ad copy hasn't done the job.

  • Brainstorming and ideation — When you need the non-obvious idea, not validation of the obvious one.

Creative output quality is harder to evaluate systematically than factual output quality. Confidence scores for the Creativity dimension tend to be lower than for IQ or Accuracy — there are fewer well-established benchmarks with strong inter-rater reliability. Where confidence is below 40%, treat rankings as directional rather than definitive.

Methodology & Confidence

Because creativity evaluation involves more human judgment than other dimensions, we apply a higher threshold for source reliability in the Creativity dimension. Sources without documented inter-rater agreement are excluded. Full methodology at /methodology.
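Inter-rater agreement of the kind mentioned above is typically quantified with a chance-corrected statistic such as Cohen's kappa, which discounts the agreement two raters would reach by guessing alone. A minimal sketch (the ten ratings are hypothetical, and real evaluations with more than two raters would use a multi-rater statistic such as Fleiss' kappa):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    1.0 = perfect agreement, 0.0 = chance-level, < 0 = worse than chance.
    """
    n = len(rater_a)
    assert n == len(rater_b) and n > 0
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance of a match given each rater's label rates.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Two raters scoring ten creative samples on a 1-5 scale (hypothetical):
a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
b = [5, 4, 3, 3, 5, 2, 4, 2, 5, 4]
print(round(cohens_kappa(a, b), 2))  # -> 0.73
```

Raw agreement here is 80%, but kappa lands lower (0.73) because some of that agreement would happen by chance, which is exactly why a documented-agreement threshold filters out noisier sources.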

Related Reports