Most Emotionally Intelligent LLMs (2026)

Most LLM benchmarks measure whether a model can solve math problems or write code. Those are useful signals. But they tell you almost nothing about whether a model can navigate a difficult conversation with a distressed customer, write a character who feels genuinely human, or pick up on the subtext in a message that says "I'm fine" but clearly isn't.

That's what EQ measures — and it's the dimension where model rankings diverge most sharply from raw intelligence.

What EQ Actually Measures

Emotional intelligence in language models comes down to three separable capabilities:

Empathic accuracy — Does the model correctly identify and label what someone is feeling, and does it respond appropriately rather than just technically correctly? A model with high empathic accuracy knows the difference between a user who needs information and a user who needs to feel heard first.

Theory of mind — Can the model track what other people know, believe, want, and intend — and reason about how those mental states differ from its own? This is the cognitive backbone of EQ. Without it, a model can't write convincing dialogue, can't anticipate how advice will land, and can't navigate situations where the spoken and unspoken messages diverge.

Social reasoning — Does the model understand the unwritten rules governing human interaction? Social norms, face-saving, indirect communication, the weight of silence — these are things a high-EQ model navigates fluently.

How We Score It

Our EQ dimension aggregates across several benchmark families:

EQ-Bench is the most direct signal — it presents emotionally complex scenarios and evaluates how accurately the model identifies and responds to the emotional dynamics. Unlike most benchmarks, it's resistant to the "say the right words" failure mode; a model can't score well by parroting therapeutic language without understanding the situation.

Theory of Mind benchmarks test whether a model can model the mental states of characters in a scenario — classic false-belief tasks extended to more naturalistic situations. These are hard to game and strongly predictive of real-world EQ performance.

Social IQa and related datasets test commonsense social reasoning — given a situation, what's the appropriate response, and why? These catch models that are technically smart but socially oblivious.

EQ and IQ are weakly correlated in LLMs — high reasoning ability doesn't predict high emotional intelligence. Some of the most analytically capable models score mediocre on EQ, while some mid-tier reasoning models are surprisingly strong on social and emotional tasks.

Current Rankings

EQ Rankings

Emotional intelligence & social understanding

Top 25 · Live

#	Model	EQ	Confidence	IQ	Accuracy
1	external/openai/gpt-5-2025-08-07	97.8	15%	76.7	87.9
2	zai-org/GLM-4.5	93.3	8%	70.6	80.8
3	zai-org/GLM-4.6	93.3	45%	61.3	83.8
4	external/openai/gpt-4-1	91.8	8%	57.2	65.8
5	meta-llama/Meta-Llama-3-70B-Instruct	91.6	8%	91.9	—
6	deepseek-ai/DeepSeek-V2.5	91.4	8%	—	—
7	external/anthropic/claude-opus-4	91.2	8%	71.3	74.7
8	Qwen/Qwen2-72B-Instruct	90.7	8%	91.1	—
9	external/anthropic/claude-sonnet-4	90.4	8%	57.6	68.2
10	google/gemma-2-27b-it	89.8	8%	67.3	—
11	Qwen/QwQ-32B	89.1	8%	—	—
12	Qwen/Qwen2.5-32B-Instruct	89.1	8%	19.7	—
13	dphn/dolphin-2.2-70b	88.7	8%	—	—
14	external/google/gemini-3-pro-preview	88.7	51%	82.2	88.6
15	Qwen/Qwen2.5-14B-Instruct	88.3	8%	14.6	—
16	OpenLLM-Korea/solar-pro-preview-instruct	87.5	16%	—	—
17	external/openai/gpt-4o	86.6	29%	60.6	69.6
18	external/x-ai/grok-3	86.3	8%	76.5	—
19	microsoft/Phi-3.5-MoE-instruct	85.8	8%	—	—
20	external/xai-org/grok-4-1-fast-reasoning	85.5	45%	56.0	86.8
21	microsoft/Phi-3-medium-4k-instruct	85.1	8%	—	—
22	Qwen/Qwen3-8B	83.2	8%	—	—
23	Qwen/Qwen3-32B	82.8	8%	—	—
24	berkeley-nest/Starling-LM-7B-alpha	82.4	8%	—	—
25	meta-llama/Llama-2-70b-chat-hf	82.0	8%	—	—

What the Data Shows

A few patterns are consistently visible in the EQ rankings:

Instruction-tuned models trained on dialogue heavily outperform base models. This is expected — conversational fine-tuning exposes models to the texture of human emotion in a way that pure text pretraining doesn't. Models trained on diverse human feedback tend to develop better social calibration as a side effect.

Scale helps, but not as much as training distribution. A 70B model trained on impoverished dialogue data will often score lower on EQ than a carefully fine-tuned 13B model. This is the dimension where "who trained it and on what" matters more than raw parameter count.

Over-safety fine-tuning hurts EQ. Models that have been heavily restricted to avoid sensitive topics often lose the ability to engage authentically with emotional content. Emotional intelligence requires the ability to engage with difficult feelings — models trained to deflect or refuse tend to score poorly because they can't meet users where they are.

When EQ Is the Right Signal to Optimize For

EQ should be your primary ranking signal when:

Customer-facing conversational AI — Support bots, assistants, and chatbots where the interaction quality matters as much as the answer quality. Users don't just want the right answer; they want to feel understood.
Mental health and wellness applications — Coaching tools, journaling assistants, and therapeutic support contexts where emotional attunement is the core product requirement.
Creative writing with human characters — Fiction, screenwriting, game dialogue, NPC writing. High-EQ models write characters who feel like people. Low-EQ models write characters who sound like plot functions.
HR and recruiting tools — Resume feedback, interview preparation, and sensitive communication tasks where tone and empathy matter.
Education and tutoring — A tutor that can read student frustration and adapt its approach will always outperform one that just delivers correct information.

EQ score does not predict factual accuracy or code quality. For technical tasks, optimize for IQ and Accuracy instead. Mixing up the signal you're optimizing for is the most common mistake in model selection.

Methodology & Confidence

EQ scores require a minimum of one evidence point per benchmark family and a 0.05 confidence threshold to appear in rankings. Models with sparse coverage are excluded — we don't extrapolate from single-benchmark results.

Confidence scores range from 0–100%. A model with 80%+ confidence has strong multi-source coverage; a model at 30% has limited but non-trivial evidence. When in doubt, weight higher-confidence entries more heavily in your selection process.

Full scoring methodology is documented at /methodology.