Leaderboard
Rankings are built from human pairwise votes. In each battle, two models respond to the same prompt — voters choose which response better embodies the target value. Bradley-Terry scoring converts those votes into Elo ratings across four dimensions.
Bradley-Terry Elo computed from human pairwise votes.
Updated every 5 min · live from votes
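The Bradley-Terry conversion described above can be sketched in a few lines. The minorization-maximization fit below is the standard estimator for pairwise-comparison data; the anchoring constants (base rating 1000, Elo scale 400) and the small smoothing prior are illustrative assumptions, not the site's published parameters.

```python
import math

def bradley_terry_elo(wins, models, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from pairwise vote counts and map
    them onto an Elo-like scale.

    wins[(a, b)] = number of votes where model a beat model b.
    prior adds a phantom fraction of a win in each direction so that
    undefeated or winless models stay finite (an assumption, not
    necessarily the site's exact regularization).
    """
    pairs = [(a, b) for a in models for b in models if a != b]
    w = {(a, b): wins.get((a, b), 0) + prior for (a, b) in pairs}
    p = {m: 1.0 for m in models}  # BT strength parameters
    for _ in range(iters):
        new_p = {}
        for m in models:
            num = sum(w[(m, o)] for o in models if o != m)
            den = sum((w[(m, o)] + w[(o, m)]) / (p[m] + p[o])
                      for o in models if o != m)
            new_p[m] = num / den  # MM update step
        # Normalize so the geometric mean of strengths stays at 1.
        g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / g for m, v in new_p.items()}
    # Strength -> Elo-style rating (400-point scale centered on 1000).
    return {m: round(1000 + 400 * math.log10(v)) for m, v in p.items()}
```

With two models and two votes both won by the same model, the winner lands symmetrically above 1000 and the loser below it; more lopsided records push the gap wider.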
Kindness
How warmly each model responds to human-centered prompts.
| # | Model | Win Rate | BT Elo |
|---|---|---|---|
| 1 | Claude 3.5 Haiku | 100% | 1620 |
| 2 | GPT-4o Mini | 100% | 1620 |
| 3 | Llama 3.3 70B | 0% | 300 |
| 4 | Mistral Small 3.1 | 0% | 300 |
Conservatism
Preference for stability, institutions, and incremental change.
| # | Model | Win Rate | BT Elo |
|---|---|---|---|
| 1 | Llama 3.3 70B | 100% | 1741 |
| 2 | Gemini 2.0 Flash | 50% | 300 |
| 3 | GPT-4o Mini | 50% | 300 |
| 4 | Claude 3.5 Haiku | 50% | 300 |
Deep Ecology
Alignment with ecological stewardship and planet-first values.
Not enough votes yet — keep judging to unlock rankings.
Loyalty
Commitment to sustained relationships and group solidarity.
| # | Model | Win Rate | BT Elo |
|---|---|---|---|
| 1 | Qwen 2.5 72B | 100% | 1570 |
| 2 | Claude 3.5 Haiku | 100% | 1570 |
| 3 | Mistral Small 3.1 | 0% | 300 |
How each model ranks across all four value dimensions. Each cell shows the model's rank and BT Elo in that dimension.
| Model | Kindness | Conservatism | Deep Ecology | Loyalty | Avg Rank |
|---|---|---|---|---|---|
| Qwen 2.5 72B | — | — | — | #1 (1570) | 1.0 |
| Gemini 2.0 Flash | — | #2 (300) | — | — | 2.0 |
| Llama 3.3 70B | #3 (300) | #1 (1741) | — | — | 2.0 |
| GPT-4o Mini | #2 (1620) | #3 (300) | — | — | 2.5 |
| Claude 3.5 Haiku | — | #4 (300) | — | #2 (1570) | 3.0 |
| Mistral Small 3.1 | #4 (300) | — | — | #3 (300) | 3.5 |
Win rate of the row model vs each column opponent; ties are split 50/50. Each cell shows the win rate and the number of votes (e.g. 100% 1v means one vote, one win).
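The tie-splitting rule above amounts to counting each tie as half a win. A minimal sketch (the function name and signature are illustrative, not the site's code):

```python
def win_rate(wins, losses, ties=0):
    """Head-to-head win rate with ties split 50/50.

    Returns None when there are no votes, which the matrix renders
    as an em-dash.
    """
    votes = wins + losses + ties
    if votes == 0:
        return None
    return (wins + 0.5 * ties) / votes

# A 1-1 record yields 0.5, shown in the matrix as "50% 2v".
win_rate(1, 1)
# No votes yields None, shown as "—".
win_rate(0, 0)
```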
Kindness: row wins vs column
| | 4oMi | Haik | Gmni | Llma | Mist | Qwen |
|---|---|---|---|---|---|---|
| GPT-4o Mini | — | — | — | 100% 1v | 100% 1v | — |
| Claude 3.5 Haiku | — | — | — | — | — | — |
| Gemini 2.0 Flash | — | — | — | — | — | — |
| Llama 3.3 70B | 0% 1v | — | — | — | — | — |
| Mistral Small 3.1 | 0% 1v | — | — | — | — | — |
| Qwen 2.5 72B | — | — | — | — | — | — |
Conservatism: row wins vs column
| | 4oMi | Haik | Gmni | Llma | Mist | Qwen |
|---|---|---|---|---|---|---|
| GPT-4o Mini | — | 50% 2v | 0% 1v | — | — | — |
| Claude 3.5 Haiku | 50% 2v | — | — | — | — | — |
| Gemini 2.0 Flash | 100% 1v | — | — | 0% 2v | — | — |
| Llama 3.3 70B | — | — | 100% 2v | — | — | — |
| Mistral Small 3.1 | — | — | — | — | — | — |
| Qwen 2.5 72B | — | — | — | — | — | — |
Deep Ecology: row wins vs column
| | 4oMi | Haik | Gmni | Llma | Mist | Qwen |
|---|---|---|---|---|---|---|
| GPT-4o Mini | — | — | — | — | — | — |
| Claude 3.5 Haiku | — | — | — | — | — | — |
| Gemini 2.0 Flash | — | — | — | — | — | — |
| Llama 3.3 70B | — | — | — | — | — | — |
| Mistral Small 3.1 | — | — | — | — | — | — |
| Qwen 2.5 72B | — | — | — | — | — | — |
Loyalty: row wins vs column
| | 4oMi | Haik | Gmni | Llma | Mist | Qwen |
|---|---|---|---|---|---|---|
| GPT-4o Mini | — | — | — | — | — | — |
| Claude 3.5 Haiku | — | — | — | — | 100% 1v | — |
| Gemini 2.0 Flash | — | — | — | — | — | — |
| Llama 3.3 70B | — | — | — | — | — | — |
| Mistral Small 3.1 | — | 0% 1v | — | — | — | 0% 1v |
| Qwen 2.5 72B | — | — | — | — | 100% 1v | — |
Contribute
Head to the battle page, read two model responses side by side, and pick which one better reflects the target value. Every vote updates the leaderboard.