AI Benchmark Center

Live leaderboards across reasoning, coding, math, expert knowledge, and agent performance.

Current Leaders by Category

🧠 Reasoning & Knowledge · GPT-5 · 92.3
💻 Coding · GPT-5 · 97.1
📐 Mathematics · GPT-5 · 91.5
🎓 Expert Knowledge · GPT-5 · 88
🤖 Agent Performance · Devin 2 · 65

Overall Model Ranking

(avg. normalized score across all benchmarks)
🥇 · GPT-5 · OpenAI · NEW · 100.0 avg. score
🥈 · Devin 2 · Cognition AI · NEW · 100.0 avg. score
🥉 · Gemini Ultra 2 · Google · NEW · 98.6 avg. score
4 · Claude 4 Opus · Anthropic · NEW · 97.9 avg. score
5 · Claude 4 Opus + Tools · Anthropic · NEW · 94.3 avg. score
6 · DeepSeek R2 · DeepSeek · NEW · 92.7 avg. score
7 · Llama 4 Scout · Meta · NEW · 92.0 avg. score
8 · GPT-5 + Tools · OpenAI · NEW · 90.3 avg. score
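
The page does not say how scores are normalized. One reading, offered here only as an assumption, is that each score is divided by the category leader's score, scaled to 100, and averaged over the benchmarks where the model has a result. The sketch below uses the scores shown in the per-category tables on this page; the names `SCORES` and `normalized_average` are illustrative. Under this assumption the displayed 98.6 for Gemini Ultra 2 and 97.9 for Claude 4 Opus are reproduced. Models with results that are not listed in the top-five tables cannot be checked this way.

```python
from statistics import mean

# Scores copied from the per-category tables on this page (subset of rows).
SCORES = {
    "MMLU": {"GPT-5": 92.3, "Gemini Ultra 2": 91.8, "Claude 4 Opus": 90.1},
    "HumanEval": {"GPT-5": 97.1, "Claude 4 Opus": 95.2, "Gemini Ultra 2": 94.7},
    "MATH": {"GPT-5": 91.5, "Gemini Ultra 2": 90.2, "Claude 4 Opus": 89.3},
    "GPQA Diamond": {"GPT-5": 88.0, "Gemini Ultra 2": 87.1, "Claude 4 Opus": 86.5},
    "SWE-bench Verified": {
        "Devin 2": 65.0,
        "Claude 4 Opus + Tools": 61.3,
        "GPT-5 + Tools": 58.7,
    },
}


def normalized_average(model: str, scores: dict[str, dict[str, float]]) -> float:
    """Average of (score / category leader * 100) over benchmarks with a result.

    ASSUMPTION: leader-relative normalization; the page does not document
    the formula it actually uses.
    """
    parts = []
    for table in scores.values():
        if model in table:
            leader = max(table.values())
            parts.append(100.0 * table[model] / leader)
    if not parts:
        raise KeyError(f"no scores recorded for {model!r}")
    return mean(parts)


if __name__ == "__main__":
    for name in ("Gemini Ultra 2", "Claude 4 Opus"):
        print(name, round(normalized_average(name, SCORES), 1))
    # Gemini Ultra 2 98.6
    # Claude 4 Opus 97.9
```

A consequence of this scheme is that a model evaluated on a single benchmark it leads, such as Devin 2 on SWE-bench Verified, averages 100.0 and ties a model that leads four categories.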
🧠

Reasoning & Knowledge

MMLU · % accuracy

Tests general knowledge and reasoning with multiple-choice questions across academic and professional subjects.

🥇 · GPT-5 · OpenAI · NEW · May 20, 2025 · 92.3
🥈 · Gemini Ultra 2 · Google · NEW · May 10, 2025 · 91.8
🥉 · Claude 4 Opus · Anthropic · NEW · May 15, 2025 · 90.1
4 · GPT-4o · OpenAI · May 13, 2024 · 88.7
5 · Llama 4 Scout · Meta · NEW · May 5, 2025 · 85.4
💻

Coding

HumanEval · % pass@1

Measures ability to write functionally correct code from natural language descriptions.

🥇 · GPT-5 · OpenAI · NEW · May 20, 2025 · 97.1
🥈 · Claude 4 Opus · Anthropic · NEW · May 15, 2025 · 95.2
🥉 · Gemini Ultra 2 · Google · NEW · May 10, 2025 · 94.7
4 · Claude 3.5 Sonnet · Anthropic · Oct 22, 2024 · 93.7
5 · DeepSeek R2 · DeepSeek · NEW · Apr 28, 2025 · 92.1
📐

Mathematics

MATH · % accuracy

Tests advanced mathematical reasoning from competition-level to olympiad problems.

🥇 · GPT-5 · OpenAI · NEW · May 20, 2025 · 91.5
🥈 · DeepSeek R2 · DeepSeek · NEW · Apr 28, 2025 · 91
🥉 · Gemini Ultra 2 · Google · NEW · May 10, 2025 · 90.2
4 · Claude 4 Opus · Anthropic · NEW · May 15, 2025 · 89.3
5 · Qwen 2.5 72B · Alibaba · Sep 19, 2024 · 85.4
🎓

Expert Knowledge

GPQA Diamond · % accuracy

PhD-level questions in biology, chemistry, and physics, written by domain experts.

🥇 · GPT-5 · OpenAI · NEW · May 20, 2025 · 88
🥈 · Gemini Ultra 2 · Google · NEW · May 10, 2025 · 87.1
🥉 · Claude 4 Opus · Anthropic · NEW · May 15, 2025 · 86.5
4 · DeepSeek R2 · DeepSeek · NEW · Apr 28, 2025 · 71.5
5 · Qwen 2.5 72B · Alibaba · Sep 19, 2024 · 56.4
🤖

Agent Performance

SWE-bench Verified · % resolved

Real-world software engineering tasks — fixing GitHub issues end-to-end.

🥇 · Devin 2 · Cognition AI · NEW · Apr 1, 2025 · 65
🥈 · Claude 4 Opus + Tools · Anthropic · NEW · May 15, 2025 · 61.3
🥉 · GPT-5 + Tools · OpenAI · NEW · May 20, 2025 · 58.7
4 · Claude 3.5 Sonnet + Tools · Anthropic · Oct 22, 2024 · 49
5 · SWE-agent (Claude) · Princeton · Mar 10, 2025 · 48.2

About These Benchmarks

MMLU (Massive Multitask Language Understanding)

Tests knowledge across 57 subjects including STEM, humanities, and social sciences. 14,000+ questions.

HumanEval (Coding)

164 hand-crafted programming challenges. Measures ability to produce correct code from docstrings.
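
The pass@1 metric used in the Coding table is the k = 1 case of pass@k. HumanEval results are commonly computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. A minimal sketch follows; the function names and the example counts are illustrative, and this page does not state how many samples per problem were used for the scores above.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Percent pass@1 over (n, c) pairs, one pair per problem."""
    return 100.0 * mean(pass_at_k(n, c, 1) for n, c in results)


if __name__ == "__main__":
    # Hypothetical counts for four problems with one sample each.
    print(benchmark_pass_at_1([(1, 1), (1, 0), (1, 1), (1, 1)]))
```

With k = 1 the estimator reduces to c / n, so with one sample per problem the reported figure is simply the percentage of problems solved.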

MATH

12,500 competition math problems drawn from contests such as AMC 10, AMC 12, and AIME. Tests advanced mathematical reasoning.

SWE-bench Verified

Real GitHub issues from popular open-source repos. Measures end-to-end software engineering capability.