AI Benchmark Center

Live leaderboards across reasoning, coding, math, expert knowledge, and agent performance.

Current Leaders by Category

🧠 Reasoning & Knowledge · GPT-5 · 92.3

💻 Coding · GPT-5 · 97.1

📐 Mathematics · GPT-5 · 91.5

🎓 Expert Knowledge · GPT-5

🤖 Agent Performance · Devin 2

Overall Model Ranking

(avg. normalized score across all benchmarks)

🥇 GPT-5 · OpenAI · NEW · 100.0 avg. score

🥈 Devin 2 · Cognition AI · NEW · 100.0 avg. score

🥉 Gemini Ultra 2 · Google · NEW · 98.6 avg. score

4. Claude 4 Opus · Anthropic · NEW · 97.9 avg. score

5. Claude 4 Opus + Tools · Anthropic · NEW · 94.3 avg. score

6. DeepSeek R2 · DeepSeek · NEW · 92.7 avg. score

7. Llama 4 Scout · NEW
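
The page does not say how the "avg. normalized score" in the ranking above is computed. One plausible reading, shown here purely as an assumption, is to rescale each benchmark so that its leader scores 100 and then average each model's rescaled scores over the benchmarks it appears in. The sketch below uses only scores listed on this page for MMLU, HumanEval, and MATH; the function name `normalized_average` is hypothetical, and because benchmarks with missing leader scores are left out, the results will not reproduce the published averages exactly.

```python
from statistics import mean

# Scores copied from the category leaderboards on this page. Benchmarks whose
# leader score is missing on the page are left out (assumption of this sketch).
scores = {
    "MMLU": {"GPT-5": 92.3, "Gemini Ultra 2": 91.8, "Claude 4 Opus": 90.1},
    "HumanEval": {"GPT-5": 97.1, "Claude 4 Opus": 95.2, "Gemini Ultra 2": 94.7},
    "MATH": {"GPT-5": 91.5, "Gemini Ultra 2": 90.2, "Claude 4 Opus": 89.3},
}


def normalized_average(scores):
    """Assumed scheme: scale each benchmark so its leader is 100,
    then average per model over the benchmarks it appears in."""
    per_model = {}
    for results in scores.values():
        top = max(results.values())  # leader's raw score on this benchmark
        for model, value in results.items():
            per_model.setdefault(model, []).append(100.0 * value / top)
    return {model: round(mean(vals), 1) for model, vals in per_model.items()}


# Sort models from highest to lowest average normalized score.
ranking = sorted(normalized_average(scores).items(),
                 key=lambda kv: kv[1], reverse=True)
```

Under this scheme a model that leads every benchmark it is scored on averages 100.0, even if it appears on only one benchmark. That would be consistent with two models sharing a 100.0 average in the ranking.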

Reasoning & Knowledge

MMLU · % accuracy

Tests general knowledge, logical reasoning, and multi-step problem solving.

🥇 GPT-5 · OpenAI · NEW · May 20, 2025 · 92.3

🥈 Gemini Ultra 2 · Google · NEW · May 10, 2025 · 91.8

🥉 Claude 4 Opus · Anthropic · NEW · May 15, 2025 · 90.1

Llama 4 Scout · Meta · NEW · May 5, 2025 · 85.4

GPT-4o · OpenAI · May 13, 2024 · 88.7

💻

Coding

HumanEval · % pass@1

Measures ability to write functionally correct code from natural-language descriptions (docstrings).

🥇 GPT-5 · OpenAI · NEW · May 20, 2025 · 97.1

🥈 Claude 4 Opus · Anthropic · NEW · May 15, 2025 · 95.2

🥉 Claude 3.5 Sonnet · Anthropic · Oct 22, 2024 · 93.7

Gemini Ultra 2 · Google · NEW · May 10, 2025 · 94.7

DeepSeek R2 · DeepSeek · NEW · Apr 28, 2025 · 92.1

📐

Mathematics

MATH · % accuracy

Tests advanced mathematical reasoning on problems ranging from competition level to olympiad level.

🥇 GPT-5 · OpenAI · NEW · May 20, 2025 · 91.5

🥈 DeepSeek R2 · DeepSeek · NEW · Apr 28, 2025

🥉 Gemini Ultra 2 · Google · NEW · May 10, 2025 · 90.2

Claude 4 Opus · Anthropic · NEW · May 15, 2025 · 89.3

Qwen 2.5 72B · Alibaba · Sep 19, 2024 · 85.4

🎓

Expert Knowledge

GPQA Diamond · % accuracy

PhD-level questions in biology, chemistry, and physics, written by domain experts.

🥇 GPT-5 · OpenAI · NEW · May 20, 2025

🥈 Gemini Ultra 2 · Google · NEW · May 10, 2025 · 87.1

🥉 Claude 4 Opus · Anthropic · NEW · May 15, 2025 · 86.5

DeepSeek R2 · DeepSeek · NEW · Apr 28, 2025 · 71.5

Qwen 2.5 72B · Alibaba · Sep 19, 2024 · 56.4

🤖

Agent Performance

SWE-bench Verified · % resolved

Real-world software engineering tasks: resolving GitHub issues end to end.

🥇 Devin 2 · Cognition AI · NEW · Apr 1, 2025

🥈 Claude 4 Opus + Tools · Anthropic · NEW · May 15, 2025 · 61.3

🥉 GPT-5 + Tools · OpenAI · NEW · May 20, 2025 · 58.7

SWE-agent (Claude) · Princeton · Mar 10, 2025 · 48.2

Claude 3.5 Sonnet + Tools · Anthropic · Oct 22, 2024

About These Benchmarks

MMLU (Massive Multitask Language Understanding)

Tests knowledge across 57 subjects spanning STEM, humanities, and social sciences, with more than 14,000 multiple-choice questions.

HumanEval (Coding)

164 hand-crafted programming challenges. Measures ability to produce correct code from docstrings.
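
The leaderboard reports HumanEval as "% pass@1". As a rough illustration, the sketch below implements the unbiased pass@k estimator from the original HumanEval paper: if `n` samples are generated for a problem and `c` of them pass the unit tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k). The function names and the per-problem `(n, c)` input format are hypothetical, and the page does not state how many samples per problem were drawn for the scores shown.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: sample budget.
    """
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems, as a percentage.

    results: list of (n, c) pairs, one per problem (hypothetical format).
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimate reduces to c / n per problem, so with a single sample per problem the benchmark score is simply the percentage of problems whose one generated solution passes.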

MATH

12,500 competition math problems from contests such as AMC 10, AMC 12, and AIME. Tests advanced mathematical reasoning.

SWE-bench Verified

Real GitHub issues from popular open-source repos. Measures end-to-end software engineering capability.