AI Benchmark Center
Live leaderboards across reasoning, coding, math, expert knowledge, and agent performance.
Current Leaders by Category
Overall Model Ranking
(avg. normalized score across all benchmarks)
Reasoning & Knowledge
MMLU · % accuracy
Tests general knowledge, logical reasoning, and multi-step problem solving.
May 20, 2025
May 10, 2025
May 15, 2025
May 5, 2025
May 13, 2024
Coding
HumanEval · % pass@1
Measures ability to write functionally correct code from natural-language descriptions.
May 20, 2025
May 15, 2025
Oct 22, 2024
May 10, 2025
Apr 28, 2025
Mathematics
MATH · % accuracy
Tests advanced mathematical reasoning on competition-level problems.
May 20, 2025
Apr 28, 2025
May 10, 2025
May 15, 2025
Sep 19, 2024
Expert Knowledge
GPQA Diamond · % accuracy
Graduate-level questions in biology, physics, and chemistry, written by domain experts.
May 20, 2025
May 10, 2025
May 15, 2025
Apr 28, 2025
Sep 19, 2024
Agent Performance
SWE-bench Verified · % resolved
Real-world software engineering tasks: resolving GitHub issues end to end.
Apr 1, 2025
May 15, 2025
May 20, 2025
Mar 10, 2025
Oct 22, 2024
About These Benchmarks
MMLU (Massive Multitask Language Understanding)
Tests knowledge across 57 subjects including STEM, humanities, and social sciences. 14,000+ questions.
HumanEval (Coding)
164 hand-crafted programming challenges. Measures ability to produce correct code from docstrings.
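The pass@1 metric used for HumanEval can be sketched with the standard unbiased pass@k estimator: for each problem, n samples are generated, c of them pass the unit tests, and the estimate is 1 - C(n-c, k) / C(n, k). This page does not say how many samples per problem are drawn, so the function names and the (n, c) input format below are illustrative assumptions rather than the leaderboard's actual pipeline.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over all problems, as a percentage.

    results: one hypothetical (n, c) pair per problem.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (n = 1, k = 1), this reduces to the plain fraction of problems solved on the first attempt.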
MATH
12,500 competition math problems drawn from contests such as AMC 10, AMC 12, and AIME. Tests advanced mathematical reasoning.
SWE-bench Verified
Real GitHub issues from popular open-source repos. Measures end-to-end software engineering capability.
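The overall ranking at the top of this page is described as an average of normalized scores across all benchmarks. The normalization method is not stated, so the sketch below assumes per-benchmark min-max scaling followed by a plain mean; the function name, the input layout, and the placeholder model names are hypothetical.

```python
def overall_ranking(scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Rank models by their mean min-max-normalized score.

    scores: {benchmark_name: {model_name: raw_score}}.
    Assumption: each benchmark is rescaled to [0, 1] across the models
    that report it, and a model is averaged only over those benchmarks.
    """
    totals: dict[str, float] = {}
    counts: dict[str, int] = {}
    for by_model in scores.values():
        lo, hi = min(by_model.values()), max(by_model.values())
        span = hi - lo
        for model, raw in by_model.items():
            # If every model ties on a benchmark, give all of them full credit.
            norm = (raw - lo) / span if span else 1.0
            totals[model] = totals.get(model, 0.0) + norm
            counts[model] = counts.get(model, 0) + 1
    averages = {m: totals[m] / counts[m] for m in totals}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative numbers only, not results from this leaderboard.
example = {
    "benchmark_1": {"model_a": 90.0, "model_b": 80.0, "model_c": 70.0},
    "benchmark_2": {"model_a": 55.0, "model_b": 60.0, "model_c": 40.0},
}
ranking = overall_ranking(example)
```

Min-max scaling keeps a benchmark with a narrow score range from being drowned out by one with a wide range, but it makes every model's score depend on which other models are listed.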