AI Model Benchmarks

Compare foundation model performance across coding, reasoning, math, and multimodal tasks. Scores are composite aggregates from MMLU-Pro, HumanEval, ARC, GPQA, and internal evaluations.
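The exact aggregation behind the Overall Score column is not spelled out here, so the snippet below is only a minimal sketch of one plausible approach: a weighted average across the listed benchmarks. The weights and per-benchmark scores are illustrative assumptions, not the table's actual methodology.

```python
# Sketch of how a composite "Overall Score" could be aggregated.
# The weights and per-benchmark scores below are illustrative placeholders,
# NOT the methodology actually used for this table.

BENCHMARK_WEIGHTS = {
    "MMLU-Pro": 0.30,
    "HumanEval": 0.25,
    "ARC": 0.15,
    "GPQA": 0.20,
    "internal": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (each on a 0-100 scale)."""
    total_weight = sum(BENCHMARK_WEIGHTS[name] for name in scores)
    weighted = sum(BENCHMARK_WEIGHTS[name] * value for name, value in scores.items())
    return round(weighted / total_weight, 1)

# Hypothetical per-benchmark results for a single model.
example = {"MMLU-Pro": 97.0, "HumanEval": 98.0, "ARC": 96.0, "GPQA": 95.0, "internal": 97.0}
print(composite_score(example))  # 96.7 with these placeholder inputs
```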

| # | Model | Provider | Overall Score | Context | $/1M tokens (in/out) | Released |
|---|-------|----------|---------------|---------|----------------------|----------|
| 1 | Claude 4 Opus | Anthropic | 96.8 | 200K | $15 / $75 | Sep 2025 |
| 2 | Gemini 2.5 Pro | Google | 95.9 | 2.0M | $2.50 / $10 | Dec 2025 |
| 3 | GPT-4.5 | OpenAI | 95.4 | 128K | $10 / $30 | Aug 2025 |
| 4 | Llama 4 Maverick | Meta | 93.5 | 1.0M | $0.20 / $0.60 | Apr 2025 |
| 5 | Grok 3 | xAI | 93.0 | 128K | $3 / $15 | Feb 2025 |
| 6 | DeepSeek R1 | DeepSeek | 92.8 | 128K | $0.55 / $2.19 | Jan 2025 |
| 7 | Claude 3.5 Sonnet | Anthropic | 92.1 | 200K | $3 / $15 | Jun 2024 |
| 8 | GPT-4o | OpenAI | 91.8 | 128K | $5 / $15 | May 2024 |
| 9 | Qwen 2.5 72B | Alibaba | 90.2 | 128K | $0.40 / $1.20 | Jan 2025 |
| 10 | Mistral Large 2 | Mistral | 89.5 | 128K | $2 / $6 | Nov 2024 |
| 11 | Llama 3.1 405B | Meta | 88.0 | 128K | Open source | Jul 2024 |
| 12 | Command R+ | Cohere | 85.0 | 128K | $2.50 / $10 | Apr 2024 |

Pricing is shown as input/output cost per 1M tokens.
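To make the pricing convention concrete, the sketch below estimates the cost of a single request from the per-1M-token input/output prices. The prices are taken from the table (Claude 4 Opus, $15 in / $75 out); the token counts are hypothetical.

```python
# Sketch: estimating request cost from the per-1M-token prices in the table.
# Prices are (input, output) USD per 1M tokens; token counts are hypothetical.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD for one request, given per-1M-token input/output prices."""
    return (input_tokens / 1_000_000) * input_price + \
           (output_tokens / 1_000_000) * output_price

# Example with Claude 4 Opus pricing from the table ($15 in / $75 out):
# a 20K-token prompt producing a 2K-token response.
print(f"${request_cost(20_000, 2_000, 15.0, 75.0):.2f}")  # $0.45
```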