This table compares foundation model performance across coding, reasoning, math, and multimodal tasks, ranked by overall score in descending order. Scores are composite aggregates of MMLU-Pro, HumanEval, ARC, GPQA, and internal evaluations.
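The exact weighting behind these composites isn't published here, but the underlying arithmetic is a weighted average of per-benchmark scores. Below is a minimal sketch; every weight and score in the snippet is an illustrative assumption, not data from the table:

```python
# Minimal sketch of composite-score aggregation. The weights below are
# illustrative assumptions; the actual weighting used for this table is
# not published.

BENCHMARK_WEIGHTS = {
    "MMLU-Pro": 0.25,   # knowledge / reasoning
    "HumanEval": 0.25,  # coding
    "ARC": 0.20,        # abstract reasoning
    "GPQA": 0.20,       # graduate-level QA
    "internal": 0.10,   # internal evaluations
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (0-100 scale),
    normalized over whichever benchmarks are present."""
    weight = sum(BENCHMARK_WEIGHTS[b] for b in scores)
    return sum(BENCHMARK_WEIGHTS[b] * s for b, s in scores.items()) / weight

# Hypothetical inputs, not real benchmark results:
print(round(composite_score(
    {"MMLU-Pro": 97, "HumanEval": 98, "ARC": 96, "GPQA": 95, "internal": 97}
), 1))  # ~96.7 with these illustrative weights
```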
| # | Model | Provider | Overall Score | Context (tokens) | $/1M tokens (input/output) | Released |
|---|---|---|---|---|---|---|
| 1 | Claude 4 Opus | Anthropic | 96.8 | 200K | $15/$75 | Sep 2025 |
| 2 | Gemini 2.5 Pro | Google | 95.9 | 2.0M | $2.5/$10 | Dec 2025 |
| 3 | GPT-4.5 | OpenAI | 95.4 | 128K | $10/$30 | Aug 2025 |
| 4 | Llama 4 Maverick | Meta | 93.5 | 1.0M | $0.2/$0.6 | Apr 2025 |
| 5 | Grok 3 | xAI | 93.0 | 128K | $3/$15 | Feb 2025 |
| 6 | DeepSeek R1 | DeepSeek | 92.8 | 128K | $0.55/$2.19 | Jan 2025 |
| 7 | Claude 3.5 Sonnet | Anthropic | 92.1 | 200K | $3/$15 | Jun 2024 |
| 8 | GPT-4o | OpenAI | 91.8 | 128K | $5/$15 | May 2024 |
| 9 | Qwen 2.5 72B | Alibaba | 90.2 | 128K | $0.4/$1.2 | Jan 2025 |
| 10 | Mistral Large 2 | Mistral | 89.5 | 128K | $2/$6 | Nov 2024 |
| 11 | Llama 3.1 405B | Meta | 88.0 | 128K | Open Source | Jul 2024 |
| 12 | Command R+ | Cohere | 85.0 | 128K | $2.5/$10 | Apr 2024 |
Pricing is shown as input/output cost per 1M tokens.
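As a quick worked example of that pricing convention, the cost of one request is (input tokens × input rate + output tokens × output rate) / 1M. A minimal sketch, where the `request_cost` helper is ours for illustration, not from any provider SDK:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Dollar cost of one request, given $-per-1M-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 10K-token prompt with a 2K-token completion on Claude 4 Opus ($15/$75):
print(f"${request_cost(10_000, 2_000, 15, 75):.2f}")  # $0.30
```

At Llama 4 Maverick's $0.2/$0.6 rates, the same request works out to roughly $0.0032, which is the kind of gap the price column is meant to surface.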