AI Model Benchmarks

Compare foundation model performance across coding, reasoning, math, and multimodal tasks. Scores are composite aggregates from MMLU-Pro, HumanEval, ARC, GPQA, and internal evaluations.
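The exact aggregation behind the Overall Score column is not spelled out here, so the snippet below is only a minimal sketch of one plausible approach: a weighted average across the listed benchmarks. The weights and per-benchmark scores are illustrative assumptions, not the table's actual methodology.

```python
# Sketch of how a composite "Overall Score" could be aggregated.
# The weights and per-benchmark scores below are illustrative placeholders,
# NOT the methodology actually used for this table.

BENCHMARK_WEIGHTS = {
    "MMLU-Pro": 0.30,
    "HumanEval": 0.25,
    "ARC": 0.15,
    "GPQA": 0.20,
    "internal": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (each on a 0-100 scale)."""
    total_weight = sum(BENCHMARK_WEIGHTS[name] for name in scores)
    weighted = sum(BENCHMARK_WEIGHTS[name] * value for name, value in scores.items())
    return round(weighted / total_weight, 1)

# Hypothetical per-benchmark results for a single model.
example = {"MMLU-Pro": 97.0, "HumanEval": 98.0, "ARC": 96.0, "GPQA": 95.0, "internal": 97.0}
print(composite_score(example))  # 96.7 with these placeholder inputs
```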

| # | Model | Provider | Overall Score | Context | $/1M tokens (in/out) | Released |
|---|-------|----------|---------------|---------|----------------------|----------|
| 1 | Claude 4 Opus | Anthropic | 96.8 | 200K | $15 / $75 | Sep 2025 |
| 2 | Gemini 2.5 Pro | Google | 95.9 | 2.0M | $2.50 / $10 | Dec 2025 |
| 3 | GPT-4.5 | OpenAI | 95.4 | 128K | $10 / $30 | Aug 2025 |
| 4 | Llama 4 Maverick | Meta | 93.5 | 1.0M | $0.20 / $0.60 | Apr 2025 |
| 5 | Grok 3 | xAI | 93.0 | 128K | $3 / $15 | Feb 2025 |
| 6 | DeepSeek R1 | DeepSeek | 92.8 | 128K | $0.55 / $2.19 | Jan 2025 |
| 7 | Claude 3.5 Sonnet | Anthropic | 92.1 | 200K | $3 / $15 | Jun 2024 |
| 8 | GPT-4o | OpenAI | 91.8 | 128K | $5 / $15 | May 2024 |
| 9 | Qwen 2.5 72B | Alibaba | 90.2 | 128K | $0.40 / $1.20 | Jan 2025 |
| 10 | Mistral Large 2 | Mistral | 89.5 | 128K | $2 / $6 | Nov 2024 |
| 11 | Llama 3.1 405B | Meta | 88.0 | 128K | Open source | Jul 2024 |
| 12 | Command R+ | Cohere | 85.0 | 128K | $2.50 / $10 | Apr 2024 |

Pricing is shown as input/output cost per 1M tokens.
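To make the pricing convention concrete, the sketch below estimates the cost of a single request from the per-1M-token input/output prices. The prices are taken from the table (Claude 4 Opus, $15 in / $75 out); the token counts are hypothetical.

```python
# Sketch: estimating request cost from the per-1M-token prices in the table.
# Prices are (input, output) USD per 1M tokens; token counts are hypothetical.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD for one request, given per-1M-token input/output prices."""
    return (input_tokens / 1_000_000) * input_price + \
           (output_tokens / 1_000_000) * output_price

# Example with Claude 4 Opus pricing from the table ($15 in / $75 out):
# a 20K-token prompt producing a 2K-token response.
print(f"${request_cost(20_000, 2_000, 15.0, 75.0):.2f}")  # $0.45
```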