Best AI for math
Arithmetic, proofs, optimization, and symbolic reasoning.
Math benchmarks are the cleanest signal we have. Answers are verifiable, contamination is detectable, and reasoning-RL fine-tuning produces outsized gains — which is why thinking-style models (o-series, R1, Gemini thinking variants) dominate.
What matters: AIME, MATH Level 5, and GPQA-Diamond. If your application does anything quantitative (finance, logistics, scientific computing), pay the premium for a reasoning model. The gap between a frontier reasoning model and a frontier chat model on hard math problems can be 20+ percentage points; in the rankings below, the task scores for GPT-5 and GPT-4o are 98.1% and 53.3%.
Our math rank weights the math benchmark (2.5×), reasoning (1.5×), and structured output (0.5×). The structured output weight matters for applications that pipe results into downstream systems — a model that gets the right answer but formats it wrong is still broken.
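As a rough illustration of how such a weighting could combine into one number, here is a minimal sketch. Only the three weights come from the text above. The normalization (dividing by the sum of the weights), the 0-100 scale, the `math_rank_score` name, and the example component scores are all assumptions made for illustration, not the actual scoring pipeline.

```python
# Weights as stated in the text; everything else is an illustrative assumption.
WEIGHTS = {"math": 2.5, "reasoning": 1.5, "structured_output": 0.5}


def math_rank_score(scores: dict[str, float]) -> float:
    """Weighted mean of component scores, each assumed to lie in 0-100."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())  # 2.5 + 1.5 + 0.5 = 4.5
    weighted_sum = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted_sum / total_weight


# Hypothetical component scores, not taken from the rankings.
example = {"math": 90.0, "reasoning": 80.0, "structured_output": 70.0}
print(round(math_rank_score(example), 2))  # 84.44
```

With these weights the math benchmark accounts for 2.5 / 4.5, or about 56%, of the total, while structured output contributes about 11%. That share is enough to break ties between models with similar math scores, but not enough to lift a weak one.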
Full rankings
All 13 models, scored for math
| # | Model | Provider | Task score | Input ($/1M tokens) | Output ($/1M tokens) | Context |
|---|---|---|---|---|---|---|
| 01 | GPT-5 | OpenAI | 98.1% | $1.25 | $10.00 | 400K |
| 02 | GPT-5 Mini | OpenAI | 97.8% | $0.250 | $2.00 | 400K |
| 03 | o4 Mini | OpenAI | 97.8% | $1.10 | $4.40 | 200K |
| 04 | o3 | OpenAI | 97.8% | $2.00 | $8.00 | 200K |
| 05 | R1 0528 | DeepSeek | 96.6% | $0.500 | $2.15 | 164K |
| 06 | GPT-5 Nano | OpenAI | 95.2% | $0.050 | $0.400 | 400K |
| 07 | R1 | DeepSeek | 93.1% | $0.700 | $2.50 | 164K |
| 08 | GPT-4.1 Mini | OpenAI | 87.3% | $0.400 | $1.60 | 1.0M |
| 09 | GPT-4.1 | OpenAI | 83.0% | $2.00 | $8.00 | 1.0M |
| 10 | GPT-4.1 Nano | OpenAI | 70.0% | $0.100 | $0.400 | 1.0M |
| 11 | GPT-4o | OpenAI | 53.3% | $2.50 | $10.00 | 128K |
| 12 | GPT-4o-mini | OpenAI | 52.6% | $0.150 | $0.600 | 128K |
| 13 | Llama 3.3 70B Instruct | Meta | 41.6% | $0.100 | $0.320 | 131K |
Pricing — top 5 for math
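To compare the top five on cost, the input and output prices from the rankings table can be converted into an estimated cost per request. The sketch below assumes the listed prices are USD per 1M tokens. The `request_cost` helper and the token counts are hypothetical, chosen only to show the arithmetic.

```python
# (input price, output price) copied from the rankings table,
# assumed to be USD per 1M tokens.
PRICES = {
    "GPT-5": (1.25, 10.00),
    "GPT-5 Mini": (0.25, 2.00),
    "o4 Mini": (1.10, 4.40),
    "o3": (2.00, 8.00),
    "R1 0528": (0.50, 2.15),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request for the given token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: a 2,000-token prompt and an 8,000-token answer.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 8_000):.4f}")
```

Under these assumptions GPT-5 comes to about $0.0825 per request and GPT-5 Mini to about $0.0165. Reasoning models tend to produce long outputs, so the output price usually dominates the total.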