openai
OpenAI: o3 Mini
o3 Mini is a reasoning-optimized model from OpenAI, priced at $1.10/M input tokens and $4.40/M output tokens. It ranked 48th overall out of 52 tested models, near the bottom of our dataset. The low ranking is partly explained by documented quirks: o3 Mini returns empty responses on our constrained rewriting, classification, and persona consistency tests due to reasoning token overhead, and it also failed structured output in our testing. These missing scores drag its average down significantly. Its math performance stands out: 96.5 on MATH Level 5 (rank 6 of 14, per Epoch AI data). Teams should weigh this context when reading the overall rank: o3 Mini is specialized for reasoning and math tasks, not a general-purpose model in our benchmark profile.
Performance
In our 12-test benchmark suite, o3 Mini's functional strengths are limited by its quirks. Long context (5/5, tied for 1st with 36 others out of 55 tested) and faithfulness (5/5, tied for 1st with 32 others out of 55) are its top internal scores. Tool calling scored 4/5 (rank 18 of 54) and multilingual 4/5 (rank 36 of 55). Strategic analysis and agentic planning dropped to 2/5 and 3/5 respectively, lower than most models in our suite. Structured output scored 2/5 (rank 54 of 54, dead last in our dataset with no ties). On external benchmarks from Epoch AI, o3 Mini scored 96.5 on MATH Level 5 (rank 6 of 14) and 76.9 on AIME 2025 (rank 15 of 23). The pattern is clear: strong on math reasoning, weak or missing on general capabilities. The missing scores (constrained rewriting, classification, persona consistency) stem from documented quirks: the model returned empty responses on those tests.
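Because the empty-response quirk only surfaces at call time, callers may want a guard around it. A minimal sketch, assuming an OpenAI-compatible client; the fallback model choice and the `create_fn` indirection are illustrative, not part of our harness:

```python
def complete_with_fallback(create_fn, messages,
                           primary="openai/o3-mini",
                           fallback="openai/o4-mini"):
    """Call the primary model; if the reply content comes back empty
    (the reasoning-token quirk described above), retry once on a
    fallback model. `create_fn` is any OpenAI-style
    chat.completions.create callable."""
    resp = create_fn(model=primary, messages=messages)
    content = resp.choices[0].message.content
    if content and content.strip():
        return primary, content
    resp = create_fn(model=fallback, messages=messages)
    return fallback, resp.choices[0].message.content
```

In practice you would pass `client.chat.completions.create` from an OpenAI SDK client as `create_fn`.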
Pricing
o3 Mini costs $1.10 per million input tokens and $4.40 per million output tokens, identical pricing to o4 Mini. At 1M input + 500K output, total cost is about $3.30. At 10M input / 5M output per month, expect about $33/month; at 100M input / 50M output, roughly $330/month. Given the reasoning-model quirks (empty responses on several test dimensions, reasoning token overhead), the effective cost per successful output varies by task type. o4 Mini is identically priced at $4.40/M output (rank 15, more capable overall) but has significantly fewer operational limitations. o3 Mini's value is concentrated in math reasoning tasks, where its per-token cost is low relative to its MATH Level 5 performance.
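The arithmetic above is easy to reproduce. A quick sketch using the listed rates (the function name and defaults are ours, for illustration):

```python
def o3_mini_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float = 1.10,
                 output_per_m: float = 4.40) -> float:
    """Estimate spend in USD from token volumes at the listed
    $1.10/M input and $4.40/M output rates."""
    return (input_tokens / 1e6) * input_per_m + (output_tokens / 1e6) * output_per_m

# Matches the figures above: 1M in + 500K out is about $3.30,
# 10M/5M about $33, and 100M/50M about $330.
```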
Benchmark Scores
External Benchmarks
Pricing
Input: $1.10/MTok
Output: $4.40/MTok
Real-World Costs
Pricing vs Performance: output cost per million tokens (log scale) vs average score across our 12 internal benchmarks
Try It
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="openai/o3-mini",
    messages=[
        {"role": "user", "content": "Hello, OpenAI: o3 Mini!"}
    ],
)

print(response.choices[0].message.content)

Recommendation
o3 Mini serves a narrow but real use case: math-intensive applications where per-token cost matters more than general capability breadth. If your workflow centers on mathematical reasoning, symbolic problem solving, or science applications, its MATH Level 5 score of 96.5 (rank 6 of 14, per Epoch AI data) at $4.40/M output is competitive. Who should look elsewhere: for virtually any general-purpose application (classification, document summarization, constrained writing, persona-consistency tasks), o3 Mini's documented failures on those dimensions in our testing make it unsuitable. At the same price point, o4 Mini ($4.40/M output) ranked 15th overall with far broader capability coverage and an even higher MATH Level 5 score of 97.8. GPT-5.4 Mini ($4.50/M output, rank 10) offers similar pricing with no quirks. o3 Mini is best understood as a legacy reasoning model with a specific strength profile, not a general-purpose choice.
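One way to act on this recommendation is a simple task-type router. A sketch with illustrative task labels (the category names are assumptions layered on the analysis above, not part of our benchmark):

```python
# Hypothetical task categories where o3 Mini's math strength applies.
MATH_HEAVY = {"math", "symbolic", "science"}

def pick_model(task_type: str) -> str:
    """Route math-heavy work to o3 Mini and everything else to the
    identically priced but broader o4 Mini, per the recommendation."""
    return "openai/o3-mini" if task_type in MATH_HEAVY else "openai/o4-mini"
```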
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
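As a toy illustration of the 1–5 judging step (not our actual harness), a score can be pulled from a judge model's free-text verdict; the reply format shown is a hypothetical example:

```python
import re

def parse_judge_score(judge_reply: str):
    """Extract the first standalone 1-5 digit from a judge's reply;
    returns None when no score is present (e.g. an empty response,
    which is how a quirk leaves a gap in the averages)."""
    m = re.search(r"\b([1-5])\b", judge_reply)
    return int(m.group(1)) if m else None
```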