
OpenAI: o3 Mini

o3 Mini is a reasoning-optimized model from OpenAI, priced at $1.10/M input and $4.40/M output tokens. It ranked 48th overall out of 52 tested models, near the bottom of our dataset. The low ranking is partly explained by documented quirks: o3 Mini returns empty responses on our constrained rewriting, classification, and persona consistency tests due to reasoning token overhead, and it also fails structured output in our testing. These missing scores drag its average down significantly. Its math performance stands out: 96.5 on MATH Level 5 (rank 6 of 14, per Epoch AI data). Teams should weigh this context when reading the overall rank: o3 Mini is specialized for reasoning and math tasks, not a general-purpose model in our benchmark profile.

Performance

In our 12-test benchmark suite, o3 Mini's functional strengths are limited by its quirks. Long context (5/5, tied for 1st with 36 others out of 55 tested) and faithfulness (5/5, tied for 1st with 32 others out of 55) are its top internal benchmark scores. Tool calling scored 4/5 (rank 18 of 54) and multilingual 4/5 (rank 36 of 55). Strategic analysis and agentic planning dropped to 2/5 and 3/5 respectively, lower than most models in our suite. Structured output scored 2/5 (rank 54 of 54, dead last in our dataset). On external benchmarks from Epoch AI, o3 Mini scored 96.5 on MATH Level 5 (rank 6 of 14) and 76.9 on AIME 2025 (rank 15 of 23). The pattern is clear: strong on math reasoning, weak or missing on general capabilities. The missing scores (constrained rewriting, classification, persona consistency) reflect documented quirks: the model returned empty responses on those tests.

Pricing

o3 Mini costs $1.10 per million input tokens and $4.40 per million output tokens, identical pricing to o4 Mini. At 1M input + 500K output, total cost is about $3.30. At 10M input / 5M output per month, expect $33/month. At 100M input / 50M output, roughly $330/month. Given the reasoning model quirks (empty responses on several test dimensions, reasoning token overhead), the effective cost per successful output varies by task type. Compare to o4 Mini ($4.40/M output, rank 15, more capable overall): both are identically priced, but o4 Mini has significantly fewer operational limitations. The value of o3 Mini is concentrated in math reasoning tasks, where its per-token cost is low relative to its MATH Level 5 performance.
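The arithmetic above can be sketched in a few lines (rates taken from the pricing table; `monthly_cost` is a hypothetical helper for illustration, not part of any SDK):

```python
# o3 Mini list prices in USD per million tokens.
INPUT_PER_MTOK = 1.10
OUTPUT_PER_MTOK = 4.40

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for a given token volume at o3 Mini rates."""
    return (input_tokens * INPUT_PER_MTOK
            + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

print(f"${monthly_cost(1_000_000, 500_000):.2f}")      # $3.30
print(f"${monthly_cost(10_000_000, 5_000_000):.2f}")   # $33.00
print(f"${monthly_cost(100_000_000, 50_000_000):.2f}") # $330.00
```

Note that this prices raw tokens only; for a reasoning model, hidden reasoning tokens are billed as output, so cost per visible answer can run higher.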


Overall
3.33/5 (Usable)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 4/5
Agentic Planning: 3/5
Structured Output: 2/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: 96.5%
AIME 2025: 76.9%

Pricing

Input: $1.10/MTok
Output: $4.40/MTok
Context Window: 200K


Real-World Costs

Chat response: $0.0024
Blog post: $0.0094
Document batch: $0.242
Pipeline run: $2.42

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# Route requests through OpenRouter's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

# o3 Mini is addressed by its OpenRouter model slug.
response = client.chat.completions.create(
    model="openai/o3-mini",
    messages=[
        {"role": "user", "content": "Hello, OpenAI: o3 Mini!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

o3 Mini serves a narrow but real use case: math-intensive applications where per-token cost matters more than general capability breadth. If your workflow centers on mathematical reasoning, symbolic problem solving, or science applications, its MATH Level 5 score of 96.5 (rank 6 of 14, per Epoch AI data) at $4.40/M output is competitive. Who should look elsewhere: for virtually any general-purpose application (classification, document summarization, constrained writing, persona-consistency tasks), o3 Mini's documented failures on those dimensions in our testing make it unsuitable. At the same price point, o4 Mini ($4.40/M output) ranked 15th overall with far broader capability coverage and a higher MATH Level 5 score of 97.8. GPT-5.4 Mini ($4.50/M output, rank 10) offers similar pricing with no quirks. o3 Mini is best understood as a legacy reasoning model with a specific strength profile, not a general-purpose choice.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.