openai

GPT-4o-mini

GPT-4o-mini is OpenAI's small, cost-efficient model accepting text, image, and file inputs. At $0.15 input / $0.60 output per million tokens, it is priced at the low end of the tested market — identical to Mistral Small 4's pricing. In our 12-benchmark suite, it ranked 46th out of 52 tested models with an average score of 3.42. It delivers above-median results on safety calibration and classification, but falls well below median on creative problem solving, strategic analysis, and faithfulness. External math benchmarks confirm limited reasoning capability: 52.6 on MATH Level 5 (rank 13 of 14) and 6.9 on AIME 2025 (rank 21 of 23).

Performance

GPT-4o-mini's strongest benchmark in our testing is safety calibration (4/5, rank 6 of 55 — among the top performers in the entire suite). Classification also scored 4/5 (tied for 1st with 29 other models out of 53 tested). Tool calling, multilingual, long context, persona consistency, and structured output all scored 4/5 at mid-tier rankings. Notable weaknesses: creative problem solving scored 2/5 (rank 47 of 54), strategic analysis scored 2/5 (rank 44 of 54), and faithfulness scored 3/5 (rank 52 of 55 — near last place). On external benchmarks, it scored 52.6 on MATH Level 5 (rank 13 of 14) and 6.9 on AIME 2025 (rank 21 of 23), both Epoch AI benchmarks — placing it near the bottom among models with math scores. Overall rank: 46 out of 52 tested models.

Pricing

GPT-4o-mini costs $0.15 per million input tokens and $0.60 per million output tokens. At 1 million output tokens per month, output cost comes to $0.60; at 10 million, $6.00 (input tokens are billed separately at $0.15/MTok). Within the OpenAI lineup, it is the lowest-priced option we tested — substantially cheaper than GPT-4o ($10/MTok output, avg 3.50). For high-volume inference tasks like classification, extraction, or routing, the cost is highly competitive. The 128,000-token context window and 16,384-token maximum output accommodate most document-length tasks.
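As a quick sanity check on these figures, a small helper (illustrative; the function name is ours, rates taken from the pricing above) reproduces the quoted monthly costs:

```python
def monthly_cost(input_tokens, output_tokens,
                 input_rate=0.15, output_rate=0.60):
    """Estimate GPT-4o-mini spend in USD from per-million-token rates."""
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

print(monthly_cost(0, 1_000_000))   # output-only cost: 0.6
print(monthly_cost(0, 10_000_000))  # output-only cost: 6.0
```

Swap in the rates for any model in the table to compare projected spend at your actual traffic volume.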


Overall
3.42/5 (Usable)

Benchmark Scores

Faithfulness: 3/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 3/5
Structured Output: 4/5
Safety Calibration: 4/5
Strategic Analysis: 2/5
Persona Consistency: 4/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: 52.6%
AIME 2025: 6.9%

Pricing

Input: $0.150/MTok
Output: $0.600/MTok
Context Window: 128K


Real-World Costs

Chat response: <$0.001
Blog post: $0.0013
Document batch: $0.033
Pipeline run: $0.330

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# GPT-4o-mini accessed through the OpenAI-compatible OpenRouter endpoint
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Hello, GPT-4o-mini!"}
    ],
)

# print the assistant's reply text
print(response.choices[0].message.content)

Recommendation

GPT-4o-mini is the right choice for high-volume, cost-sensitive classification and routing pipelines where safety calibration matters. At $0.60/MTok output with 4/5 on classification and the 6th-best safety calibration score in our suite, it is well-suited for content moderation, input triage, and structured extraction at scale. It is not recommended for reasoning-intensive tasks: MATH Level 5 (52.6, rank 13 of 14) and AIME 2025 (6.9, rank 21 of 23) place it near the bottom of math-capable models, and its near-last faithfulness ranking (52 of 55) rules it out for strict RAG applications. At the same $0.60/MTok output price, Mistral Small 4 (avg 3.83) outscores it on our benchmarks with stronger creative and reasoning performance.
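For the triage use case above, a minimal sketch shows how a classification prompt might be assembled before being sent through the client from the Try It snippet. The label set and the build_triage_messages helper are illustrative assumptions, not part of any OpenAI API:

```python
# Illustrative label set for a support-ticket router
LABELS = ["billing", "technical", "abuse", "other"]

def build_triage_messages(ticket_text):
    """Build a chat-message list that asks for exactly one label back."""
    system = ("Classify the support ticket into exactly one of: "
              + ", ".join(LABELS) + ". Reply with the label only.")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": ticket_text},
    ]

# Usage with the client from the Try It snippet (network call, not run here):
# response = client.chat.completions.create(
#     model="openai/gpt-4o-mini",
#     messages=build_triage_messages("My invoice was charged twice"),
#     temperature=0,
# )
# label = response.choices[0].message.content.strip().lower()
```

Pinning temperature to 0 and constraining the reply to a known label set keeps per-call output under a handful of tokens, which is where the $0.60/MTok output price pays off at volume.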

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions