Devstral Medium vs GPT-5.4 Nano
GPT-5.4 Nano is the stronger choice for most workloads: in our testing it outscored Devstral Medium on 9 of 12 benchmarks while costing less — $0.20/$1.25 per million input/output tokens versus $0.40/$2.00. Devstral Medium's only benchmark win is classification (4 vs 3), making it a narrow alternative for routing and categorization tasks. For general-purpose use, agentic pipelines, or anything touching structured output, strategic reasoning, or safety, GPT-5.4 Nano is the clearer pick.
Pricing at a glance (per million tokens):
- Devstral Medium (Mistral): $0.40/MTok input, $2.00/MTok output
- GPT-5.4 Nano (OpenAI): $0.20/MTok input, $1.25/MTok output
Benchmark Analysis
Across our 12-test suite, GPT-5.4 Nano wins 9 benchmarks, Devstral Medium wins 1, and they tie on 2.
Where GPT-5.4 Nano leads:
- Structured output (5 vs 4): GPT-5.4 Nano ties for 1st among 54 models; Devstral Medium sits at rank 26. For any pipeline that depends on reliable JSON schema compliance, this gap matters.
- Strategic analysis (5 vs 2): GPT-5.4 Nano ties for 1st among 54 models; Devstral Medium ranks 44th. A 3-point gap on nuanced tradeoff reasoning is significant — Devstral Medium's 2/5 here falls below the median (p50 = 4), meaning it underperforms most tested models on this dimension.
- Constrained rewriting (4 vs 3): GPT-5.4 Nano ranks 6th of 53; Devstral Medium is at 31st.
- Creative problem solving (4 vs 2): GPT-5.4 Nano ranks 9th of 54; Devstral Medium ranks 47th — near the bottom of the field. Devstral Medium's 2/5 is well below the p25 of 3.
- Tool calling (4 vs 3): GPT-5.4 Nano ranks 18th of 54; Devstral Medium ranks 47th. For agentic workflows where function selection and argument accuracy matter, Devstral Medium's score is a liability.
- Long context (5 vs 4): GPT-5.4 Nano ties for 1st among 55 models. Devstral Medium scores 4 but ranks 38th — workable, but not a strength.
- Safety calibration (3 vs 1): GPT-5.4 Nano ranks 10th of 55; Devstral Medium ranks 32nd with a score of 1/5, at the p25 floor of 1. This is the most concerning gap: a 1/5 here suggests Devstral Medium may over-refuse or under-refuse relative to appropriate thresholds.
- Persona consistency (5 vs 3): GPT-5.4 Nano ties for 1st among 53 models; Devstral Medium ranks 45th.
- Multilingual (5 vs 4): GPT-5.4 Nano ties for 1st among 55 models; Devstral Medium ranks 36th.
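The structured-output gap matters in practice because downstream code typically hard-fails on non-conforming JSON. As a minimal, hypothetical illustration (the `label`/`confidence` schema here is invented for a routing pipeline, not taken from either model's docs), a validation step might look like:

```python
import json

# Hypothetical required keys for a routing/classification pipeline.
REQUIRED_KEYS = {"label", "confidence"}

def parse_model_output(raw: str) -> dict:
    """Reject any model reply that is not valid JSON with the expected keys."""
    obj = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return obj

# A compliant reply parses cleanly; anything else raises and
# (in a real pipeline) would trigger a retry or fallback.
result = parse_model_output('{"label": "billing", "confidence": 0.93}')
```

A model that scores higher on structured output trips this kind of guard less often, which translates directly into fewer retries and lower effective cost.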
Where Devstral Medium leads:
- Classification (4 vs 3): Devstral Medium ties for 1st among 53 models — a genuine strength. GPT-5.4 Nano ranks 31st. If categorization and routing are your core use case, Devstral Medium has a real edge here.
Ties:
- Faithfulness (4 vs 4): Both rank 34th of 55 — identical performance on sticking to source material.
- Agentic planning (4 vs 4): Both rank 16th of 54 — evenly matched on goal decomposition.
External benchmark: GPT-5.4 Nano scores 87.8% on AIME 2025 (Epoch AI), ranking 8th of 23 models with that data. Devstral Medium has no AIME 2025 score in our data. The 87.8% result places GPT-5.4 Nano above the p50 of 83.9% for models with that benchmark, confirming solid math reasoning performance.
Pricing Analysis
GPT-5.4 Nano costs $0.20/M input and $1.25/M output. Devstral Medium costs $0.40/M input and $2.00/M output: double the price on input and 60% more on output. At 1M output tokens/month that's $1.25 vs $2.00, a $0.75 difference that barely registers. At 10M output tokens it's $12.50 vs $20.00, a $7.50 gap that starts to matter for tighter budgets. At 100M output tokens/month, a typical volume for a production API product, you're paying $125 vs $200, a $75/month savings with GPT-5.4 Nano. Given that GPT-5.4 Nano also wins on most benchmarks, the cost premium for Devstral Medium is hard to justify except in the specific case where classification accuracy is your primary bottleneck.
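The arithmetic above can be reproduced with a quick sketch using the listed per-million-token rates (model keys here are our own shorthand, not official API identifiers):

```python
# Listed rates in USD per million tokens.
PRICES = {
    "gpt-5.4-nano":    {"input": 0.20, "output": 1.25},
    "devstral-medium": {"input": 0.40, "output": 2.00},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Monthly cost in USD, given traffic in millions of tokens."""
    p = PRICES[model]
    return input_mtok * p["input"] + output_mtok * p["output"]

# The 100M-output-tokens/month example from above (output tokens only):
print(monthly_cost("gpt-5.4-nano", 0, 100))     # 125.0
print(monthly_cost("devstral-medium", 0, 100))  # 200.0
```

Plugging in your own input/output mix is worth doing before choosing: output tokens dominate most generation workloads, so the 60% output-price gap usually matters more than the 2x input-price gap.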
Bottom Line
Choose GPT-5.4 Nano if you need a capable general-purpose model for production use: it wins on structured output, strategic analysis, tool calling, creative problem solving, safety calibration, persona consistency, long context, multilingual, and constrained rewriting, all at a lower price. It also accepts image and file inputs alongside text, while Devstral Medium accepts text input only. The AIME 2025 score of 87.8% (Epoch AI, rank 8 of 23) adds confidence for math-adjacent tasks.
Choose Devstral Medium if your workload is primarily document or content classification and routing, where its 4/5 (tied 1st of 53) outpaces GPT-5.4 Nano's 3/5 (rank 31). That's the only benchmark where Devstral Medium has a measurable advantage, and the classification edge only justifies the higher cost ($2.00/M output vs $1.25/M) if classification accuracy directly drives business outcomes at scale.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.