Devstral Medium vs GPT-5 Nano
GPT-5 Nano is the practical pick for most developers and teams: it wins 8 of our 12 benchmarks and costs far less per token. Devstral Medium wins only classification in our suite; it may still appeal if you value its marketed code-generation positioning despite the much higher price.
Pricing at a glance:

- Devstral Medium (Mistral): $0.40/MTok input, $2.00/MTok output
- GPT-5 Nano (OpenAI): $0.05/MTok input, $0.40/MTok output
Benchmark Analysis
Across our 12-test suite, GPT-5 Nano wins 8 metrics, Devstral Medium wins 1, and 3 are ties. Metric by metric (scores are Devstral vs GPT-5 Nano):

- Structured output (JSON/schema): 4 vs 5. GPT-5 Nano wins and is tied for 1st on this task (with 24 other models), so Nano is more reliable for strict schema compliance.
- Strategic analysis (tradeoff math): 2 vs 4. GPT-5 Nano wins (rank 27 of 54), handling nuanced numeric tradeoffs better in our tests.
- Creative problem solving: 2 vs 3. GPT-5 Nano wins, indicating better generation of non-obvious, actionable ideas.
- Tool calling: 3 vs 4. GPT-5 Nano wins (rank 18 of 54), selecting functions and arguments more accurately in our tool-calling scenarios.
- Long context (30K+ tokens): 4 vs 5. GPT-5 Nano wins and is tied for 1st on long-context retrieval, so it performs better on long-document tasks.
- Safety calibration: 1 vs 4. GPT-5 Nano wins decisively (rank 6 of 55), refusing harmful requests more reliably in our tests.
- Persona consistency: 3 vs 4. GPT-5 Nano wins, maintaining character better across our prompts.
- Multilingual: 4 vs 5. GPT-5 Nano wins and is tied for 1st, producing higher-quality non-English outputs in our evaluation.
- Classification: 4 vs 3. Devstral Medium wins and is tied for 1st (alongside 29 other models), making it a strong router/categorizer in our suite.
- Constrained rewriting, faithfulness, agentic planning: ties (3, 4, and 4 respectively for both models); neither model dominates here.

External benchmarks (supplementary): GPT-5 Nano scores 95.2% on MATH Level 5 and 81.1% on AIME 2025 (Epoch AI), which supports its strength on advanced math tasks. In short: Nano is stronger for structured outputs, long context, tool workflows, multilingual output, and safety; Devstral's one clear win is classification accuracy in our tests.
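The win/tie tally can be checked directly from the per-metric scores above. A minimal sketch (scores are the ones quoted; the dictionary layout is just for illustration):

```python
# Tally the per-metric judge scores quoted above
# (1-5 scale; A = Devstral Medium, B = GPT-5 Nano).
SCORES = {
    "structured_output":     (4, 5),
    "strategic_analysis":    (2, 4),
    "creative_problem":      (2, 3),
    "tool_calling":          (3, 4),
    "long_context":          (4, 5),
    "safety_calibration":    (1, 4),
    "persona_consistency":   (3, 4),
    "multilingual":          (4, 5),
    "classification":        (4, 3),
    "constrained_rewriting": (3, 3),
    "faithfulness":          (4, 4),
    "agentic_planning":      (4, 4),
}

wins_devstral = sum(a > b for a, b in SCORES.values())
wins_nano     = sum(b > a for a, b in SCORES.values())
ties          = sum(a == b for a, b in SCORES.values())
print(f"Devstral: {wins_devstral}, Nano: {wins_nano}, ties: {ties}")
# -> Devstral: 1, Nano: 8, ties: 3
```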
Pricing Analysis
Raw per-MTok prices: Devstral Medium charges $0.40 input / $2.00 output; GPT-5 Nano charges $0.05 input / $0.40 output. At realistic volumes, assuming a 50/50 split of input and output tokens:

- 1M total tokens (500K input + 500K output): Devstral = $1.20; GPT-5 Nano = $0.23.
- 10M total tokens: Devstral = $12.00; GPT-5 Nano = $2.25.
- 100M total tokens: Devstral = $120.00; GPT-5 Nano = $22.50.

Scaled to a billion tokens, that's $400 per billion input tokens and $2,000 per billion output tokens for Devstral, versus $50 and $400 for GPT-5 Nano. Teams doing high-volume inference (10M+ tokens/month) will see roughly five-fold savings with GPT-5 Nano at this mix (up to eight-fold on input-heavy workloads) and should care about the gap; small-scale prototypes may tolerate Devstral's premium if they value its marketed strengths, but expect a much higher monthly bill with Devstral Medium.
Real-World Cost Comparison
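To make the comparison concrete, here is a minimal sketch of the cost arithmetic above, assuming the listed per-MTok prices and a 50/50 input/output token split (the `run_cost` helper is ours for illustration, not a vendor SDK call):

```python
# $ per million tokens: (input, output), from the listed prices.
PRICES = {
    "Devstral Medium": (0.40, 2.00),
    "GPT-5 Nano":      (0.05, 0.40),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workload at the listed per-MTok prices."""
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

for total in (1_000_000, 10_000_000, 100_000_000):
    half = total // 2  # 50/50 input/output split
    devstral = run_cost("Devstral Medium", half, half)
    nano = run_cost("GPT-5 Nano", half, half)
    print(f"{total:>11,} tokens: ${devstral:,.2f} vs ${nano:,.2f} "
          f"({devstral / nano:.1f}x)")
# 1M: $1.20 vs ~$0.23; 10M: $12.00 vs $2.25; 100M: $120.00 vs $22.50
```

At this mix the gap is a constant ~5.3x; input-heavy workloads push it toward the 8x input-price ratio.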
Bottom Line
Choose Devstral Medium if:

- Your primary need is top-tier classification/routing (Devstral scores 4/5 and is tied for 1st on classification in our tests), and you can absorb much higher costs ($0.40 in / $2.00 out per MTok).

Choose GPT-5 Nano if:

- You need reliable structured outputs, long-context understanding, tool calling, multilingual performance, or stronger safety (GPT-5 Nano wins those categories in our 12-test suite), and you want dramatically lower token costs ($0.05 in / $0.40 out per MTok).

For high-volume production (10M+ tokens/month), GPT-5 Nano is the cost-effective winner on most real tasks we measured.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
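For a sense of what the scoring step looks like, here is a hypothetical sketch of a 1-5 judge call; the rubric wording and the injected `call_llm(prompt) -> str` client are illustrative assumptions, not our actual harness:

```python
import re

# Hypothetical rubric prompt for a 1-5 LLM judge (illustrative only).
RUBRIC = """You are grading a model response for the task: {task}.
Score it 1-5 (5 = fully correct and complete). Reply with only the number.

Response to grade:
{response}"""

def judge_score(call_llm, task: str, response: str) -> int:
    """Ask a judge model for a 1-5 score and parse the first digit."""
    reply = call_llm(RUBRIC.format(task=task, response=response))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())
```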