Devstral Medium vs GPT-5.4 Nano

GPT-5.4 Nano is the stronger choice for most workloads: in our testing it outscored Devstral Medium on 9 of 12 benchmarks while costing less — $0.20/$1.25 per million input/output tokens versus $0.40/$2.00. Devstral Medium's only benchmark win is classification (4 vs 3), making it a narrow alternative for routing and categorization tasks. For general-purpose use, agentic pipelines, or anything touching structured output, strategic reasoning, or safety, GPT-5.4 Nano is the clearer pick.

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 131K

GPT-5.4 Nano (OpenAI)

Overall: 4.25/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 3/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: 87.8%

Pricing

Input: $0.200/MTok
Output: $1.25/MTok
Context Window: 400K

Benchmark Analysis

Across our 12-test suite, GPT-5.4 Nano wins 9 benchmarks, Devstral Medium wins 1, and they tie on 2.

Where GPT-5.4 Nano leads:

  • Structured output (5 vs 4): GPT-5.4 Nano ties for 1st among 54 models; Devstral Medium sits at rank 26. For any pipeline that depends on reliable JSON schema compliance, this gap matters (see the validation sketch after this list).
  • Strategic analysis (5 vs 2): GPT-5.4 Nano ties for 1st among 54 models; Devstral Medium ranks 44th. A 3-point gap on nuanced tradeoff reasoning is significant — Devstral Medium's 2/5 here falls below the median (p50 = 4), meaning it underperforms most tested models on this dimension.
  • Constrained rewriting (4 vs 3): GPT-5.4 Nano ranks 6th of 53; Devstral Medium is at 31st.
  • Creative problem solving (4 vs 2): GPT-5.4 Nano ranks 9th of 54; Devstral Medium ranks 47th — near the bottom of the field. Devstral Medium's 2/5 is well below the p25 of 3.
  • Tool calling (4 vs 3): GPT-5.4 Nano ranks 18th of 54; Devstral Medium ranks 47th. For agentic workflows where function selection and argument accuracy matter, Devstral Medium's score is a liability.
  • Long context (5 vs 4): GPT-5.4 Nano ties for 1st among 55 models. Devstral Medium scores 4 but ranks 38th — workable, but not a strength.
  • Safety calibration (3 vs 1): GPT-5.4 Nano ranks 10th of 55; Devstral Medium ranks 32nd of 55 with a score of 1/5, sitting at the field's p25 floor of 1. This is the most concerning gap: Devstral Medium's safety calibration score suggests it may over-refuse or under-refuse relative to appropriate thresholds.
  • Persona consistency (5 vs 3): GPT-5.4 Nano ties for 1st among 53 models; Devstral Medium ranks 45th.
  • Multilingual (5 vs 4): GPT-5.4 Nano ties for 1st among 55 models; Devstral Medium ranks 36th.
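
To make the structured-output stakes concrete, here is a minimal sketch of the kind of schema-compliance gate such a pipeline depends on. It is illustrative only: the schema, the raw_response string, and the choice of the jsonschema package are our assumptions, not part of the benchmark harness.

```python
# Minimal schema-compliance gate for model output (illustrative).
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical schema for a routing-style response.
SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["category", "confidence"],
    "additionalProperties": False,
}

raw_response = '{"category": "billing", "confidence": 0.92}'  # stand-in output

try:
    payload = json.loads(raw_response)  # rejects malformed JSON
    validate(payload, SCHEMA)           # rejects schema drift
except (json.JSONDecodeError, ValidationError) as err:
    # A lower structured-output score roughly tracks how often this fires.
    print(f"rejected: {err}")
else:
    print("accepted:", payload)
```

In practice, the higher a model's structured-output score, the less often the rejection branch fires and the less retry machinery you need around it.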

Where Devstral Medium leads:

  • Classification (4 vs 3): Devstral Medium ties for 1st among 53 models — a genuine strength. GPT-5.4 Nano ranks 31st. If categorization and routing are your core use case, Devstral Medium has a real edge here (a minimal routing sketch follows below).
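
As a rough illustration of the routing workload where Devstral Medium leads, here is a hypothetical sketch of a label-constrained classification call. call_model is a placeholder for whatever client you use, and the label set is invented for the example.

```python
# Hypothetical routing sketch; call_model is a stand-in for your client.
from typing import Callable

LABELS = ["billing", "technical", "sales", "other"]

def route(ticket_text: str, call_model: Callable[[str], str]) -> str:
    """Classify a ticket into one of LABELS, falling back to 'other'."""
    prompt = (
        "Classify the support ticket into exactly one of: "
        + ", ".join(LABELS)
        + ". Reply with the label only.\n\nTicket: "
        + ticket_text
    )
    label = call_model(prompt).strip().lower()
    # Never trust a free-form reply: enforce the closed label set.
    return label if label in LABELS else "other"
```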

Ties:

  • Faithfulness (4 vs 4): Both rank 34th of 55 — identical performance on sticking to source material.
  • Agentic planning (4 vs 4): Both rank 16th of 54 — evenly matched on goal decomposition.

External benchmark: GPT-5.4 Nano scores 87.8% on AIME 2025 (Epoch AI), ranking 8th of 23 models with that data. Devstral Medium has no AIME 2025 score in our data. The 87.8% result places GPT-5.4 Nano above the p50 of 83.9% for models with that benchmark, confirming solid math reasoning performance.

Benchmark                  Devstral Medium   GPT-5.4 Nano
Faithfulness               4/5               4/5
Long Context               4/5               5/5
Multilingual               4/5               5/5
Tool Calling               3/5               4/5
Classification             4/5               3/5
Agentic Planning           4/5               4/5
Structured Output          4/5               5/5
Safety Calibration         1/5               3/5
Strategic Analysis         2/5               5/5
Persona Consistency        3/5               5/5
Constrained Rewriting      3/5               4/5
Creative Problem Solving   2/5               4/5
Summary                    1 win             9 wins

Pricing Analysis

GPT-5.4 Nano costs $0.20/M input and $1.25/M output. Devstral Medium costs $0.40/M input and $2.00/M output: twice the price on input and 60% more on output. At 1M output tokens/month that's $1.25 vs $2.00, a $0.75 difference that barely registers. At 10M output tokens it's $12.50 vs $20.00, a $7.50 gap that starts to matter for tighter budgets. At 100M output tokens/month, typical for a production API product, you're paying $125 vs $200, a $75/month savings with GPT-5.4 Nano. Given that GPT-5.4 Nano also wins on most benchmarks, the cost premium for Devstral Medium is hard to justify except in the specific case where classification accuracy is your primary bottleneck.
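
To plug in your own volumes, this short sketch reproduces the arithmetic above from the per-million-token prices listed on this page.

```python
# Monthly cost from the per-MTok prices listed on this page.
PRICES = {  # USD per million tokens
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "GPT-5.4 Nano": {"input": 0.20, "output": 1.25},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    p = PRICES[model]
    return p["input"] * input_mtok + p["output"] * output_mtok

# Output-token-only comparison at the volumes discussed above.
for mtok in (1, 10, 100):
    nano = monthly_cost("GPT-5.4 Nano", 0, mtok)
    dev = monthly_cost("Devstral Medium", 0, mtok)
    print(f"{mtok:>3}M output tokens/month: ${nano:,.2f} vs ${dev:,.2f} "
          f"(GPT-5.4 Nano saves ${dev - nano:,.2f})")
```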

Real-World Cost Comparison

Task             Devstral Medium   GPT-5.4 Nano
Chat response    $0.0011           <$0.001
Blog post        $0.0042           $0.0026
Document batch   $0.108            $0.067
Pipeline run     $1.08             $0.665

Bottom Line

Choose GPT-5.4 Nano if you need a capable general-purpose model for production use: it wins on structured output, strategic analysis, tool calling, creative problem solving, safety calibration, persona consistency, long context, multilingual, and constrained rewriting, all at a lower price. It also accepts image and file inputs alongside text, while Devstral Medium is text-only in our data. The AIME 2025 score of 87.8% (Epoch AI, rank 8 of 23) adds confidence for math-adjacent tasks.

Choose Devstral Medium if your workload is primarily document or content classification and routing, where its 4/5 (tied 1st of 53) outpaces GPT-5.4 Nano's 3/5 (rank 31). That's the only benchmark where Devstral Medium has a measurable advantage, and the classification edge only justifies the higher cost ($2.00/M output vs $1.25/M) if classification accuracy directly drives business outcomes at scale.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
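
As a rough sketch of what that judging step can look like (the rubric wording and judge_model below are placeholders, not our actual harness):

```python
# Hypothetical 1-5 LLM-judge scoring step; judge_model is a stand-in.
from typing import Callable

def judge_score(task: str, answer: str,
                judge_model: Callable[[str], str]) -> int:
    rubric = (
        "Score the answer from 1 (fails the task) to 5 (fully correct and "
        "well executed). Reply with a single digit.\n\n"
        f"Task: {task}\nAnswer: {answer}"
    )
    reply = judge_model(rubric).strip()
    score = int(reply[0]) if reply[:1].isdigit() else 1  # default to worst
    return min(max(score, 1), 5)  # clamp to the 1-5 scale
```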

Frequently Asked Questions