Devstral Medium vs GPT-5.4
For most production use cases that prioritize capability, GPT-5.4 is the better pick—it wins 11 of 12 tests in our suite and posts top ranks on long-context, faithfulness, and agentic planning. Devstral Medium is the value choice: it wins only classification in our tests but costs a small fraction of GPT-5.4, making it attractive for high-volume or budget-sensitive deployments.
Mistral
Devstral Medium
Benchmark Scores
External Benchmarks
Pricing
Input
$0.40/MTok
Output
$2.00/MTok
modelpicker.net
OpenAI
GPT-5.4
Pricing
Input
$2.50/MTok
Output
$15.00/MTok
Benchmark Analysis
Summary: GPT-5.4 wins 11 categories in our 12-test suite; Devstral Medium wins 1 (classification). Detailed walk-through (score: Devstral → GPT-5.4):
- Structured output: 4 → 5 — GPT-5.4 wins and is tied for 1st on structured_output (rank: tied for 1st of 54). This means GPT-5.4 is more reliable for strict JSON/schema compliance.
- Strategic analysis: 2 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 54). Expect stronger nuanced tradeoff reasoning and numeric planning from GPT-5.4.
- Constrained rewriting: 3 → 4 — GPT-5.4 wins (rank 6 of 53). GPT-5.4 better preserves content while compressing to tight character limits.
- Creative problem solving: 2 → 4 — GPT-5.4 wins (rank 9 of 54). GPT-5.4 generates more feasible, non-obvious ideas in our tests.
- Tool calling: 3 → 4 — GPT-5.4 wins (Devstral rank 47 of 54; GPT-5.4 rank 18 of 54). GPT-5.4 is more accurate at selecting functions, arguments, and sequencing calls.
- Faithfulness: 4 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 55). GPT-5.4 better resists hallucination and sticks to sources in our testing.
- Long context: 4 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 55). Practically, GPT-5.4 performs better on retrieval and reasoning across 30K+ token contexts.
- Safety calibration: 1 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 55). GPT-5.4 consistently refuses harmful prompts while permitting legitimate ones; Devstral underperforms here.
- Persona consistency: 3 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 53). GPT-5.4 better maintains character and resists prompt injection.
- Agentic planning: 4 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 54). GPT-5.4 decomposes goals and recovers from failure more robustly.
- Multilingual: 4 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 55). GPT-5.4 produces higher-quality non-English outputs in our tests.
- Classification: 4 → 3 — Devstral Medium wins (Devstral tied for 1st with 29 others; GPT-5.4 rank 31 of 53). Devstral matches or beats GPT-5.4 on basic routing/categorization tasks in our suite.
External benchmarks (supplementary): On SWE-bench Verified (Epoch AI), GPT-5.4 scores 76.9% (rank 2 of 12); on AIME 2025 (Epoch AI), it scores 95.3% (rank 3 of 23). These external results corroborate GPT-5.4's strength on coding and competition-level math. Devstral Medium has no external benchmark scores available. Overall interpretation: GPT-5.4 delivers higher capability across practically every evaluated dimension (especially safety, long-context, faithfulness, and agentic planning); Devstral's one clear win is classification, plus a much lower price point.
Pricing Analysis
Devstral Medium input/output: $0.40 / $2.00 per MTok. GPT-5.4 input/output: $2.50 / $15.00 per MTok. Assuming a 50/50 split of input vs. output tokens, blended costs are: 1M tokens → Devstral $1.20 vs GPT-5.4 $8.75; 100M → Devstral $120 vs GPT-5.4 $875; 1B → Devstral $1,200 vs GPT-5.4 $8,750 (Devstral saves $7,550). The gap matters for high-volume apps, startups, analytics pipelines, and any product where tokens scale into the billions: under the 50/50 assumption, Devstral cuts recurring inference spend by roughly 7.3x. Teams prioritizing top-tier safety, long-context reasoning, or third-party benchmark excellence should budget for GPT-5.4's higher rates.
Real-World Cost Comparison
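The blended-cost figures above reduce to a one-line calculation. A minimal sketch, using only the list prices from the tables above and the 50/50 input/output assumption (the `input_share` parameter lets you test other mixes):

```python
# Blended cost estimate assuming a configurable split of input vs. output
# tokens. Prices are USD per million tokens, from the pricing tables above.
PRICES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "GPT-5.4": {"input": 2.50, "output": 15.00},
}

def blended_cost(model: str, total_tokens: float, input_share: float = 0.5) -> float:
    """Estimated spend in USD for `total_tokens` tokens at the given input share."""
    p = PRICES[model]
    millions = total_tokens / 1_000_000
    return millions * (input_share * p["input"] + (1 - input_share) * p["output"])

for volume in (1e6, 1e8, 1e9):
    d = blended_cost("Devstral Medium", volume)
    g = blended_cost("GPT-5.4", volume)
    print(f"{volume:>15,.0f} tokens: Devstral ${d:>9,.2f} vs GPT-5.4 ${g:>9,.2f} ({g / d:.1f}x)")
```

At a 50/50 mix this reproduces the roughly 7.3x gap; a workload that is mostly input (e.g. retrieval-heavy summarization) narrows the per-token prices but the ratio stays in the 6-7x range.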
Bottom Line
Choose Devstral Medium if: you need a dramatically cheaper inference option ($0.40 input / $2.00 output per MTok), you run very high token volumes, or your primary tasks are high-throughput classification and cost-sensitive pipelines. Choose GPT-5.4 if: you require best-in-class long-context reasoning, faithfulness, safety calibration, tool calling, multilingual output, or top results on third-party coding/math benchmarks (SWE-bench Verified 76.9% and AIME 2025 95.3%, per Epoch AI). If budget is tight but you need some GPT-5.4 capabilities, test a hybrid approach: Devstral for bulk classification, GPT-5.4 for complex planning or safety-sensitive flows.
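One way to prototype the hybrid approach is a simple task router that sends cheap, high-volume classification calls to Devstral Medium and escalates capability-sensitive requests to GPT-5.4. A minimal sketch; the model identifiers, task labels, and `call_model` stub are illustrative assumptions, not a real provider API:

```python
# Illustrative task router for the hybrid setup: bulk work goes to the
# cheap model, capability-sensitive work goes to the premium model.
CHEAP_MODEL = "devstral-medium"   # assumed model identifier
PREMIUM_MODEL = "gpt-5.4"         # assumed model identifier

# Task types where the benchmark results above favor the premium model.
PREMIUM_TASKS = {"planning", "long_context", "safety_review", "tool_calling"}

def pick_model(task_type: str) -> str:
    """Route classification and other bulk tasks to the cheap model."""
    return PREMIUM_MODEL if task_type in PREMIUM_TASKS else CHEAP_MODEL

def call_model(model: str, prompt: str) -> str:
    # Placeholder: swap in your provider SDK's completion call here.
    return f"[{model}] response to: {prompt[:40]}"

print(pick_model("classification"))  # devstral-medium
print(pick_model("planning"))        # gpt-5.4
```

The routing key here is a caller-supplied task type; in practice you might instead classify the request itself with the cheap model first and escalate only when it flags complexity, trading one extra cheap call for fewer premium ones.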
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.