Devstral Medium vs GPT-5.4

For most production use cases that prioritize capability, GPT-5.4 is the better pick: it wins 11 of 12 tests in our suite and ties for first on long context, faithfulness, and agentic planning. Devstral Medium is the value choice: it wins only classification in our tests, but its per-token prices are roughly 6x to 7.5x lower than GPT-5.4's ($0.40/$2.00 versus $2.50/$15.00 per MTok, input/output), making it attractive for high-volume or budget-sensitive deployments.

Devstral Medium

Provider: Mistral
Overall: 3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.40/MTok
Output: $2.00/MTok
Context Window: 131K

modelpicker.net

GPT-5.4

Provider: OpenAI
Overall: 4.58/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 5/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: 76.9%
MATH Level 5: N/A
AIME 2025: 95.3%

Pricing

Input: $2.50/MTok
Output: $15.00/MTok
Context Window: 1050K


Benchmark Analysis

Summary: GPT-5.4 wins 11 categories in our 12-test suite; Devstral Medium wins 1 (classification). Detailed walk-through (score: Devstral → GPT-5.4):

  • Structured output: 4 → 5. GPT-5.4 wins and ties for 1st of 54, making it more reliable for strict JSON/schema compliance.
  • Strategic analysis: 2 → 5. GPT-5.4 wins and ties for 1st of 54. Expect stronger nuanced tradeoff reasoning and numeric planning from GPT-5.4.
  • Constrained rewriting: 3 → 4. GPT-5.4 wins (rank 6 of 53). It better preserves content while compressing to tight character limits.
  • Creative problem solving: 2 → 4. GPT-5.4 wins (rank 9 of 54). It generates more feasible, non-obvious ideas in our tests.
  • Tool calling: 3 → 4. GPT-5.4 wins (Devstral rank 47 of 54; GPT-5.4 rank 18 of 54). It is more accurate at selecting functions, filling arguments, and sequencing calls.
  • Faithfulness: 4 → 5. GPT-5.4 wins and ties for 1st of 55. It better resists hallucination and sticks to sources in our testing.
  • Long context: 4 → 5. GPT-5.4 wins and ties for 1st of 55. In practice, it performs better on retrieval and reasoning across 30K+ token contexts.
  • Safety calibration: 1 → 5. GPT-5.4 wins and ties for 1st of 55. It consistently refuses harmful prompts while permitting legitimate ones; Devstral scores the minimum here.
  • Persona consistency: 3 → 5. GPT-5.4 wins and ties for 1st of 53. It better maintains character and resists prompt injection.
  • Agentic planning: 4 → 5. GPT-5.4 wins and ties for 1st of 54. It decomposes goals and recovers from failure more robustly.
  • Multilingual: 4 → 5. GPT-5.4 wins and ties for 1st of 55. It produces higher-quality non-English output in our tests.
  • Classification: 4 → 3. Devstral Medium wins (Devstral tied for 1st with 29 others; GPT-5.4 rank 31 of 53). Devstral is the stronger choice for basic routing and categorization tasks in our suite.

External benchmarks (supplementary): On SWE-bench Verified (Epoch AI), GPT-5.4 scores 76.9% and ranks 2 of 12; on AIME 2025 (Epoch AI), it scores 95.3% and ranks 3 of 23. These external results corroborate GPT-5.4’s strength on coding and competition-level math tasks. No external benchmark scores are available for Devstral Medium.

Overall interpretation: GPT-5.4 delivers higher capability on 11 of the 12 evaluated dimensions, with the largest gaps in safety calibration and strategic analysis and top ranks on long context, faithfulness, and agentic planning. Devstral’s one clear win is classification, alongside a much lower price point.

| Benchmark | Devstral Medium | GPT-5.4 |
| --- | --- | --- |
| Faithfulness | 4/5 | 5/5 |
| Long Context | 4/5 | 5/5 |
| Multilingual | 4/5 | 5/5 |
| Tool Calling | 3/5 | 4/5 |
| Classification | 4/5 | 3/5 |
| Agentic Planning | 4/5 | 5/5 |
| Structured Output | 4/5 | 5/5 |
| Safety Calibration | 1/5 | 5/5 |
| Strategic Analysis | 2/5 | 5/5 |
| Persona Consistency | 3/5 | 5/5 |
| Constrained Rewriting | 3/5 | 4/5 |
| Creative Problem Solving | 2/5 | 4/5 |
| Summary | 1 win | 11 wins |
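
The win counts can be re-derived from the per-test scores with a few lines of Python. The sketch below is illustrative only: the scores are copied from this comparison, while the `summarize` helper and its return shape are assumptions made for the example. It also computes an unweighted mean per model; the listed overall scores (3.17 and 4.58) are consistent with that mean, though the exact aggregation method is not stated here.

```python
# Per-test scores copied from this comparison: (Devstral Medium, GPT-5.4).
SCORES = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 5),
    "Multilingual": (4, 5),
    "Tool Calling": (3, 4),
    "Classification": (4, 3),
    "Agentic Planning": (4, 5),
    "Structured Output": (4, 5),
    "Safety Calibration": (1, 5),
    "Strategic Analysis": (2, 5),
    "Persona Consistency": (3, 5),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (2, 4),
}


def summarize(scores):
    """Return win counts, ties, and unweighted mean scores for two models.

    Hypothetical helper for illustration; not part of any published tooling.
    """
    pairs = list(scores.values())
    wins_a = sum(a > b for a, b in pairs)
    wins_b = sum(b > a for a, b in pairs)
    ties = len(pairs) - wins_a - wins_b
    mean_a = round(sum(a for a, _ in pairs) / len(pairs), 2)
    mean_b = round(sum(b for _, b in pairs) / len(pairs), 2)
    return {
        "devstral_wins": wins_a,
        "gpt_wins": wins_b,
        "ties": ties,
        "devstral_mean": mean_a,
        "gpt_mean": mean_b,
    }


if __name__ == "__main__":
    print(summarize(SCORES))
```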

Pricing Analysis

Devstral Medium input/output: $0.40 / $2.00 per MTok. GPT-5.4 input/output: $2.50 / $15.00 per MTok. Assuming a 50/50 split of input and output tokens, the blended rate is $1.20 per million tokens for Devstral and $8.75 for GPT-5.4, so costs are: 1M tokens → Devstral $1.20 vs GPT-5.4 $8.75 (Devstral saves $7.55); 10M → Devstral $12.00 vs GPT-5.4 $87.50; 100M → Devstral $120 vs GPT-5.4 $875. The gap matters for high-volume apps, startups, analytics pipelines, and any product where token volume scales into the millions and beyond: Devstral cuts recurring inference spend by about 7.3x under the 50/50 assumption. Teams prioritizing top-tier safety, long-context reasoning, or third-party benchmark excellence should budget for GPT-5.4’s higher rates.
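
To rerun this arithmetic with a different input/output mix, a minimal sketch follows. The prices are the per-million-token list prices from the pricing cards; the `blended_cost` function and its `output_share` parameter are hypothetical names introduced for this example, and the 50/50 default simply mirrors the assumption used in this section.

```python
# USD per million tokens (MTok), taken from the pricing cards in this comparison.
PRICES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "GPT-5.4": {"input": 2.50, "output": 15.00},
}


def blended_cost(model, total_tokens, output_share=0.5):
    """Cost in USD for `total_tokens`, where `output_share` is the output fraction.

    Hypothetical helper; the 0.5 default is an assumption, not measured traffic.
    """
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    price = PRICES[model]
    input_tokens = total_tokens * (1 - output_share)
    output_tokens = total_tokens * output_share
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


if __name__ == "__main__":
    for volume in (1_000_000, 10_000_000, 100_000_000):
        devstral = blended_cost("Devstral Medium", volume)
        gpt = blended_cost("GPT-5.4", volume)
        print(
            f"{volume:>11,} tokens: Devstral ${devstral:,.2f} "
            f"vs GPT-5.4 ${gpt:,.2f} ({gpt / devstral:.1f}x)"
        )
```

Output-heavy workloads widen the gap, because the output price ratio (7.5x) is larger than the input price ratio (6.25x).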

Real-World Cost Comparison

| Task | Devstral Medium | GPT-5.4 |
| --- | --- | --- |
| Chat response | $0.0011 | $0.0080 |
| Blog post | $0.0042 | $0.031 |
| Document batch | $0.108 | $0.800 |
| Pipeline run | $1.08 | $8.00 |

Bottom Line

Choose Devstral Medium if: you need a dramatically cheaper inference option ($0.40 input / $2.00 output per MTok), you run very high token volumes, or your primary tasks are high-throughput classification and cost-sensitive pipelines. Choose GPT-5.4 if: you require best-in-class long-context reasoning, faithfulness, safety calibration, or multilingual output, stronger tool calling, or top results on third-party coding and math benchmarks (SWE-bench Verified 76.9% and AIME 2025 95.3%, per Epoch AI). If budget is tight but you need some GPT-5.4 capabilities, test a hybrid approach: Devstral for bulk classification, GPT-5.4 for complex planning or safety-sensitive flows.
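
One way the hybrid approach might look in code is a small router that sends each request to one model or the other. This is a sketch under stated assumptions, not a recommended implementation: the task labels, the `Request` type, and the `pick_model` function are hypothetical, and the model strings are display names rather than real API identifiers.

```python
from dataclasses import dataclass

# Display names only; substitute the identifiers your provider actually uses.
CHEAP_MODEL = "Devstral Medium"
STRONG_MODEL = "GPT-5.4"

# Hypothetical task labels for work where Devstral scored well in this comparison.
CHEAP_TASKS = {"classification", "routing"}


@dataclass
class Request:
    task: str                       # hypothetical label such as "classification"
    safety_sensitive: bool = False  # route to the stronger model when True


def pick_model(request: Request) -> str:
    """Choose a model name for a request using simple, illustrative rules."""
    # Safety-sensitive flows always go to the model with the higher safety score.
    if request.safety_sensitive:
        return STRONG_MODEL
    # Bulk classification and routing go to the cheaper model.
    if request.task in CHEAP_TASKS:
        return CHEAP_MODEL
    # Everything else (planning, analysis, long context) defaults to the stronger model.
    return STRONG_MODEL


if __name__ == "__main__":
    print(pick_model(Request("classification")))
    print(pick_model(Request("classification", safety_sensitive=True)))
    print(pick_model(Request("agentic_planning")))
```

Defaulting unknown tasks to the stronger model trades cost for safety; a cost-first deployment could invert that default after validating quality on its own data.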

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions