Devstral Medium vs GPT-5.4
For most production use cases that prioritize capability, GPT-5.4 is the better pick—it wins 11 of 12 tests in our suite and posts top ranks on long-context, faithfulness, and agentic planning. Devstral Medium is the value choice: it wins only classification in our tests but costs a small fraction of GPT-5.4, making it attractive for high-volume or budget-sensitive deployments.
Mistral
Devstral Medium
Benchmark Scores
External Benchmarks
Pricing
Input
$0.40/MTok
Output
$2.00/MTok
modelpicker.net
OpenAI
GPT-5.4
Pricing
Input
$2.50/MTok
Output
$15.00/MTok
Benchmark Analysis
Summary: GPT-5.4 wins 11 categories in our 12-test suite; Devstral Medium wins 1 (classification). Detailed walk-through (score: Devstral → GPT-5.4):
- Structured output: 4 → 5 — GPT-5.4 wins and is tied for 1st on structured_output (rank: tied for 1st of 54). This means GPT-5.4 is more reliable for strict JSON/schema compliance.
- Strategic analysis: 2 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 54). Expect stronger nuanced tradeoff reasoning and numeric planning from GPT-5.4.
- Constrained rewriting: 3 → 4 — GPT-5.4 wins (rank 6 of 53). GPT-5.4 better preserves content while compressing to tight character limits.
- Creative problem solving: 2 → 4 — GPT-5.4 wins (rank 9 of 54). GPT-5.4 generates more feasible, non-obvious ideas in our tests.
- Tool calling: 3 → 4 — GPT-5.4 wins (Devstral rank 47 of 54; GPT-5.4 rank 18 of 54). GPT-5.4 is more accurate at selecting functions, arguments, and sequencing calls.
- Faithfulness: 4 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 55). GPT-5.4 better resists hallucination and sticks to sources in our testing.
- Long context: 4 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 55). Practically, GPT-5.4 performs better on retrieval and reasoning across 30K+ token contexts.
- Safety calibration: 1 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 55). GPT-5.4 consistently refuses harmful prompts while permitting legitimate ones; Devstral underperforms here.
- Persona consistency: 3 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 53). GPT-5.4 better maintains character and resists prompt injection.
- Agentic planning: 4 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 54). GPT-5.4 decomposes goals and recovers from failure more robustly.
- Multilingual: 4 → 5 — GPT-5.4 wins and is tied for 1st (rank 1 of 55). GPT-5.4 produces higher-quality non-English outputs in our tests.
- Classification: 4 → 3 — Devstral Medium wins (Devstral tied for 1st with 29 others; GPT-5.4 rank 31 of 53). Devstral matches or beats GPT-5.4 on basic routing/categorization tasks in our suite.
External benchmarks (supplementary): On SWE-bench Verified (Epoch AI), GPT-5.4 scores 76.9% (rank 2 of 12); on AIME 2025 (Epoch AI), it scores 95.3% (rank 3 of 23). These external results corroborate GPT-5.4's strength on coding and competition-level math. Devstral Medium has no external benchmark scores available. Overall interpretation: GPT-5.4 delivers higher capability across practically every evaluated dimension (especially safety, long-context, faithfulness, and agentic planning); Devstral's one clear win is classification, plus a much lower price point.
Pricing Analysis
Devstral Medium input/output: $0.40 / $2.00 per MTok. GPT-5.4 input/output: $2.50 / $15.00 per MTok. Assuming a 50/50 split of input vs. output tokens, blended costs are: 1M tokens → Devstral $1.20 vs GPT-5.4 $8.75; 100M → Devstral $120 vs GPT-5.4 $875; 1B → Devstral $1,200 vs GPT-5.4 $8,750 (Devstral saves $7,550). The gap matters for high-volume apps, startups, analytics pipelines, and any product where tokens scale into the billions: under the 50/50 assumption, Devstral cuts recurring inference spend by roughly 7.3x. Teams prioritizing top-tier safety, long-context reasoning, or third-party benchmark excellence should budget for GPT-5.4's higher rates.
Real-World Cost Comparison
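The blended-cost figures above reduce to a one-line calculation. A minimal sketch, using only the list prices from the tables above and the 50/50 input/output assumption (the `input_share` parameter lets you test other mixes):

```python
# Blended cost estimate assuming a configurable split of input vs. output
# tokens. Prices are USD per million tokens, from the pricing tables above.
PRICES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "GPT-5.4": {"input": 2.50, "output": 15.00},
}

def blended_cost(model: str, total_tokens: float, input_share: float = 0.5) -> float:
    """Estimated spend in USD for `total_tokens` tokens at the given input share."""
    p = PRICES[model]
    millions = total_tokens / 1_000_000
    return millions * (input_share * p["input"] + (1 - input_share) * p["output"])

for volume in (1e6, 1e8, 1e9):
    d = blended_cost("Devstral Medium", volume)
    g = blended_cost("GPT-5.4", volume)
    print(f"{volume:>15,.0f} tokens: Devstral ${d:>9,.2f} vs GPT-5.4 ${g:>9,.2f} ({g / d:.1f}x)")
```

At a 50/50 mix this reproduces the roughly 7.3x gap; a workload that is mostly input (e.g. retrieval-heavy summarization) narrows the per-token prices but the ratio stays in the 6-7x range.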
Bottom Line
Choose Devstral Medium if: you need a dramatically cheaper inference option ($0.40 input / $2.00 output per MTok), you run very high token volumes, or your primary tasks are high-throughput classification and cost-sensitive pipelines. Choose GPT-5.4 if: you require best-in-class long-context reasoning, faithfulness, safety calibration, tool calling, multilingual output, or top results on third-party coding/math benchmarks (SWE-bench Verified 76.9% and AIME 2025 95.3%, per Epoch AI). If budget is tight but you need some GPT-5.4 capabilities, test a hybrid approach: Devstral for bulk classification, GPT-5.4 for complex planning or safety-sensitive flows.
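One way to prototype the hybrid approach is a simple task router that sends cheap, high-volume classification calls to Devstral Medium and escalates capability-sensitive requests to GPT-5.4. A minimal sketch; the model identifiers, task labels, and `call_model` stub are illustrative assumptions, not a real provider API:

```python
# Illustrative task router for the hybrid setup: bulk work goes to the
# cheap model, capability-sensitive work goes to the premium model.
CHEAP_MODEL = "devstral-medium"   # assumed model identifier
PREMIUM_MODEL = "gpt-5.4"         # assumed model identifier

# Task types where the benchmark results above favor the premium model.
PREMIUM_TASKS = {"planning", "long_context", "safety_review", "tool_calling"}

def pick_model(task_type: str) -> str:
    """Route classification and other bulk tasks to the cheap model."""
    return PREMIUM_MODEL if task_type in PREMIUM_TASKS else CHEAP_MODEL

def call_model(model: str, prompt: str) -> str:
    # Placeholder: swap in your provider SDK's completion call here.
    return f"[{model}] response to: {prompt[:40]}"

print(pick_model("classification"))  # devstral-medium
print(pick_model("planning"))        # gpt-5.4
```

The routing key here is a caller-supplied task type; in practice you might instead classify the request itself with the cheap model first and escalate only when it flags complexity, trading one extra cheap call for fewer premium ones.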
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.