Devstral Medium vs GPT-5 Mini

GPT-5 Mini is the practical pick for most users: it wins 9 of our 12 benchmarks and leads on structured output, long context, faithfulness, and safety. Devstral Medium wins none of our benchmarks but may still make sense for provider preference or specific parameter support; note that Devstral charges $0.40 per million input tokens vs GPT-5 Mini's $0.25.

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)
Benchmark scores: see the head-to-head table under Benchmark Analysis below.

External benchmarks: SWE-bench Verified N/A · MATH Level 5 N/A · AIME 2025 N/A

Pricing: $0.40/MTok input · $2.00/MTok output
Context window: 131K

GPT-5 Mini (OpenAI)

Overall: 4.33/5 (Strong)
Benchmark scores: see the head-to-head table under Benchmark Analysis below.

External benchmarks: SWE-bench Verified 64.7% · MATH Level 5 97.8% · AIME 2025 86.7%

Pricing: $0.25/MTok input · $2.00/MTok output
Context window: 400K

Benchmark Analysis

Summary of head-to-head scores from our 12-test suite (Devstral Medium = A, GPT-5 Mini = B). Wins/ties: B wins 9 tests, A wins 0, and 3 are tied.

Wins for GPT-5 Mini (B vs A scores):

  • structured_output 5 vs 4 (B tied for 1st of 54 models)
  • strategic_analysis 5 vs 2 (B tied for 1st of 54)
  • constrained_rewriting 4 vs 3 (B rank 6 of 53)
  • creative_problem_solving 4 vs 2 (B rank 9 of 54)
  • faithfulness 5 vs 4 (B tied for 1st of 55)
  • long_context 5 vs 4 (B tied for 1st of 55; important for 30K+ retrieval)
  • safety_calibration 3 vs 1 (B rank 10 of 55)
  • persona_consistency 5 vs 3 (B tied for 1st of 53)
  • multilingual 5 vs 4 (B tied for 1st of 55)

Ties:

  • tool_calling 3 vs 3 (both rank 47 of 54)
  • classification 4 vs 4 (both tied for 1st with many models)
  • agentic_planning 4 vs 4 (both mid-top: rank 16 of 54)

What this means in practice:

  • Structured output (JSON schema compliance): GPT-5 Mini's 5/5 and tie for top rank indicate stronger adherence to strict formats; expect fewer schema fixes and less post-processing when you need exact JSON/CSV outputs (see the first sketch after this list).
  • Long-context and retrieval: GPT-5 Mini scores 5/5 and is tied for 1st, with a 400,000-token context window listed; this supports tasks that require 30K+ token retrieval or very large documents. Devstral Medium lists a 131,072-token context window and scored 4/5, so it is competent but behind GPT-5 Mini on our long-context tests (see the second sketch after this list).
  • Strategic analysis and faithfulness: GPT-5 Mini's 5/5 on strategic_analysis and faithfulness (tied for 1st) means it handles nuanced tradeoffs and sticks to source material better in our probes; Devstral scored 2/5 on strategic_analysis and 4/5 on faithfulness, so expect weaker numeric tradeoff reasoning yet decent fidelity to sources.
  • Safety and persona: GPT-5 Mini outperforms on safety_calibration (3 vs 1) and persona_consistency (5 vs 3), so it's more likely to follow refusal/safety guidance and maintain character in our tests.
  • Coding and tool workflows: tool_calling ties at 3/5 for both models and both rank 47 of 54, so neither has a clear advantage on function selection or sequencing in our suite.
  • External benchmarks (Epoch AI): GPT-5 Mini scores 64.7% on SWE-bench Verified (rank 8 of 12), 97.8% on MATH Level 5 (shared rank 2 of 14), and 86.7% on AIME 2025 (rank 9 of 23). These third-party results support GPT-5 Mini's strong math performance. Devstral Medium has no external benchmark scores listed.
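To make the structured-output point concrete, here is a minimal sketch of the kind of strict-schema check such a test implies: parse the model's reply as JSON and validate it against a schema. The invoice schema and sample reply are hypothetical, and the check relies on the third-party jsonschema package.

```python
# Hedged sketch: a strict JSON-schema gate of the sort a structured_output
# test implies. Schema and sample reply are made up for illustration.
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR"]},
    },
    "required": ["invoice_id", "total", "currency"],
    "additionalProperties": False,
}

def check_reply(reply_text: str) -> bool:
    """True only if the reply is valid JSON that satisfies the schema."""
    try:
        validate(instance=json.loads(reply_text), schema=INVOICE_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError) as err:
        print(f"schema violation: {err}")
        return False

# A 5/5 structured-output model should pass without post-processing:
print(check_reply('{"invoice_id": "INV-7", "total": 129.5, "currency": "USD"}'))
```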
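For the long-context point, a rough sketch of routing by context window under the listed limits (131K vs 400K). The model keys are hypothetical identifiers, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
# Hedged sketch: pick a model whose context window fits the document.
CONTEXT_WINDOWS = {"devstral-medium": 131_072, "gpt-5-mini": 400_000}

def fits(model: str, document: str, reserve_for_output: int = 4_096) -> bool:
    """True if the document plus an output budget fits the model's window."""
    estimated_tokens = len(document) // 4  # ~4 chars/token heuristic
    return estimated_tokens + reserve_for_output <= CONTEXT_WINDOWS[model]

big_doc = "x" * 1_000_000  # ~250K estimated tokens
print(fits("devstral-medium", big_doc))  # False: exceeds 131K
print(fits("gpt-5-mini", big_doc))       # True: within 400K
```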
Benchmark                  Devstral Medium   GPT-5 Mini
Faithfulness               4/5               5/5
Long Context               4/5               5/5
Multilingual               4/5               5/5
Tool Calling               3/5               3/5
Classification             4/5               4/5
Agentic Planning           4/5               4/5
Structured Output          4/5               5/5
Safety Calibration         1/5               3/5
Strategic Analysis         2/5               5/5
Persona Consistency        3/5               5/5
Constrained Rewriting      3/5               4/5
Creative Problem Solving   2/5               4/5
Summary                    0 wins            9 wins

Pricing Analysis

Assumption: a representative workload splits tokens 50/50 between input and output. Cost per million total tokens:

  • Devstral Medium (input $0.40/MTok, output $2.00/MTok): 0.5M input tokens * $0.40 = $0.20; 0.5M output tokens * $2.00 = $1.00; total = $1.20 per 1M tokens.
  • GPT-5 Mini (input $0.25/MTok, output $2.00/MTok): 0.5M input tokens * $0.25 = $0.125; 0.5M output tokens * $2.00 = $1.00; total = $1.125 per 1M tokens.

Scale examples (same 50/50 split):

  • 1M tokens/month: Devstral $1.20 vs GPT-5 Mini $1.13 (GPT-5 Mini saves about $0.08).
  • 10M tokens/month: Devstral $12.00 vs GPT-5 Mini $11.25 (saves $0.75).
  • 100M tokens/month: Devstral $120.00 vs GPT-5 Mini $112.50 (saves $7.50).
  • 1B tokens/month: Devstral $1,200 vs GPT-5 Mini $1,125 (saves $75).

Who should care: high-volume apps, batch processing, or analytics teams; the roughly 6% saving compounds at scale. For small-scale or latency-driven experiments the per-month delta is trivial, but at production volumes in the hundreds of millions to billions of tokens the difference becomes material.
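The blended-cost arithmetic above packages neatly into a small helper; this sketch simply reproduces the math under the stated 50/50 split and the per-MTok prices from the cards.

```python
# Reproduces the blended-cost math above. Prices are dollars per million
# tokens (MTok); the 50/50 input/output split is the stated assumption.
def blended_cost(total_tokens: float, input_per_mtok: float,
                 output_per_mtok: float, input_share: float = 0.5) -> float:
    """Dollar cost of a workload at the given per-MTok prices."""
    input_cost = total_tokens * input_share / 1e6 * input_per_mtok
    output_cost = total_tokens * (1 - input_share) / 1e6 * output_per_mtok
    return input_cost + output_cost

for monthly in (10e6, 100e6, 1e9):
    devstral = blended_cost(monthly, 0.40, 2.00)
    gpt5_mini = blended_cost(monthly, 0.25, 2.00)
    print(f"{monthly / 1e6:,.0f}M tokens/month: "
          f"Devstral ${devstral:,.2f} vs GPT-5 Mini ${gpt5_mini:,.2f}")
```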

Real-World Cost Comparison

Task             Devstral Medium   GPT-5 Mini
Chat response    $0.0011           $0.0010
Blog post        $0.0042           $0.0041
Document batch   $0.108            $0.105
Pipeline run     $1.08             $1.05

Bottom Line

Choose GPT-5 Mini if: you need best-in-suite structured output, long context (400K tokens), stronger safety calibration, multilingual parity, or top-ranked strategic analysis and math (Epoch AI MATH Level 5: 97.8%). Its lower input price ($0.25/MTok vs $0.40/MTok) also reduces costs at scale. Choose Devstral Medium if: you prefer the Mistral provider, require specific parameters that Devstral lists as supported (e.g., frequency_penalty, temperature, top_p; see the sketch below), or are experimenting at small scale where the roughly 6% cost difference is negligible. Note: in our tests Devstral Medium did not win any benchmark against GPT-5 Mini.
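If parameter support is the deciding factor, here is a minimal sketch of passing those sampling parameters through an OpenAI-style chat completions call. The endpoint URL and model id are assumptions for illustration; check the provider's documentation before relying on them.

```python
# Hedged sketch: exercising the sampling parameters mentioned above
# (temperature, top_p, frequency_penalty) against an OpenAI-style API.
import os

import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",  # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "devstral-medium-latest",  # assumed model id
        "messages": [{"role": "user", "content": "Summarize this diff: ..."}],
        "temperature": 0.2,        # lower = more deterministic sampling
        "top_p": 0.9,              # nucleus sampling cutoff
        "frequency_penalty": 0.3,  # discourage verbatim repetition
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```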

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
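For a flavor of what 1-5 judge scoring involves mechanically, here is a toy score-extraction helper. The reply format and regex are our assumptions, not the published methodology.

```python
# Illustrative only: pull a standalone 1-5 score out of free-text judge output.
import re

def parse_judge_score(judge_reply: str) -> int | None:
    """Return the first standalone integer 1-5 in the judge's reply, else None."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None

print(parse_judge_score("Score: 4. Follows the schema but misses one field."))  # 4
```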

Frequently Asked Questions