Devstral Medium vs GPT-5 Mini
GPT-5 Mini is the practical pick for most users: it wins the majority (9/12) of our benchmarks and leads on structured output, long context, faithfulness, and safety. Devstral Medium does not win any benchmark in our tests but may still be chosen for provider preference or specific parameter support; note that Devstral charges $0.40 per million input tokens vs GPT-5 Mini's $0.25.
Devstral Medium (mistral)
Pricing:
- Input: $0.400/MTok
- Output: $2.00/MTok

GPT-5 Mini (openai)
Pricing:
- Input: $0.250/MTok
- Output: $2.00/MTok
Benchmark Analysis
Summary of head-to-head scores from our 12-test suite (Devstral Medium = A, GPT-5 Mini = B; scores listed as B vs A). Overall: B wins 9 tests, A wins 0, and 3 are ties.

Wins for GPT-5 Mini:
- structured_output: 5 vs 4 (B tied for 1st of 54 models)
- strategic_analysis: 5 vs 2 (B tied for 1st of 54)
- constrained_rewriting: 4 vs 3 (B rank 6 of 53)
- creative_problem_solving: 4 vs 2 (B rank 9 of 54)
- faithfulness: 5 vs 4 (B tied for 1st of 55)
- long_context: 5 vs 4 (B tied for 1st of 55; important for 30K+ token retrieval)
- safety_calibration: 3 vs 1 (B rank 10 of 55)
- persona_consistency: 5 vs 3 (B tied for 1st of 53)
- multilingual: 5 vs 4 (B tied for 1st of 55)

Ties:
- tool_calling: 3 vs 3 (both rank 47 of 54)
- classification: 4 vs 4 (both tied for 1st with many models)
- agentic_planning: 4 vs 4 (both mid-top: rank 16 of 54)

What this means in practice:
- Structured output (JSON schema compliance): GPT-5 Mini's 5/5 and tie for top rank indicate stronger adherence to strict formats; expect fewer schema fixes and less post-processing when you need exact JSON/CSV outputs.
- Long-context and retrieval: GPT-5 Mini scores 5/5 and is tied for 1st, with a 400,000-token context window listed — this supports tasks that require 30K+ token retrieval or very large documents. Devstral Medium lists a 131,072 context window and scored 4/5, so it is competent but behind GPT-5 Mini on our long-context tests.
- Strategic analysis and faithfulness: GPT-5 Mini's 5/5 on strategic_analysis and faithfulness (tied for 1st) means it handles nuanced tradeoffs and sticks to source material better in our probes; Devstral scored 2/5 on strategic_analysis and 4/5 on faithfulness, so expect weaker numeric tradeoff reasoning yet decent fidelity to sources.
- Safety and persona: GPT-5 Mini outperforms on safety_calibration (3 vs 1) and persona_consistency (5 vs 3), so it's more likely to follow refusal/safety guidance and maintain character in our tests.
- Coding and tool workflows: tool_calling ties at 3/5 for both models, and both rank 47 of 54, so neither has a clear advantage on function selection or sequencing in our suite.

External benchmarks (Epoch AI): GPT-5 Mini scores 64.7% on SWE-bench Verified, 97.8% on MATH Level 5, and 86.7% on AIME 2025 (rankings: SWE-bench 8/12, math_level_5 2/14 shared, aime_2025 9/23). These third-party results support GPT-5 Mini's strong math performance. Devstral Medium has no external benchmark scores reported.
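The "schema fixes and post-processing" mentioned under structured output above can be made concrete with a small sketch. This is a minimal, stdlib-only validator assuming a hypothetical two-field schema (name/score); it is illustrative, not part of our test harness, and real deployments would typically use a full JSON Schema library.

```python
import json

# Required fields and their types for the illustrative schema.
REQUIRED = {"name": str, "score": int}

def validate_reply(raw: str) -> dict:
    """Parse a model reply and verify it matches the expected schema.

    Raises ValueError if the JSON is malformed or a field is
    missing or has the wrong type.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"field {key!r} missing or not {typ.__name__}")
    return data

reply = validate_reply('{"name": "structured_output", "score": 5}')
print(reply["score"])  # 5
```

A model with weaker format adherence forces more of this defensive parsing (and retry loops on failure); a model that reliably emits exact JSON lets you keep this layer thin.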
Pricing Analysis
Assumption: a representative workload splits tokens 50/50 between input and output. Blended cost per million total tokens (prices are per MTok, i.e., per million tokens):
- Devstral Medium (input $0.40/MTok, output $2.00/MTok): 0.5 MTok input × $0.40 = $0.20; 0.5 MTok output × $2.00 = $1.00; total = $1.20 per 1M tokens.
- GPT-5 Mini (input $0.25/MTok, output $2.00/MTok): 0.5 MTok input × $0.25 = $0.125; 0.5 MTok output × $2.00 = $1.00; total = $1.125 per 1M tokens.

Scale examples (same 50/50 split):
- 10M tokens/month: Devstral $12.00 vs GPT-5 Mini $11.25 (GPT-5 Mini saves $0.75).
- 100M tokens/month: Devstral $120 vs GPT-5 Mini $112.50 (saves $7.50).
- 1B tokens/month: Devstral $1,200 vs GPT-5 Mini $1,125 (saves $75). Who should care: the delta is roughly 6% of blended spend, so it compounds for high-volume apps, batch processing, and analytics teams running hundreds of millions of tokens per month. For small-scale or latency-driven experiments the per-month difference is negligible.
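The blended-cost arithmetic above can be sketched as a small helper. The function name and the 50/50 split default are assumptions for illustration; prices are the $/MTok figures from the listings above.

```python
def blended_cost(total_tokens: int, input_price: float, output_price: float,
                 input_share: float = 0.5) -> float:
    """Dollar cost for total_tokens, given $/MTok prices and an input share."""
    mtok = total_tokens / 1_000_000
    return mtok * (input_share * input_price + (1 - input_share) * output_price)

# 10M tokens/month at a 50/50 input/output split:
devstral = blended_cost(10_000_000, 0.40, 2.00)   # 12.00
gpt5_mini = blended_cost(10_000_000, 0.25, 2.00)  # 11.25
print(f"Devstral ${devstral:.2f} vs GPT-5 Mini ${gpt5_mini:.2f}, "
      f"delta ${devstral - gpt5_mini:.2f}")
```

Adjusting `input_share` matters: retrieval-heavy workloads (long prompts, short answers) skew toward input pricing, where GPT-5 Mini's advantage is largest, while generation-heavy workloads converge since output pricing is identical.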
Bottom Line
Choose GPT-5 Mini if: you need best-in-suite structured output, long context (400K tokens), stronger safety calibration, multilingual parity, or top-ranked strategic analysis and math (Epoch AI math_level_5 97.8%). Its lower input pricing ($0.25/MTok) also reduces costs at scale. Choose Devstral Medium if: you prefer the mistral provider, require specific supported parameters that Devstral lists (e.g., frequency_penalty, temperature, top_p), or are experimenting at small scale where the ~6% blended cost difference (about $0.08 per million tokens at a 50/50 split) is negligible. Note: in our tests Devstral Medium did not win any benchmark against GPT-5 Mini.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.