Devstral Medium vs GPT-4o-mini
GPT-4o-mini is the better pick for most production use cases: it wins the critical tool-calling, safety-calibration, and persona-consistency tests and costs far less. Devstral Medium outperforms GPT-4o-mini on faithfulness and agentic planning, so choose it when strict adherence to source material and stronger goal decomposition matter despite a higher price.
Devstral Medium (Mistral)
Pricing
Input: $0.400/MTok
Output: $2.00/MTok
modelpicker.net
GPT-4o-mini (OpenAI)
Pricing
Input: $0.150/MTok
Output: $0.600/MTok
Benchmark Analysis
Test-by-test in our 12-test suite:

- Tool calling: GPT-4o-mini 4 vs Devstral Medium 3. GPT-4o-mini wins and ranks 18 of 54 (many models share the score); Devstral Medium ranks 47 of 54, indicating weaker function selection and argument accuracy in our tests.
- Safety calibration: GPT-4o-mini 4 vs Devstral Medium 1. GPT-4o-mini wins decisively and ranks 6 of 55; Devstral Medium sits at rank 32, so GPT-4o-mini is far better at refusing harmful inputs while permitting legitimate ones.
- Persona consistency: GPT-4o-mini 4 vs Devstral Medium 3. GPT-4o-mini wins (rank 38 vs 45), meaning it better maintains character and resists prompt injection in our runs.
- Faithfulness: Devstral Medium 4 vs GPT-4o-mini 3. Devstral Medium wins, ranking 34 of 55 vs GPT-4o-mini at 52 of 55, so it is more likely to stick to source material and avoid hallucination in our tests.
- Agentic planning: Devstral Medium 4 vs GPT-4o-mini 3. Devstral Medium wins (rank 16 vs 42), showing stronger goal decomposition and failure-recovery behavior.
- Classification: tie at 4/4. Both tied for 1st (along with 29 other models), so routing and classification performance is comparable.
- Long context, structured output, constrained rewriting, creative problem solving, strategic analysis, multilingual: all ties in our scoring (equal numeric values).

Practical meaning: pick GPT-4o-mini when you need reliable tool integrations, safety behavior, and persona handling at scale; pick Devstral Medium when faithfulness and multi-step agentic planning are the deciding factors.

External benchmarks: GPT-4o-mini also has measurable third-party results — 52.6% on MATH Level 5 and 6.9% on AIME 2025 (Epoch AI). These are supplementary data points from Epoch AI, not our internal scores.
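The per-test outcomes above can be sketched as data for programmatic comparison. This is a minimal illustration (the dictionary keys are hypothetical names, not an API); only the six tests called out above are listed, since the remainder are ties:

```python
# Internal 1-5 scores from our 12-test suite, per the analysis above.
# Only the differentiated or notable tests are listed; the rest are ties.
scores = {
    "tool_calling":        {"devstral_medium": 3, "gpt_4o_mini": 4},
    "safety_calibration":  {"devstral_medium": 1, "gpt_4o_mini": 4},
    "persona_consistency": {"devstral_medium": 3, "gpt_4o_mini": 4},
    "faithfulness":        {"devstral_medium": 4, "gpt_4o_mini": 3},
    "agentic_planning":    {"devstral_medium": 4, "gpt_4o_mini": 3},
    "classification":      {"devstral_medium": 4, "gpt_4o_mini": 4},
}

def winner(test: str) -> str:
    """Return the higher-scoring model for a test, or 'tie'."""
    d = scores[test]["devstral_medium"]
    g = scores[test]["gpt_4o_mini"]
    if d == g:
        return "tie"
    return "devstral_medium" if d > g else "gpt_4o_mini"

for test in scores:
    print(f"{test}: {winner(test)}")
```

Encoding the scores this way makes the headline split visible at a glance: GPT-4o-mini takes three tests, Devstral Medium takes two, and the rest tie.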
Pricing Analysis
Per the pricing above, Devstral Medium costs $0.40 input / $2.00 output per 1M tokens; GPT-4o-mini costs $0.15 input / $0.60 output per 1M tokens (a blended ratio of 3.2× at a 50/50 input/output split; the output-only ratio is 3.33×). Assuming that 50/50 split: at 1M total tokens/month, Devstral Medium ≈ $1.20/month (500k input → $0.20; 500k output → $1.00) vs GPT-4o-mini ≈ $0.375/month (500k input → $0.075; 500k output → $0.30). At 10M tokens/month, multiply by 10: $12 vs $3.75. At 100M tokens/month: $120 vs $37.50. At 1B tokens/month: $1,200 vs $375. Teams running high-volume services (chat fleets, large-scale API integrations) should care about this gap: GPT-4o-mini cuts recurring token costs by roughly two-thirds in this balanced scenario. Organizations that prioritize Devstral Medium's wins in faithfulness and agentic planning must budget for the higher spend or reserve it for high-value requests.
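The arithmetic above can be reproduced with a small cost calculator. This is a sketch under the same assumptions (per-1M-token rates from the pricing section, a 50/50 input/output split by default; the function and variable names are illustrative):

```python
# Per-1M-token rates (input, output) from the pricing section above.
RATES = {
    "devstral_medium": (0.40, 2.00),
    "gpt_4o_mini":     (0.15, 0.60),
}

def monthly_cost(model: str, total_tokens: int, input_share: float = 0.5) -> float:
    """Estimated monthly USD cost for a given total token volume."""
    rate_in, rate_out = RATES[model]
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

# 1M total tokens/month at a 50/50 split:
print(monthly_cost("devstral_medium", 1_000_000))  # → 1.2
print(monthly_cost("gpt_4o_mini", 1_000_000))      # → 0.375
```

Because cost scales linearly with volume, the 10M and 100M figures follow by multiplying these outputs by 10 and 100.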
Real-World Cost Comparison
Bottom Line
Choose Devstral Medium if you need stronger faithfulness and agentic planning (it scores 4 on both in our tests and ranks notably higher on agentic planning) and can pay roughly 3.2× more per blended token for higher accuracy on those dimensions. Choose GPT-4o-mini if you need cost efficiency at scale ($0.15/$0.60 per MTok input/output), better tool calling (4 vs 3), stronger safety calibration (4 vs 1), and better persona consistency (4 vs 3); also pick it when multimodal inputs (text + image + file) are required.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.