Devstral Medium vs GPT-4o-mini

GPT-4o-mini is the better pick for most production use cases: it wins the critical tool-calling, safety-calibration, and persona-consistency tests and costs far less. Devstral Medium outperforms GPT-4o-mini on faithfulness and agentic planning, so choose it when strict adherence to source material and stronger goal decomposition matter despite a higher price.

Mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K

modelpicker.net

OpenAI

GPT-4o-mini

Overall
3.42/5 (Usable)

Benchmark Scores

Faithfulness
3/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
4/5
Strategic Analysis
2/5
Persona Consistency
4/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
52.6%
AIME 2025
6.9%

Pricing

Input

$0.150/MTok

Output

$0.600/MTok

Context Window: 128K


Benchmark Analysis

Test-by-test results from our 12-test suite:

- Tool calling: GPT-4o-mini 4 vs Devstral Medium 3. GPT-4o-mini wins and ranks 18 of 54 (many models share the score); Devstral ranks 47 of 54, indicating weaker function selection and argument accuracy in our tests.
- Safety calibration: GPT-4o-mini 4 vs Devstral Medium 1. GPT-4o-mini wins decisively and ranks 6 of 55; Devstral sits at rank 32, so GPT-4o-mini is far better at refusing harmful inputs while permitting legitimate ones.
- Persona consistency: GPT-4o-mini 4 vs Devstral Medium 3. GPT-4o-mini wins (rank 38 vs Devstral's rank 45), meaning it better maintains character and resists prompt injection in our runs.
- Faithfulness: Devstral Medium 4 vs GPT-4o-mini 3. Devstral wins, ranking 34 of 55 vs GPT-4o-mini at 52 of 55, so Devstral is more likely to stick to source material and avoid hallucination in our tests.
- Agentic planning: Devstral Medium 4 vs GPT-4o-mini 3. Devstral wins (rank 16 vs 42), showing stronger goal decomposition and failure-recovery behavior.
- Classification: tie at 4/4. Both tied for 1st (along with 29 other models), so routing and classification performance is comparable.
- Long context, structured output, constrained rewriting, creative problem solving, strategic analysis, multilingual: all ties in our scoring (equal numeric values).

Practical meaning: pick GPT-4o-mini when you need reliable tool integrations, safety behavior, and persona handling at scale; pick Devstral Medium when faithfulness and multi-step agentic planning are the deciding factors.

External benchmarks: GPT-4o-mini also has measurable third-party results: 52.6% on MATH Level 5 and 6.9% on AIME 2025. These are supplementary data points from Epoch AI, not our internal scores.

Benchmark | Devstral Medium | GPT-4o-mini
Faithfulness | 4/5 | 3/5
Long Context | 4/5 | 4/5
Multilingual | 4/5 | 4/5
Tool Calling | 3/5 | 4/5
Classification | 4/5 | 4/5
Agentic Planning | 4/5 | 3/5
Structured Output | 4/5 | 4/5
Safety Calibration | 1/5 | 4/5
Strategic Analysis | 2/5 | 2/5
Persona Consistency | 3/5 | 4/5
Constrained Rewriting | 3/5 | 3/5
Creative Problem Solving | 2/5 | 2/5
Summary | 2 wins | 3 wins

Pricing Analysis

Per the pricing above, Devstral Medium costs $0.40 input / $2.00 output per million tokens; GPT-4o-mini costs $0.15 input / $0.60 output per million tokens (2.67x on input, 3.33x on output). Assuming a 50/50 split of input/output tokens: at 1M total tokens/month, Devstral Medium ≈ $1.20/month (500k input → $0.20; 500k output → $1.00) vs GPT-4o-mini ≈ $0.375/month (500k input → $0.075; 500k output → $0.30). At 10M tokens/month, multiply by 10: $12.00 vs $3.75. At 100M tokens/month: $120 vs $37.50. Anyone running high-volume services (chat fleets, large-scale API integrations) should care about this gap: in this balanced scenario, GPT-4o-mini cuts recurring token costs by roughly two-thirds. Organizations that prioritize Devstral Medium's wins in faithfulness and agentic planning must budget for the higher spend or reserve it for high-value requests.
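The arithmetic above can be sketched as a small cost estimator. This is an illustrative helper, not part of any API; the prices come from the pricing cards above, and the 50/50 input/output split is the same assumption used in the scenario.

```python
# Rough monthly-cost estimator, assuming a 50/50 input/output token split.
# Prices are USD per million tokens (MTok), from the pricing cards above.
PRICES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "GPT-4o-mini": {"input": 0.15, "output": 0.60},
}

def monthly_cost(model: str, total_tokens: float, input_share: float = 0.5) -> float:
    """Cost in USD for total_tokens per month at the given input share."""
    p = PRICES[model]
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

for volume in (1_000_000, 10_000_000, 100_000_000):
    d = monthly_cost("Devstral Medium", volume)
    g = monthly_cost("GPT-4o-mini", volume)
    print(f"{volume:>11,} tokens: Devstral Medium ${d:,.2f} vs GPT-4o-mini ${g:,.2f}")
```

Changing `input_share` shows how the gap narrows for input-heavy workloads (2.67x) and widens toward 3.33x for output-heavy ones.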

Real-World Cost Comparison

Task | Devstral Medium | GPT-4o-mini
Chat response | $0.0011 | <$0.001
Blog post | $0.0042 | $0.0013
Document batch | $0.108 | $0.033
Pipeline run | $1.08 | $0.330

Bottom Line

Choose Devstral Medium if: you need stronger faithfulness and agentic planning (it scores 4 on both in our tests and ranks better on agentic planning), and you can pay roughly 3x more per token (2.67x on input, 3.33x on output) for higher accuracy on those dimensions. Choose GPT-4o-mini if: you need cost efficiency at scale ($0.15 input / $0.60 output per MTok), better tool calling (4 vs 3), stronger safety calibration (4 vs 1), and better persona consistency (4 vs 3); also pick it when multimodal inputs (text, image, file) are required.
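The decision rules above can be encoded as a toy router. This is a hypothetical helper for illustration only, not a real library API; it simply restates this comparison's conclusions as code.

```python
def pick_model(needs_faithfulness: bool = False,
               needs_agentic_planning: bool = False,
               needs_multimodal: bool = False) -> str:
    """Toy router encoding the decision rules from this comparison."""
    if needs_multimodal:
        # Per this comparison, multimodal input requires GPT-4o-mini.
        return "GPT-4o-mini"
    if needs_faithfulness or needs_agentic_planning:
        # Devstral Medium wins faithfulness and agentic planning (4 vs 3).
        return "Devstral Medium"
    # Default: cheaper, with better tool calling, safety, and persona scores.
    return "GPT-4o-mini"

print(pick_model(needs_faithfulness=True))  # → Devstral Medium
print(pick_model())                         # → GPT-4o-mini
```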

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions