Devstral Small 1.1 vs GPT-4o-mini
For most production chat and agentic workflows, pick GPT-4o-mini: it wins more benchmark categories in our testing (safety calibration 4 vs 2, persona consistency 4 vs 2, agentic planning 3 vs 2) and supports multimodal inputs. Choose Devstral Small 1.1 when cost and faithfulness matter: it costs roughly half as much on typical token mixes and scores higher on faithfulness (4 vs 3).
Devstral Small 1.1 (Mistral): input $0.100/MTok, output $0.300/MTok
GPT-4o-mini (OpenAI): input $0.150/MTok, output $0.600/MTok
Benchmark Analysis
Across our 12-test suite the two models tie on eight tasks, GPT-4o-mini wins three, and Devstral Small 1.1 wins one. Detailed walkthrough (our scores):
- Wins for GPT-4o-mini: safety calibration 4 vs 2 (GPT rank 6 of 55; Devstral rank 12 of 55), which implies GPT-4o-mini is substantially better at refusing harmful requests while permitting legitimate ones; persona consistency 4 vs 2 (GPT rank 38 of 53; Devstral rank 51 of 53), where GPT-4o-mini better maintains character and resists prompt injection; and agentic planning 3 vs 2 (GPT rank 42 of 54; Devstral rank 53 of 54), where GPT-4o-mini showed stronger goal decomposition and failure recovery in our agentic tests.
- Win for Devstral Small 1.1: faithfulness 4 vs 3 (Devstral rank 34 of 55; GPT rank 52 of 55). In practice this means Devstral is more likely to stick to source material and avoid mild hallucinations on our tasks.
- Ties (equal scores): structured output 4/4, strategic analysis 2/2, constrained rewriting 3/3, creative problem solving 2/2, tool calling 4/4, classification 4/4, long context 4/4, multilingual 4/4. For these, both models performed similarly in our testing: e.g., both placed 1st in classification (alongside 29 other models), and both matched on tool calling and long-context retrieval at 30K+ tokens.
- External math benchmarks (Epoch AI): GPT-4o-mini posts 52.6% on MATH Level 5 and 6.9% on AIME 2025. These external scores are supplementary data points that reflect task-specific math performance as reported by Epoch AI, not our internal 1–5 scores.

In short: GPT-4o-mini is stronger where safety, persona consistency, and resilient planning matter; Devstral is measurably more faithful and substantially cheaper. Many common tasks (classification, tool calling, long context, multilingual output) were ties in our tests, so cost and modality become the deciding factors for those use cases.
Pricing Analysis
Costs are quoted per million tokens (MTok). Devstral Small 1.1: input $0.10/MTok, output $0.30/MTok. GPT-4o-mini: input $0.15/MTok, output $0.60/MTok. Assuming a 50/50 split of input/output tokens, the blended price is $0.20/MTok for Devstral and $0.375/MTok for GPT-4o-mini, so total monthly costs are: 1B tokens: Devstral $200 vs GPT-4o-mini $375; 10B tokens: Devstral $2,000 vs GPT-4o-mini $3,750; 100B tokens: Devstral $20,000 vs GPT-4o-mini $37,500. The gap grows linearly with usage; teams running billions of tokens/month should care (savings of $1,750 at 10B tokens, $17,500 at 100B). If your app is latency- or feature-constrained rather than token-cost constrained, GPT-4o-mini's higher price may be justified by its safety and persona gains.
Real-World Cost Comparison
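To make the arithmetic concrete, here is a minimal cost-calculator sketch using the per-MTok prices listed above. The 50/50 input/output split mirrors the assumption in our pricing analysis; the model keys and the `input_share` parameter are illustrative, so adjust them for your real traffic mix.

```python
# Minimal sketch of the blended-cost arithmetic above. The model keys are
# illustrative labels, not provider API identifiers.

PRICES_PER_MTOK = {
    # model: (input $/MTok, output $/MTok)
    "devstral-small-1.1": (0.10, 0.30),
    "gpt-4o-mini": (0.15, 0.60),
}

def monthly_cost(model: str, total_tokens: float, input_share: float = 0.5) -> float:
    """Blended monthly cost in dollars for a given total token volume."""
    input_price, output_price = PRICES_PER_MTOK[model]
    mtok = total_tokens / 1_000_000
    return mtok * (input_share * input_price + (1 - input_share) * output_price)

for volume in (1e9, 10e9, 100e9):
    devstral = monthly_cost("devstral-small-1.1", volume)
    gpt = monthly_cost("gpt-4o-mini", volume)
    print(f"{volume / 1e9:>5.0f}B tokens: Devstral ${devstral:,.2f} "
          f"vs GPT-4o-mini ${gpt:,.2f} (savings ${gpt - devstral:,.2f})")
```

Running this reproduces the tiers in the pricing analysis ($200 vs $375 at 1B tokens, and so on); shifting `input_share` toward input-heavy traffic narrows the gap, since the models' input prices are closer than their output prices.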
Bottom Line
Choose Devstral Small 1.1 if: you need lower token costs at scale (about half the combined token price on a 50/50 input/output split), higher faithfulness (score 4 vs 3), or a model optimized for software-engineering agent workflows per its description. Choose GPT-4o-mini if: you prioritize safety and persona consistency (safety calibration 4 vs 2, persona consistency 4 vs 2), need better agentic planning (3 vs 2), or require multimodal inputs (text+image+file → text). If your workload is classification, tool calling, long-context retrieval, or multilingual output, both models tied in our tests — pick by cost, modality, or integration requirements.
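If you prefer the rubric in code form, here is an illustrative helper that encodes the bullet points above. The function name, flags, and the 1B-token threshold are assumptions made for this sketch, not anything we publish.

```python
# Illustrative decision helper encoding the rubric above. Flag names and the
# 1B-token/month threshold are assumptions for this sketch; agentic-planning
# needs fall under the safety/persona branch here for brevity.

def pick_model(needs_multimodal: bool,
               safety_critical: bool,
               faithfulness_critical: bool,
               monthly_tokens: float) -> str:
    if needs_multimodal or safety_critical:
        # Multimodal inputs; higher safety-calibration and persona scores (4 vs 2).
        return "gpt-4o-mini"
    if faithfulness_critical or monthly_tokens >= 1e9:
        # Higher faithfulness score (4 vs 3); roughly half the blended token price.
        return "devstral-small-1.1"
    # Tied categories: classification, tool calling, long context, multilingual.
    return "either: decide by cost, modality, or integration requirements"
```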
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
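For readers curious what a 1–5 judge call can look like in practice, here is a minimal sketch assuming the OpenAI Python SDK. The rubric prompt and judge model are illustrative stand-ins, not the exact setup behind our published scores.

```python
# Minimal LLM-judge sketch (assumes the OpenAI Python SDK: pip install openai).
# The rubric prompt and judge model are illustrative stand-ins, not the exact
# setup behind the scores on this page.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """\
Score the candidate answer from 1 (worst) to 5 (best) for the task below.
Reply with a single integer and nothing else.

Task: {task}
Candidate answer: {answer}"""

def judge_score(task: str, answer: str, judge_model: str = "gpt-4o") -> int:
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # keep scoring as deterministic as possible
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())

# Example: judge_score("Summarize the memo in one sentence.", candidate_summary)
```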