Devstral Small 1.1 vs GPT-4o-mini
For most production chat and agentic workflows, pick GPT-4o-mini: it wins more benchmark categories in our testing (safety calibration 4 vs 2, persona consistency 4 vs 2, agentic planning 3 vs 2) and supports multimodal inputs. Choose Devstral Small 1.1 when cost and faithfulness matter: it costs roughly half as much on typical token mixes and scores higher on faithfulness (4 vs 3).
Devstral Small 1.1 (Mistral): input $0.100/MTok, output $0.300/MTok
GPT-4o-mini (OpenAI): input $0.150/MTok, output $0.600/MTok
Benchmark Analysis
Across our 12-test suite the two models tie on eight tasks, GPT-4o-mini wins three, and Devstral Small 1.1 wins one. Detailed walkthrough (our scores):
- Wins for GPT-4o-mini: safety calibration 4 vs 2 (GPT rank 6 of 55; Devstral rank 12 of 55), which implies GPT-4o-mini is substantially better at refusing harmful requests while permitting legitimate ones; persona consistency 4 vs 2 (GPT rank 38 of 53; Devstral rank 51 of 53), where GPT-4o-mini better maintains character and resists prompt injection; and agentic planning 3 vs 2 (GPT rank 42 of 54; Devstral rank 53 of 54), where GPT-4o-mini showed stronger goal decomposition and failure recovery in our agentic tests.
- Win for Devstral Small 1.1: faithfulness 4 vs 3 (Devstral rank 34 of 55; GPT rank 52 of 55). In practice this means Devstral is more likely to stick to source material and avoid mild hallucinations on our tasks.
- Ties (equal scores): structured output 4/4, strategic analysis 2/2, constrained rewriting 3/3, creative problem solving 2/2, tool calling 4/4, classification 4/4, long context 4/4, multilingual 4/4. For these, both models performed similarly in our testing: e.g., both placed 1st in classification (alongside 29 other models), and both matched on tool calling and long-context retrieval at 30K+ tokens.
- External math benchmarks (Epoch AI): GPT-4o-mini posts 52.6% on MATH Level 5 and 6.9% on AIME 2025. These external scores are supplementary data points that reflect task-specific math performance as reported by Epoch AI, not our internal 1–5 scores.

In short: GPT-4o-mini is stronger where safety, persona consistency, and resilient planning matter; Devstral is measurably more faithful and substantially cheaper. Many common tasks (classification, tool calling, long context, multilingual output) were ties in our tests, so cost and modality become the deciding factors for those use cases.
Pricing Analysis
Costs are quoted per million tokens (MTok). Devstral Small 1.1: input $0.10/MTok, output $0.30/MTok. GPT-4o-mini: input $0.15/MTok, output $0.60/MTok. Assuming a 50/50 split of input/output tokens, the blended price is $0.20/MTok for Devstral and $0.375/MTok for GPT-4o-mini, so total monthly costs are: 1B tokens: Devstral $200 vs GPT-4o-mini $375; 10B tokens: Devstral $2,000 vs GPT-4o-mini $3,750; 100B tokens: Devstral $20,000 vs GPT-4o-mini $37,500. The gap grows linearly with usage; teams running billions of tokens/month should care (savings of $1,750 at 10B tokens, $17,500 at 100B). If your app is latency- or feature-constrained rather than token-cost constrained, GPT-4o-mini's higher price may be justified by its safety and persona gains.
Real-World Cost Comparison
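To make the arithmetic concrete, here is a minimal cost-calculator sketch using the per-MTok prices listed above. The 50/50 input/output split mirrors the assumption in our pricing analysis; the model keys and the `input_share` parameter are illustrative, so adjust them for your real traffic mix.

```python
# Minimal sketch of the blended-cost arithmetic above. The model keys are
# illustrative labels, not provider API identifiers.

PRICES_PER_MTOK = {
    # model: (input $/MTok, output $/MTok)
    "devstral-small-1.1": (0.10, 0.30),
    "gpt-4o-mini": (0.15, 0.60),
}

def monthly_cost(model: str, total_tokens: float, input_share: float = 0.5) -> float:
    """Blended monthly cost in dollars for a given total token volume."""
    input_price, output_price = PRICES_PER_MTOK[model]
    mtok = total_tokens / 1_000_000
    return mtok * (input_share * input_price + (1 - input_share) * output_price)

for volume in (1e9, 10e9, 100e9):
    devstral = monthly_cost("devstral-small-1.1", volume)
    gpt = monthly_cost("gpt-4o-mini", volume)
    print(f"{volume / 1e9:>5.0f}B tokens: Devstral ${devstral:,.2f} "
          f"vs GPT-4o-mini ${gpt:,.2f} (savings ${gpt - devstral:,.2f})")
```

Running this reproduces the tiers in the pricing analysis ($200 vs $375 at 1B tokens, and so on); shifting `input_share` toward input-heavy traffic narrows the gap, since the models' input prices are closer than their output prices.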
Bottom Line
Choose Devstral Small 1.1 if: you need lower token costs at scale (about half the combined token price on a 50/50 input/output split), higher faithfulness (score 4 vs 3), or a model optimized for software-engineering agent workflows per its description. Choose GPT-4o-mini if: you prioritize safety and persona consistency (safety calibration 4 vs 2, persona consistency 4 vs 2), need better agentic planning (3 vs 2), or require multimodal inputs (text+image+file → text). If your workload is classification, tool calling, long-context retrieval, or multilingual output, both models tied in our tests — pick by cost, modality, or integration requirements.
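If you prefer the rubric in code form, here is an illustrative helper that encodes the bullet points above. The function name, flags, and the 1B-token threshold are assumptions made for this sketch, not anything we publish.

```python
# Illustrative decision helper encoding the rubric above. Flag names and the
# 1B-token/month threshold are assumptions for this sketch; agentic-planning
# needs fall under the safety/persona branch here for brevity.

def pick_model(needs_multimodal: bool,
               safety_critical: bool,
               faithfulness_critical: bool,
               monthly_tokens: float) -> str:
    if needs_multimodal or safety_critical:
        # Multimodal inputs; higher safety-calibration and persona scores (4 vs 2).
        return "gpt-4o-mini"
    if faithfulness_critical or monthly_tokens >= 1e9:
        # Higher faithfulness score (4 vs 3); roughly half the blended token price.
        return "devstral-small-1.1"
    # Tied categories: classification, tool calling, long context, multilingual.
    return "either: decide by cost, modality, or integration requirements"
```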
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
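For readers curious what a 1–5 judge call can look like in practice, here is a minimal sketch assuming the OpenAI Python SDK. The rubric prompt and judge model are illustrative stand-ins, not the exact setup behind our published scores.

```python
# Minimal LLM-judge sketch (assumes the OpenAI Python SDK: pip install openai).
# The rubric prompt and judge model are illustrative stand-ins, not the exact
# setup behind the scores on this page.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """\
Score the candidate answer from 1 (worst) to 5 (best) for the task below.
Reply with a single integer and nothing else.

Task: {task}
Candidate answer: {answer}"""

def judge_score(task: str, answer: str, judge_model: str = "gpt-4o") -> int:
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # keep scoring as deterministic as possible
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())

# Example: judge_score("Summarize the memo in one sentence.", candidate_summary)
```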