Devstral Small 1.1 vs o4 Mini
o4 Mini is the stronger general-purpose model, outscoring Devstral Small 1.1 on 9 of 12 benchmarks in our testing — including tool calling (5 vs 4), agentic planning (4 vs 2), and strategic analysis (5 vs 2). Devstral Small 1.1 wins only on safety calibration (2 vs 1) and ties on classification and constrained rewriting. The tradeoff is real: o4 Mini costs $4.40/M output tokens versus $0.30/M for Devstral Small 1.1 — nearly 15x more expensive — so Devstral Small 1.1 earns its place for high-volume, cost-sensitive coding pipelines where its SWE-focused design can compensate for lower general capability scores.
Devstral Small 1.1 (Mistral): $0.10/MTok input, $0.30/MTok output
o4 Mini (OpenAI): $1.10/MTok input, $4.40/MTok output
Benchmark Analysis
Our 12-test internal benchmark suite shows o4 Mini ahead on 9 tests, tied on 2, and behind on 1.
Where o4 Mini wins decisively:
- Strategic analysis: 5 vs 2. o4 Mini ties for 1st among 54 models; Devstral Small 1.1 ranks 44th. This gap is substantial — Devstral Small 1.1's score of 2 sits below the 25th percentile (p25 = 3) for this test, meaning it underperforms the majority of models we track on nuanced tradeoff reasoning.
- Agentic planning: 4 vs 2. o4 Mini ranks 16th of 54; Devstral Small 1.1 ranks 53rd of 54 — near the bottom of the entire field. For goal decomposition and failure recovery in autonomous workflows, this is a major liability.
- Creative problem solving: 4 vs 2. o4 Mini ranks 9th of 54; Devstral Small 1.1 ranks 47th. Again, Devstral's score of 2 falls below the p25 threshold for this benchmark.
- Persona consistency: 5 vs 2. o4 Mini ties for 1st among 53 models; Devstral Small 1.1 ranks 51st. Not relevant to coding tasks, but critical for chat products and system-prompt-based applications.
- Tool calling: 5 vs 4. Both are above the median (p50 = 4), but o4 Mini ties for 1st among 54 models while Devstral Small 1.1 ranks 18th. In practice, o4 Mini's higher score means more reliable function selection and argument accuracy in tool-heavy pipelines.
- Faithfulness: 5 vs 4. o4 Mini ties for 1st among 55 models; Devstral Small 1.1 ranks 34th. For RAG and document Q&A, this matters.
- Long context: 5 vs 4. o4 Mini ties for 1st among 55 models; Devstral Small 1.1 ranks 38th. o4 Mini also carries a larger context window (200K vs 128K tokens).
- Structured output: 5 vs 4. o4 Mini ties for 1st among 54 models; Devstral Small 1.1 ranks 26th.
- Multilingual: 5 vs 4. o4 Mini ties for 1st among 55 models; Devstral Small 1.1 ranks 36th.
Where the models tie:
- Classification: both score 4, tied for 1st among 53 models (30 models share this score). No meaningful difference here.
- Constrained rewriting: both score 3, tied at rank 31 of 53. Neither excels at compression within hard character limits.
Where Devstral Small 1.1 wins:
- Safety calibration: 2 vs 1. Devstral Small 1.1 ranks 12th of 55 (tied with 19 others); o4 Mini ranks 32nd. Note that a score of 2 still sits at the median for this benchmark (p50 = 2), so this is less a Devstral strength and more an o4 Mini weakness. Teams building products where over-refusal or under-refusal is a compliance risk should factor this in.
External benchmarks (Epoch AI): o4 Mini scores 97.8% on MATH Level 5 (rank 2 of 14 models with this data, tied with 2 others) and 81.7% on AIME 2025 (rank 13 of 23, sole holder of that exact score). These place o4 Mini among the top math-capable models by third-party measure. Devstral Small 1.1 has no external benchmark scores in our dataset, so no direct comparison is possible on those dimensions.
Pricing Analysis
Devstral Small 1.1 costs $0.10/M input tokens and $0.30/M output tokens. o4 Mini costs $1.10/M input and $4.40/M output — 11x more on input and 14.7x more on output.
At 1M output tokens/month: Devstral Small 1.1 costs $0.30 vs o4 Mini's $4.40 — a $4.10 difference that's negligible for most teams.
At 10M output tokens/month: $3 vs $44 — a $41 monthly gap that starts mattering for production workloads.
At 100M output tokens/month: $30 vs $440 — a $410/month difference that makes model selection a meaningful budget decision at scale.
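The tier math above follows directly from the listed output rates. A minimal sketch (output tokens only, ignoring input costs and reasoning overhead):

```python
# Listed output rates in USD per million tokens (MTok).
RATES = {"Devstral Small 1.1": 0.30, "o4 Mini": 4.40}

def monthly_cost(output_tokens: int, rate_per_mtok: float) -> float:
    """Cost in USD for a given monthly output-token volume."""
    return output_tokens / 1_000_000 * rate_per_mtok

for volume in (1_000_000, 10_000_000, 100_000_000):
    devstral = monthly_cost(volume, RATES["Devstral Small 1.1"])
    o4_mini = monthly_cost(volume, RATES["o4 Mini"])
    print(f"{volume:>11,} tokens: ${devstral:,.2f} vs ${o4_mini:,.2f}")
```

Input-token costs shift the totals slightly but not the ratio, since the input gap (11x) is close to the output gap (14.7x).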
Developers running high-throughput agentic pipelines, code generation at scale, or multi-turn chat products should treat this cost gap as a first-order concern. o4 Mini also has quirks worth noting for API users: it consumes reasoning tokens (billed as output), enforces a minimum max_completion_tokens of 1,000, and works best with max_completion_tokens set well above that floor — which can push actual costs higher than the listed rate suggests when reasoning overhead is significant.
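Reasoning overhead can be folded into a back-of-envelope effective rate. The 0.6 overhead ratio below is an illustrative assumption, not a measured value:

```python
def effective_output_rate(listed_rate: float, reasoning_overhead: float) -> float:
    """Effective $/MTok of *visible* output when reasoning tokens
    (billed as output) add `reasoning_overhead` extra billed tokens
    per visible token. E.g. overhead=0.6 means 0.6 reasoning tokens
    are billed for every token the user actually receives."""
    return listed_rate * (1 + reasoning_overhead)

# Hypothetical: if o4 Mini emits 0.6 reasoning tokens per visible
# output token, its listed $4.40/MTok behaves like ~$7.04/MTok.
print(effective_output_rate(4.40, 0.6))
```

Actual overhead varies widely with task difficulty and reasoning-effort settings, so teams should measure it on their own traffic before budgeting.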
Bottom Line
Choose Devstral Small 1.1 if: you're running high-volume code generation or SWE agent pipelines where cost is a primary constraint — at $0.30/M output tokens it's 14.7x cheaper than o4 Mini, and its description indicates it was purpose-built for software engineering agents. It also edges out o4 Mini on safety calibration, which matters if refusal behavior is a product requirement. It accepts text input only, so it fits text-only coding workflows cleanly.
Choose o4 Mini if: you need a general-purpose reasoning model that performs reliably across agentic planning, strategic analysis, tool calling, long context, and math. Its 81.7% AIME 2025 score (Epoch AI) and 97.8% MATH Level 5 score confirm strong quantitative reasoning beyond our internal tests. It also supports image and file inputs — useful for multimodal tasks Devstral Small 1.1 cannot handle at all. Budget for the higher token cost and set max_completion_tokens high to avoid hitting its minimum threshold quirk.
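For the max_completion_tokens quirk, a request sketch helps. Parameter names follow OpenAI's Chat Completions interface; the model id, prompt, and token limit below are illustrative assumptions:

```python
# Hypothetical request payload for o4 Mini. Because reasoning tokens
# count against max_completion_tokens, leave generous headroom well
# above the 1,000-token minimum or responses may come back truncated.
payload = {
    "model": "o4-mini",
    "messages": [{"role": "user", "content": "Summarize this diff."}],
    "max_completion_tokens": 8000,  # headroom for reasoning + answer
}

assert payload["max_completion_tokens"] >= 1000  # stay above the floor
```

Devstral Small 1.1 needs no such adjustment, since it does not spend hidden reasoning tokens against the completion budget.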
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.