Devstral Small 1.1 vs o3
o3 is the clear winner on breadth and depth, outscoring Devstral Small 1.1 on 9 of 12 benchmarks in our testing, with decisive leads on agentic planning (5 vs 2) and strategic analysis (5 vs 2) and a narrower edge in tool calling (5 vs 4). Devstral Small 1.1 wins only on classification (4 vs 3) and safety calibration (2 vs 1), and ties on long context. The cost gap is severe: o3 outputs tokens at $8/M versus Devstral Small 1.1's $0.30/M, a roughly 27x premium that only justifies itself for high-stakes tasks where o3's reasoning depth is genuinely required.
Pricing at a Glance
- Devstral Small 1.1 (Mistral): $0.10/MTok input, $0.30/MTok output
- o3 (OpenAI): $2.00/MTok input, $8.00/MTok output
Benchmark Analysis
Our 12-test benchmark suite gives o3 a decisive win: it scores higher on 9 tests, Devstral Small 1.1 wins 2, and they tie on 1.
Where o3 leads:
- Agentic planning: o3 scores 5, Devstral Small 1.1 scores 2. o3 ties for 1st among 54 models; Devstral Small 1.1 ranks 53rd of 54. This is the starkest gap — goal decomposition and failure recovery are core to autonomous agent workflows, and Devstral Small 1.1 is near the bottom of the field here.
- Strategic analysis: o3 scores 5, Devstral Small 1.1 scores 2. o3 ties for 1st of 54; Devstral Small 1.1 ranks 44th. For nuanced tradeoff reasoning with real numbers, o3 is in a different tier.
- Tool calling: o3 scores 5, Devstral Small 1.1 scores 4. o3 ties for 1st of 54; Devstral Small 1.1 shares a mid-field score (rank 18 of 54). In function-calling pipelines, o3's edge matters for complex argument sequencing.
- Creative problem solving: o3 scores 4 (rank 9 of 54), Devstral Small 1.1 scores 2 (rank 47 of 54). For non-obvious, feasible ideation, Devstral Small 1.1 is well below the field median.
- Persona consistency: o3 scores 5 (tied 1st of 53), Devstral Small 1.1 scores 2 (rank 51 of 53). Critical for chatbot and roleplay applications.
- Structured output: o3 scores 5 (tied 1st of 54), Devstral Small 1.1 scores 4 (tied rank 26 of 54). o3's JSON schema compliance is among the best tested.
- Faithfulness: o3 scores 5 (tied 1st of 55), Devstral Small 1.1 scores 4 (rank 34 of 55). With a field median of 5, o3 matches the top tier while Devstral Small 1.1 falls just below the median.
- Constrained rewriting: o3 scores 4 (rank 6 of 53), Devstral Small 1.1 scores 3 (rank 31 of 53). o3 compresses within hard character limits more reliably.
- Multilingual: o3 scores 5 (tied 1st of 55), Devstral Small 1.1 scores 4 (rank 36 of 55). Both are above average, but o3 hits the ceiling.
Where Devstral Small 1.1 leads:
- Classification: Devstral Small 1.1 scores 4 (tied 1st of 53 — shared with 29 others), o3 scores 3 (rank 31 of 53). This is a genuine win: Devstral Small 1.1 matches the field's best classifiers while o3 falls below the median.
- Safety calibration: Devstral Small 1.1 scores 2 (rank 12 of 55), o3 scores 1 (rank 32 of 55). Neither model excels here — the field median is 2 and the p75 is also 2 — but Devstral Small 1.1 edges o3 out.
Tied:
- Long context: Both score 4, both rank 38 of 55. Equivalent retrieval accuracy at 30K+ tokens.
External benchmarks (Epoch AI data for o3): On SWE-bench Verified, o3 scores 62.3%, placing it 9th of 12 models with scores in our dataset — near the bottom quartile (p25 = 61.1%) for that test. On MATH Level 5, o3 scores 97.8% (rank 2 of 14, tied with 2 others) — near the top of the tested field and well above the median of 94.2%. On AIME 2025, o3 scores 83.9% (rank 12 of 23), exactly at the field median. Devstral Small 1.1 has no external benchmark scores in our dataset. These third-party figures suggest o3 is a strong competition math model but not a standout on real-world GitHub issue resolution.
Pricing Analysis
Devstral Small 1.1 costs $0.10/M input and $0.30/M output. o3 costs $2.00/M input and $8.00/M output. At 1B output tokens/month, that's $300 vs $8,000: a $7,700 monthly difference. At 10B output tokens, you're looking at $3,000 vs $80,000. At 100B output tokens, $30,000 vs $800,000. The cost ratio (Devstral Small 1.1's output tokens cost under 4% of o3's) means the price-performance calculation depends entirely on task sensitivity. For high-volume coding pipelines, classification workloads, or any application where Devstral Small 1.1's scores are sufficient, the savings are enormous. For one-off complex reasoning tasks, such as deep strategic analysis, multi-step agentic workflows, or high-stakes tool-calling chains, o3's premium may be worth absorbing. Developers routing at scale should treat o3 as a specialist model for tasks that genuinely require top-tier reasoning, not a general-purpose default.
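The arithmetic above is easy to sanity-check in a few lines. This is a minimal sketch, not any provider's billing API: the price table and the `monthly_cost` helper are our own illustrative names, using the per-million-token rates quoted in this section.

```python
# Illustrative cost comparison using the published per-MTok rates.
# PRICES and monthly_cost are hypothetical names, not a real API.

PRICES = {  # USD per million tokens: (input, output)
    "devstral-small-1.1": (0.10, 0.30),
    "o3": (2.00, 8.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Monthly spend in USD for a volume expressed in millions of tokens."""
    price_in, price_out = PRICES[model]
    return input_mtok * price_in + output_mtok * price_out

# 1B output tokens = 1,000 MTok; input volume set to zero for simplicity.
cheap = monthly_cost("devstral-small-1.1", 0, 1_000)
premium = monthly_cost("o3", 0, 1_000)
print(f"Devstral: ${cheap:,.0f}  o3: ${premium:,.0f}  delta: ${premium - cheap:,.0f}")
```

Running this reproduces the $300 vs $8,000 comparison; swapping in real input/output volumes from your own logs gives a quick budget estimate before committing to either model.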
Bottom Line
Choose Devstral Small 1.1 if: You're running high-volume classification or routing pipelines where its top-tier classification score (4, tied 1st of 53) is sufficient, and cost is a primary constraint. At $0.30/M output tokens, it's one of the cheapest capable models available. It also suits applications where its slight edge in safety calibration matters and the task doesn't demand deep reasoning or agentic behavior. Its 131K context window and support for structured outputs, tool calling, and temperature control make it practical for many standard API workflows.
Choose o3 if: Your application involves agentic planning, complex multi-step tool use, strategic analysis, or creative reasoning — areas where o3 scores 5 and Devstral Small 1.1 scores 2. o3's multimodal input (text + image + file) also opens use cases Devstral Small 1.1 can't address. Its 200K context window and 100K max output tokens give it more headroom for long documents. At $8/M output, it's a deliberate investment — justified for tasks where reasoning quality directly affects outcomes, not for bulk inference.
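The "specialist, not default" routing policy described above can be sketched in a few lines. This is an assumption-laden illustration: the task labels and the `pick_model` function are hypothetical names we chose, not part of either provider's SDK.

```python
# Hypothetical routing sketch: send only reasoning-heavy task types to the
# premium model and default everything else to the cheap one. Task labels
# and function names are our own illustrative choices.

PREMIUM_TASKS = {"agentic_planning", "strategic_analysis", "complex_tool_chain"}

def pick_model(task_type: str) -> str:
    """Route a labeled task to a model tier based on the benchmark gaps above."""
    if task_type in PREMIUM_TASKS:
        return "o3"  # reasoning depth directly affects the outcome
    return "devstral-small-1.1"  # default for classification, routing, bulk work

print(pick_model("classification"))    # devstral-small-1.1
print(pick_model("agentic_planning"))  # o3
```

In practice the task label would come from a lightweight classifier or explicit caller metadata; the point is that the expensive model sits behind an allowlist rather than serving as the fallback.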
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.