Devstral 2 2512 vs Devstral Small 1.1
Devstral 2 2512 is the winner for high-quality agentic coding, long-context retrieval, and structured outputs, winning 8 of 12 benchmarks. Devstral Small 1.1 is the budget pick: it is cheaper per token and wins classification and safety calibration, so choose it if cost or safer refusals matter.
Pricing at a glance:
Devstral 2 2512 (mistral): Input $0.400/MTok, Output $2.00/MTok
Devstral Small 1.1 (mistral): Input $0.100/MTok, Output $0.300/MTok
modelpicker.net
Benchmark Analysis
Overview: In our 12-test suite Devstral 2 2512 wins 8 categories, Devstral Small 1.1 wins 2, and 2 are ties.

Key wins for Devstral 2 2512: structured_output 5 vs 4 (tied for 1st with 24 other models among 54 tested), constrained_rewriting 5 vs 3 (tied for 1st of 53), and long_context 5 vs 4 (tied for 1st of 55). These scores mean Devstral 2 2512 is substantially more reliable when you need strict JSON/schema adherence, need to compress content under hard limits, or retrieve from 30K+ token contexts (its context window is 262,144 tokens vs Small 1.1's 131,072). Agentic planning (4 vs 2) and creative problem solving (4 vs 2) also favor Devstral 2 2512, which helps on multi-step coding tasks and in proposing feasible, specific solutions.

Devstral Small 1.1 wins classification (4 vs 3, tied for 1st with 29 other models of 53 tested) and safety_calibration (2 vs 1, ranking 12 of 55 vs Devstral 2 2512's 32 of 55), indicating it is better at accurate routing and more conservative refusal behavior in our tests.

Ties: tool_calling (4 vs 4) and faithfulness (4 vs 4); the two models are comparable at selecting and formatting function calls and at sticking to source material.

Rankings context: Devstral 2 2512 is top-tier for structured_output, constrained_rewriting, long_context, and multilingual (5 vs 4, tied for 1st), while Devstral Small 1.1 ranks near the top for classification and noticeably better on safety calibration.

In practice: pick Devstral 2 2512 when you need robust schema outputs, long-context handling, and stronger agentic planning; pick Devstral Small 1.1 when you need the lowest cost per token or prioritize classification and safer refusals.
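The decision rule above can be sketched as a small routing helper. This is an illustrative sketch only: the `Requirements` fields, thresholds, and model identifiers are assumptions for the example, not part of any API. The one hard constraint taken from the comparison is Small 1.1's 131,072-token context window.

```python
# Hedged sketch of the selection rule described above. Field names and
# model identifier strings are illustrative assumptions, not a real API.
from dataclasses import dataclass


@dataclass
class Requirements:
    needs_strict_schema: bool = False  # strict JSON/schema adherence
    max_context_tokens: int = 0        # largest prompt you expect to send
    cost_sensitive: bool = False       # lowest per-token cost is the priority
    safety_critical: bool = False      # conservative refusals preferred


def pick_model(req: Requirements) -> str:
    # Hard constraint: Small 1.1's window is 131,072 tokens, so anything
    # larger forces Devstral 2 2512 (262,144-token window).
    if req.max_context_tokens > 131_072:
        return "devstral-2-2512"
    if req.needs_strict_schema:
        return "devstral-2-2512"
    # Small 1.1 wins classification and safety calibration, and is cheaper.
    if req.cost_sensitive or req.safety_critical:
        return "devstral-small-1.1"
    return "devstral-2-2512"  # default to the stronger all-rounder


print(pick_model(Requirements(max_context_tokens=200_000)))  # devstral-2-2512
print(pick_model(Requirements(cost_sensitive=True)))         # devstral-small-1.1
```

In a real deployment you would likely add per-request routing (e.g. send classification traffic to Small 1.1 and agentic coding sessions to 2 2512) rather than picking a single model globally.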
Pricing Analysis
Per-token rates: Devstral 2 2512 charges $0.40 per million input tokens and $2.00 per million output tokens; Devstral Small 1.1 charges $0.10 and $0.30. Using a 50/50 input/output split as a practical example, at 1,000 MTok/month total: Devstral 2 2512 costs about $1,200/month (500 MTok input × $0.40 = $200; 500 MTok output × $2.00 = $1,000) vs about $200/month for Devstral Small 1.1 (500 × $0.10 + 500 × $0.30). At 10,000 MTok/month the costs are roughly $12,000 vs $2,000; at 100,000 MTok/month, roughly $120,000 vs $20,000. Enterprises, high-volume APIs, and production coding agents should care about this gap: Devstral 2 2512 runs about 6× more expensive at a 50/50 split, and up to 6.7× on output-heavy workloads (the output rate is $2.00 vs $0.30 per MTok). Small teams, prototypes, and cost-sensitive products will save materially with Devstral Small 1.1.
Bottom Line
Choose Devstral 2 2512 if you need: high-fidelity structured outputs (5 vs 4), extreme long-context work (5 vs 4) with a 262,144-token window, constrained rewriting (5 vs 3), or stronger agentic planning and creative problem solving. Expect to pay substantially more ($0.40 input / $2.00 output per MTok). Choose Devstral Small 1.1 if you need: the lowest cost per token ($0.10/$0.30 per MTok), top classification performance (4 vs 3), better safety calibration (2 vs 1), or if you are building cost-sensitive classification/routing services or prototypes.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.