Devstral 2 2512 vs Devstral Small 1.1
Devstral 2 2512 is the winner for high-quality agentic coding, long-context retrieval, and structured outputs, winning 8 of 12 benchmarks. Devstral Small 1.1 is the budget pick: it is cheaper per token and wins classification and safety calibration, so choose it if cost or safer refusals matter.
Pricing at a glance:
Devstral 2 2512 (mistral): Input $0.400/MTok, Output $2.00/MTok
Devstral Small 1.1 (mistral): Input $0.100/MTok, Output $0.300/MTok
modelpicker.net
Benchmark Analysis
Overview: In our 12-test suite Devstral 2 2512 wins 8 categories, Devstral Small 1.1 wins 2, and 2 are ties.

Key wins for Devstral 2 2512: structured_output 5 vs 4 (tied for 1st with 24 other models among 54 tested), constrained_rewriting 5 vs 3 (tied for 1st of 53), and long_context 5 vs 4 (tied for 1st of 55). These scores mean Devstral 2 2512 is substantially more reliable when you need strict JSON/schema adherence, need to compress content under hard limits, or retrieve from 30K+ token contexts (its context window is 262,144 tokens vs Small 1.1's 131,072). Agentic planning (4 vs 2) and creative problem solving (4 vs 2) also favor Devstral 2 2512, which helps on multi-step coding tasks and in proposing feasible, specific solutions.

Devstral Small 1.1 wins classification (4 vs 3, tied for 1st with 29 other models of 53 tested) and safety_calibration (2 vs 1, ranking 12 of 55 vs Devstral 2 2512's 32 of 55), indicating it is better at accurate routing and more conservative refusal behavior in our tests.

Ties: tool_calling (4 vs 4) and faithfulness (4 vs 4); the two models are comparable at selecting and formatting function calls and at sticking to source material.

Rankings context: Devstral 2 2512 is top-tier for structured_output, constrained_rewriting, long_context, and multilingual (5 vs 4, tied for 1st), while Devstral Small 1.1 ranks near the top for classification and noticeably better on safety calibration.

In practice: pick Devstral 2 2512 when you need robust schema outputs, long-context handling, and stronger agentic planning; pick Devstral Small 1.1 when you need the lowest cost per token or prioritize classification and safer refusals.
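The decision rule above can be sketched as a small routing helper. This is an illustrative sketch only: the `Requirements` fields, thresholds, and model identifiers are assumptions for the example, not part of any API. The one hard constraint taken from the comparison is Small 1.1's 131,072-token context window.

```python
# Hedged sketch of the selection rule described above. Field names and
# model identifier strings are illustrative assumptions, not a real API.
from dataclasses import dataclass


@dataclass
class Requirements:
    needs_strict_schema: bool = False  # strict JSON/schema adherence
    max_context_tokens: int = 0        # largest prompt you expect to send
    cost_sensitive: bool = False       # lowest per-token cost is the priority
    safety_critical: bool = False      # conservative refusals preferred


def pick_model(req: Requirements) -> str:
    # Hard constraint: Small 1.1's window is 131,072 tokens, so anything
    # larger forces Devstral 2 2512 (262,144-token window).
    if req.max_context_tokens > 131_072:
        return "devstral-2-2512"
    if req.needs_strict_schema:
        return "devstral-2-2512"
    # Small 1.1 wins classification and safety calibration, and is cheaper.
    if req.cost_sensitive or req.safety_critical:
        return "devstral-small-1.1"
    return "devstral-2-2512"  # default to the stronger all-rounder


print(pick_model(Requirements(max_context_tokens=200_000)))  # devstral-2-2512
print(pick_model(Requirements(cost_sensitive=True)))         # devstral-small-1.1
```

In a real deployment you would likely add per-request routing (e.g. send classification traffic to Small 1.1 and agentic coding sessions to 2 2512) rather than picking a single model globally.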
Pricing Analysis
Per-token rates: Devstral 2 2512 charges $0.40 per million input tokens and $2.00 per million output tokens; Devstral Small 1.1 charges $0.10 and $0.30. Using a 50/50 input/output split as a practical example, at 1,000 MTok/month total: Devstral 2 2512 costs about $1,200/month (500 MTok input × $0.40 = $200; 500 MTok output × $2.00 = $1,000) vs about $200/month for Devstral Small 1.1 (500 × $0.10 + 500 × $0.30). At 10,000 MTok/month the costs are roughly $12,000 vs $2,000; at 100,000 MTok/month, roughly $120,000 vs $20,000. Enterprises, high-volume APIs, and production coding agents should care about this gap: Devstral 2 2512 runs about 6× more expensive at a 50/50 split, and up to 6.7× on output-heavy workloads (the output rate is $2.00 vs $0.30 per MTok). Small teams, prototypes, and cost-sensitive products will save materially with Devstral Small 1.1.
Bottom Line
Choose Devstral 2 2512 if you need: high-fidelity structured outputs (5 vs 4), extreme long-context work (5 vs 4) with a 262,144-token window, constrained rewriting (5 vs 3), or stronger agentic planning and creative problem solving. Expect to pay substantially more ($0.40 input / $2.00 output per MTok). Choose Devstral Small 1.1 if you need: the lowest cost per token ($0.10/$0.30 per MTok), top classification performance (4 vs 3), better safety calibration (2 vs 1), or if you are building cost-sensitive classification/routing services or prototypes.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.