Devstral Small 1.1 vs Mistral Small 3.2 24B
Mistral Small 3.2 24B is the better pick for agent-style workflows and tight-format rewriting: it wins 3 of our 12 benchmarks, Devstral Small 1.1 wins 2, and the remaining 7 tie. Devstral Small 1.1 takes classification and safety calibration, but costs more (combined $0.40 per 1M tokens vs $0.275 per 1M for 3.2 24B).
Pricing at a Glance

Model                   Input         Output
Devstral Small 1.1      $0.100/MTok   $0.300/MTok
Mistral Small 3.2 24B   $0.075/MTok   $0.200/MTok
Benchmark Analysis
We compare both models across our 12-test suite (scores 1–5) and report ranks where available.

Devstral Small 1.1 wins two tests:
- classification (4 vs 3): Devstral ties for 1st of 53 models in our tests (with 29 others), making it top-tier for routing, labeling, and triage tasks.
- safety_calibration (2 vs 1): Devstral ranks 12 of 55 vs Mistral's 32 of 55, indicating it is more likely to refuse harmful prompts and permit legitimate ones in our testing.

Mistral Small 3.2 24B wins three:
- constrained_rewriting (4 vs 3): Mistral ranks 6 of 53, one of the best in our pool, making 3.2 24B superior for compression and strict character- or byte-limited transformations.
- agentic_planning (4 vs 2): rank 16 of 54 vs Devstral's 53 of 54, a decisive gap for multi-step decomposition and failure recovery.
- persona_consistency (3 vs 2): rank 45 vs Devstral's 51; Mistral holds a character or role more reliably.

The remaining seven tests tie: structured_output (4/4, both rank 26/54), tool_calling (4/4, both 18/54), faithfulness (4/4, both 34/55), long_context (4/4, both 38/55), multilingual (4/4, both 36/55), strategic_analysis (2/2, both 44/54), and creative_problem_solving (2/2, both 47/54). In practice, for schema adherence, function selection, 30K-token retrieval, multilingual parity, and staying close to sources, both models perform similarly in our tests.
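The head-to-head tally above can be reproduced from the per-test scores. A minimal sketch, using the 1–5 judge scores reported in this comparison (the dictionaries and helper name are ours, for illustration only):

```python
# Per-test judge scores (1-5) as reported in this comparison.
devstral = {
    "classification": 4, "safety_calibration": 2, "constrained_rewriting": 3,
    "persona_consistency": 2, "agentic_planning": 2, "structured_output": 4,
    "tool_calling": 4, "faithfulness": 4, "long_context": 4,
    "multilingual": 4, "strategic_analysis": 2, "creative_problem_solving": 2,
}
mistral = {
    "classification": 3, "safety_calibration": 1, "constrained_rewriting": 4,
    "persona_consistency": 3, "agentic_planning": 4, "structured_output": 4,
    "tool_calling": 4, "faithfulness": 4, "long_context": 4,
    "multilingual": 4, "strategic_analysis": 2, "creative_problem_solving": 2,
}

def head_to_head(a, b):
    """Split benchmark names into wins for a, wins for b, and ties."""
    wins = sorted(k for k in a if a[k] > b[k])
    losses = sorted(k for k in a if a[k] < b[k])
    ties = sorted(k for k in a if a[k] == b[k])
    return wins, losses, ties

wins, losses, ties = head_to_head(devstral, mistral)
print(len(wins), len(losses), len(ties))  # prints: 2 3 7
```

This is only a bookkeeping sketch; ties on score can still hide rank differences (e.g. persona_consistency ranks differ even where scores are close), which is why the analysis above reports ranks alongside scores.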
Pricing Analysis
Pricing per million tokens (input+output combined): Devstral Small 1.1 = $0.10 (input) + $0.30 (output) = $0.40 per 1M tokens. Mistral Small 3.2 24B = $0.075 + $0.20 = $0.275 per 1M tokens. At scale this gap matters: for 1M tokens/month you pay $0.40 vs $0.275 (save $0.125, 31.25% cheaper with 3.2 24B). For 10M tokens/month the monthly bill is $4.00 vs $2.75 (save $1.25). For 100M tokens/month it's $40.00 vs $27.50 (save $12.50). Teams doing large-volume inference, high-throughput routing, or multi-tenant APIs should prefer the lower per-token cost of Mistral Small 3.2 24B; teams where small absolute cost differences don't matter but classification accuracy or stricter safety refusals matter may accept Devstral's higher price.
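The combined-price math above is easy to parameterize for your own volume. A minimal sketch, assuming this page's convention of summing the input and output per-MTok rates and applying the combined rate to total monthly volume (the function and rate table are ours):

```python
# Published per-MTok rates from this comparison: (input $/MTok, output $/MTok).
RATES = {
    "Devstral Small 1.1": (0.100, 0.300),
    "Mistral Small 3.2 24B": (0.075, 0.200),
}

def monthly_cost(in_rate, out_rate, million_tokens):
    """Combined rate applied to total monthly token volume,
    following this page's convention of summing both rates."""
    return (in_rate + out_rate) * million_tokens

for name, (i, o) in RATES.items():
    print(f"{name}: ${monthly_cost(i, o, 10):.2f}/mo at 10M tokens")
```

At 10M tokens/month this reproduces the $4.00 vs $2.75 figures above; note that if your workload is input-heavy rather than an even split, the gap shifts, since Devstral's output rate carries most of its premium.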
Bottom Line
Choose Devstral Small 1.1 if you need best-in-class classification and safer default refusals in production: it scores 4/5 on classification (tied for 1st of 53) and 2/5 on safety_calibration (rank 12/55) and offers a slightly larger context window (131,072 vs 128,000). Choose Mistral Small 3.2 24B if you need better agentic planning, persona consistency, or constrained rewriting at lower cost: it scores 4/5 on agentic_planning (rank 16/54), 4/5 on constrained_rewriting (rank 6/53), and costs $0.275 per combined 1M tokens vs $0.40 for Devstral.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.