Devstral Small 1.1 vs Mistral Small 3.2 24B

Mistral Small 3.2 24B is the better pick for agent-style workflows and tight-format rewriting — it wins 3 of 12 benchmarks in our testing. Devstral Small 1.1 wins classification and safety calibration, but costs more (combined $0.40 per M tokens vs $0.275 per M for 3.2 24B).

Mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net

Mistral

Mistral Small 3.2 24B

Overall
3.25/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
4/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.075/MTok

Output

$0.200/MTok

Context Window: 128K


Benchmark Analysis

We compare both models across our 12-test suite (scores 1–5) and report ranks where available.

Devstral Small 1.1 wins two tests: classification (4 vs 3) and safety calibration (2 vs 1). On classification, Devstral ties for 1st of 53 models in our tests (tied with 29 others), making it top-tier for routing, labeling, and triage. On safety calibration, Devstral ranks 12 of 55 vs Mistral's 32 of 55, indicating it is more likely to refuse harmful prompts while permitting legitimate ones in our testing.

Mistral Small 3.2 24B wins three tests: constrained rewriting (4 vs 3), persona consistency (3 vs 2), and agentic planning (4 vs 2). Constrained rewriting is a clear Mistral advantage: it ranks 6 of 53, one of the best in our pool, making 3.2 24B the better choice for compression and strict character- or byte-limited transformations. Agentic planning (rank 16 of 54 for 3.2 24B vs 53 of 54 for Devstral) and persona consistency (rank 45 vs Devstral's 51) show Mistral outperforming Devstral at multi-step decomposition, failure recovery, and holding a character or role.

The remaining seven tests tie: structured output (4/4, both rank 26/54), tool calling (4/4, both rank 18/54), faithfulness (4/4, both rank 34/55), long context (4/4, both rank 38/55), multilingual (4/4, both rank 36/55), strategic analysis (2/2, both rank 44/54), and creative problem solving (2/2, both rank 47/54). In practice, that means both models perform similarly in our tests on schema adherence, function selection, retrieval across multi-30K-token contexts, multilingual parity, and staying close to sources.

Benchmark | Devstral Small 1.1 | Mistral Small 3.2 24B
Faithfulness | 4/5 | 4/5
Long Context | 4/5 | 4/5
Multilingual | 4/5 | 4/5
Tool Calling | 4/5 | 4/5
Classification | 4/5 | 3/5
Agentic Planning | 2/5 | 4/5
Structured Output | 4/5 | 4/5
Safety Calibration | 2/5 | 1/5
Strategic Analysis | 2/5 | 2/5
Persona Consistency | 2/5 | 3/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 2/5
Summary | 2 wins | 3 wins
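The win/loss/tie tally above can be reproduced directly from the 1–5 scores. This is an illustrative sketch, not the site's actual tooling; the dictionaries simply restate the scores from the table:

```python
# Benchmark scores from the comparison table (1-5 scale).
devstral = {
    "faithfulness": 4, "long_context": 4, "multilingual": 4,
    "tool_calling": 4, "classification": 4, "agentic_planning": 2,
    "structured_output": 4, "safety_calibration": 2,
    "strategic_analysis": 2, "persona_consistency": 2,
    "constrained_rewriting": 3, "creative_problem_solving": 2,
}
mistral = {
    "faithfulness": 4, "long_context": 4, "multilingual": 4,
    "tool_calling": 4, "classification": 3, "agentic_planning": 4,
    "structured_output": 4, "safety_calibration": 1,
    "strategic_analysis": 2, "persona_consistency": 3,
    "constrained_rewriting": 4, "creative_problem_solving": 2,
}

# Tally which model scores higher on each benchmark.
devstral_wins = [k for k in devstral if devstral[k] > mistral[k]]
mistral_wins = [k for k in devstral if mistral[k] > devstral[k]]
ties = [k for k in devstral if devstral[k] == mistral[k]]

print(len(devstral_wins), len(mistral_wins), len(ties))  # 2 3 7
```

This confirms the summary row: 2 Devstral wins, 3 Mistral wins, 7 ties.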

Pricing Analysis

Pricing per million tokens (input + output combined): Devstral Small 1.1 = $0.10 (input) + $0.30 (output) = $0.40 per 1M tokens. Mistral Small 3.2 24B = $0.075 + $0.20 = $0.275 per 1M tokens.

At scale this gap matters. At 1M tokens/month you pay $0.40 vs $0.275 (3.2 24B saves $0.125, 31.25% cheaper). At 10M tokens/month the monthly bill is $4.00 vs $2.75 (save $1.25); at 100M tokens/month it's $40.00 vs $27.50 (save $12.50). Teams doing large-volume inference, high-throughput routing, or multi-tenant APIs should prefer the lower per-token cost of Mistral Small 3.2 24B; teams for whom small absolute cost differences don't matter, but classification accuracy or stricter safety refusals do, may accept Devstral's higher price.
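The arithmetic above can be sketched in a few lines. This is an illustrative calculator, not an official pricing tool; it follows the article's convention that the "combined" rate is simply the input rate plus the output rate per million tokens:

```python
# Combined $/MTok rate, following the article's input+output convention.
def combined_rate(input_per_mtok: float, output_per_mtok: float) -> float:
    return input_per_mtok + output_per_mtok

# Monthly cost for a given token volume at a combined $/MTok rate.
def monthly_cost(tokens_per_month: int, rate_per_mtok: float) -> float:
    return tokens_per_month / 1_000_000 * rate_per_mtok

devstral = combined_rate(0.10, 0.30)   # $0.40 per 1M tokens
mistral = combined_rate(0.075, 0.20)   # $0.275 per 1M tokens

for volume in (1_000_000, 10_000_000, 100_000_000):
    saving = monthly_cost(volume, devstral) - monthly_cost(volume, mistral)
    print(f"{volume:>11,} tok/mo: "
          f"${monthly_cost(volume, devstral):.3f} vs "
          f"${monthly_cost(volume, mistral):.3f} (save ${saving:.3f})")
```

The relative discount, (0.40 - 0.275) / 0.40, works out to the 31.25% figure quoted above.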

Real-World Cost Comparison

Task | Devstral Small 1.1 | Mistral Small 3.2 24B
Chat response | <$0.001 | <$0.001
Blog post | <$0.001 | <$0.001
Document batch | $0.017 | $0.011
Pipeline run | $0.170 | $0.115

Bottom Line

Choose Devstral Small 1.1 if you need best-in-class classification and safer default refusals in production: it scores 4/5 on classification (tied for 1st of 53), 2/5 on safety calibration (rank 12/55), and offers a slightly larger context window (131,072 vs 128,000 tokens). Choose Mistral Small 3.2 24B if you need better agentic planning, persona consistency, or constrained rewriting at lower cost: it scores 4/5 on agentic planning (rank 16/54), 4/5 on constrained rewriting (rank 6/53), and costs $0.275 per combined 1M tokens vs $0.40 for Devstral.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions