Devstral Medium vs Mistral Small 3.2 24B

For most developer and API use cases, Mistral Small 3.2 24B is the practical winner: it beats Devstral Medium on tool calling and constrained rewriting while costing roughly one-tenth as much per token. Devstral Medium wins only on classification in our tests and may still be chosen when that single metric matters, but it comes at substantially higher input/output cost.

Mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K

modelpicker.net

Mistral

Mistral Small 3.2 24B

Overall
3.25/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
4/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.075/MTok

Output

$0.200/MTok

Context Window: 128K


Benchmark Analysis

Across our 12-test suite, the comparison breaks down as follows (scores are our 1–5 proxies).

Devstral Medium (A) wins classification (A=4 vs B=3); in our testing it is tied for 1st on classification with 29 other models out of 53 tested, which matters for routing and categorization tasks.

Mistral Small 3.2 24B (B) wins constrained rewriting (A=3 vs B=4) and tool calling (A=3 vs B=4). For tool calling, B ranks 18 of 54 (many models share that score) versus A's 47 of 54, a meaningful advantage for function selection, argument accuracy, and call sequencing. On constrained rewriting, B ranks 6 of 53 versus A's 31 of 53, so B is substantially better when you need tight character limits or strict compression.

The remaining tests tie: structured output (4/4, both rank 26 of 54), strategic analysis (2/2, both rank ~44), creative problem solving (2/2), faithfulness (4/4), long context (4/4), safety calibration (1/1), persona consistency (3/3), agentic planning (4/4), and multilingual (4/4). Long-context parity (both score 4) means neither model has a distinct edge for retrieval at 30K+ tokens in our suite. Safety calibration is low for both (1/5), so both models can be permissive on harmful requests in our tests.

In short: B is better for function calling and tight-format rewriting, A is slightly better at classification, and most other capabilities are effectively tied in our testing.

Benchmark | Devstral Medium | Mistral Small 3.2 24B
Faithfulness | 4/5 | 4/5
Long Context | 4/5 | 4/5
Multilingual | 4/5 | 4/5
Tool Calling | 3/5 | 4/5
Classification | 4/5 | 3/5
Agentic Planning | 4/5 | 4/5
Structured Output | 4/5 | 4/5
Safety Calibration | 1/5 | 1/5
Strategic Analysis | 2/5 | 2/5
Persona Consistency | 3/5 | 3/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 2/5
Summary | 1 win | 2 wins

Pricing Analysis

Costs are materially different. Per the listed rates, Devstral Medium charges $0.40 input / $2.00 output per million tokens; Mistral Small 3.2 24B charges $0.075 input / $0.20 output per million. Assuming a 50/50 input/output token split: at 1M tokens/mo, Devstral costs $1.20 (input $0.20 + output $1.00) vs Mistral's $0.14 (input $0.04 + output $0.10). At 10M tokens/mo: Devstral $12.00 vs Mistral $1.38. At 100M tokens/mo: Devstral $120.00 vs Mistral $13.75. The gap is roughly 9× at a 50/50 split (10× on output tokens alone), so teams with sustained high-volume inference, such as startups, SaaS products, or heavy batch workflows, should prefer Mistral Small 3.2 24B unless Devstral's specific classification edge justifies the cost.
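The arithmetic above can be sketched in a few lines. This is a rough estimator using the per-MTok rates from this comparison (not official pricing pages) and an assumed 50/50 input/output split; adjust `input_share` to match your real traffic mix.

```python
# Rough monthly-cost estimator. Rates are the $/MTok figures from this
# comparison (assumptions, not official pricing): (input, output).
RATES = {
    "Devstral Medium": (0.400, 2.00),
    "Mistral Small 3.2 24B": (0.075, 0.200),
}

def monthly_cost(model: str, tokens_per_month: float, input_share: float = 0.5) -> float:
    """Estimate monthly spend in dollars for a given total token volume."""
    in_rate, out_rate = RATES[model]
    millions = tokens_per_month / 1_000_000
    return millions * (input_share * in_rate + (1 - input_share) * out_rate)

for volume in (1_000_000, 10_000_000, 100_000_000):
    dev = monthly_cost("Devstral Medium", volume)
    small = monthly_cost("Mistral Small 3.2 24B", volume)
    print(f"{volume / 1_000_000:>5.0f}M tokens/mo: Devstral ${dev:,.2f} vs Small ${small:,.2f}")
```

At a 50/50 split this reproduces the figures above ($1.20 vs ~$0.14 per million tokens); note that an output-heavy workload pushes the ratio toward 10×, while an input-heavy one pulls it toward 5×.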

Real-World Cost Comparison

Task | Devstral Medium | Mistral Small 3.2 24B
Chat response | $0.0011 | <$0.001
Blog post | $0.0042 | <$0.001
Document batch | $0.108 | $0.011
Pipeline run | $1.08 | $0.115

Bottom Line

Choose Mistral Small 3.2 24B if: you need low-cost production inference, accurate function/tool calling, or reliable constrained rewriting (it wins tool calling and constrained rewriting and is far cheaper at $0.075 input / $0.20 output per million tokens). Choose Devstral Medium if: your product prioritizes classification quality (it scores 4 vs 3 and is tied for 1st on classification in our tests) and you can justify roughly 10× higher input/output spend for that single advantage. If you need a balanced generalist at lower cost, pick Mistral Small 3.2 24B; if classification routing is mission-critical and worth the expense, pick Devstral Medium.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions