Devstral Medium vs Mistral Small 3.2 24B
For most developer and API use cases, Mistral Small 3.2 24B is the practical winner: it beats Devstral Medium on tool calling and constrained rewriting while costing roughly 5-10× less per token (5.3× on input, 10× on output). Devstral Medium wins only on classification in our tests and may still be chosen when that single metric matters, but it comes at substantially higher per-token cost.
Pricing (via modelpicker.net)

Model                    Input          Output
Devstral Medium          $0.400/MTok    $2.00/MTok
Mistral Small 3.2 24B    $0.075/MTok    $0.200/MTok
Benchmark Analysis
Across our 12-test suite, comparisons break down as follows (scores are our 1-5 proxies).

Devstral Medium (A) wins classification: A=4 vs B=3. In our testing A is tied for 1st on classification (with 29 other models out of 53 tested), which matters for routing and categorization tasks.

Mistral Small 3.2 24B (B) wins constrained_rewriting (A=3 vs B=4) and tool_calling (A=3 vs B=4). For tool calling, B ranks 18 of 54 (many models share that score) versus A's 47 of 54, a meaningful advantage for function selection, argument accuracy, and sequencing. On constrained rewriting, B ranks 6 of 53 versus A's 31 of 53, indicating B is substantially better when you need tight character limits or strict compression.

The remaining nine tests tie: structured_output (4/4, both rank 26 of 54), strategic_analysis (2/2, both rank ~44), creative_problem_solving (2/2), faithfulness (4/4), long_context (4/4), safety_calibration (1/1), persona_consistency (3/3), agentic_planning (4/4), and multilingual (4/4). Long-context parity (both score 4) means neither model has a distinct advantage for retrieval at 30K+ tokens in our suite. Safety calibration is low for both (1), so both models can be permissive on harmful requests in our tests.

In short: B is better for function calling and tight-format rewriting, A is slightly better at classification, and most other capabilities are effectively tied in our testing.
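The per-test breakdown above can be tabulated directly. A minimal sketch (plain Python, using the scores exactly as listed) that recovers each model's wins and the tie count:

```python
# Per-test scores (1-5) from the 12-benchmark suite.
# A = Devstral Medium, B = Mistral Small 3.2 24B.
scores = {
    "classification":           (4, 3),
    "constrained_rewriting":    (3, 4),
    "tool_calling":             (3, 4),
    "structured_output":        (4, 4),
    "strategic_analysis":       (2, 2),
    "creative_problem_solving": (2, 2),
    "faithfulness":             (4, 4),
    "long_context":             (4, 4),
    "safety_calibration":       (1, 1),
    "persona_consistency":      (3, 3),
    "agentic_planning":         (4, 4),
    "multilingual":             (4, 4),
}

# Partition the suite into A-wins, B-wins, and ties.
a_wins = [t for t, (a, b) in scores.items() if a > b]
b_wins = [t for t, (a, b) in scores.items() if b > a]
ties = [t for t, (a, b) in scores.items() if a == b]

print(a_wins)     # ['classification']
print(b_wins)     # ['constrained_rewriting', 'tool_calling']
print(len(ties))  # 9
```

Nine of twelve tests tie, which is why the head-to-head comes down to the two tool/rewriting wins for B against the single classification win for A.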
Pricing Analysis
Costs are materially different. Devstral Medium charges $0.40 input / $2.00 output per million tokens (MTok); Mistral Small 3.2 24B charges $0.075 input / $0.20 output per MTok. Assuming a 50/50 input/output token split: at 1M tokens/mo, Devstral costs $1.20 (input $0.20 + output $1.00) vs Mistral's $0.14 (input ~$0.04 + output $0.10). At 10M tokens/mo: Devstral $12.00 vs Mistral $1.38. At 100M tokens/mo: Devstral $120.00 vs Mistral $13.75. The gap is roughly 9× blended (10× on output tokens, 5.3× on input), so teams with sustained high-volume inference, such as startups, SaaS products, or heavy batch workflows, should prefer Mistral Small 3.2 24B unless Devstral's specific classification edge justifies the cost.
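The cost arithmetic above can be sketched as a small helper, assuming per-MTok prices and a configurable input/output split (50/50 here, matching the figures in the analysis):

```python
def monthly_cost(total_tokens, price_in, price_out, input_share=0.5):
    """Blended monthly cost in dollars, given per-MTok input/output prices."""
    mtok = total_tokens / 1_000_000
    return mtok * (input_share * price_in + (1 - input_share) * price_out)

DEVSTRAL = (0.400, 2.00)   # $/MTok: input, output
SMALL_32 = (0.075, 0.200)

for volume in (1_000_000, 10_000_000, 100_000_000):
    d = monthly_cost(volume, *DEVSTRAL)
    m = monthly_cost(volume, *SMALL_32)
    print(f"{volume:>12,} tok/mo: Devstral ${d:,.2f} vs Small 3.2 ${m:,.2f} "
          f"({d / m:.1f}x)")
```

Adjusting `input_share` matters in practice: input-heavy workloads (e.g. long-context retrieval) see closer to the 5.3× input gap, while output-heavy generation approaches the full 10×.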
Bottom Line
Choose Mistral Small 3.2 24B if: you need low-cost production inference, function/tool calling accuracy, or reliable constrained rewriting (it wins tool_calling and constrained_rewriting and is far cheaper: $0.075 input / $0.20 output per MTok). Choose Devstral Medium if: your product prioritizes classification quality (Devstral scores 4 vs 3 and is tied for 1st on classification in our tests) and you can justify roughly 5-10× higher per-token spend for that single advantage. If you need a balanced generalist at lower cost, pick Mistral Small 3.2 24B; if classification routing is mission-critical and worth the expense, pick Devstral Medium.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.