Mistral Small 4 vs Mistral Small 3.2 24B
In our testing, Mistral Small 4 is the better generalist: it wins 6 of our 12 benchmarks and is stronger at structured output, creative problem solving, multilingual output, and persona consistency. Mistral Small 3.2 24B wins on constrained rewriting and classification and is materially cheaper, so choose it when cost per token matters.
Pricing
- Mistral Small 4: $0.150/MTok input, $0.600/MTok output
- Mistral Small 3.2 24B: $0.075/MTok input, $0.200/MTok output
Benchmark Analysis
All benchmark statements below are from our testing. Overall: Mistral Small 4 (A) wins 6 categories, Mistral Small 3.2 24B (B) wins 2, and 4 are ties.

Where A wins:
- Structured output: A 5 vs B 4. A is tied for 1st with 24 other models out of 54 tested. This matters for strict JSON/schema outputs: Small 4 more reliably follows format constraints (see the validation sketch at the end of this section).
- Creative problem solving: A 4 vs B 2. A ranks 9 of 54 (21 models share this score); B ranks 47 of 54. For ideation or non-obvious solutions, Small 4 produces more specific, feasible ideas.
- Safety calibration: A 2 vs B 1. A ranks 12 of 55 (20 models share this score); B ranks 32 of 55. Small 4 better balances refusing harmful requests and permitting legitimate ones in our tests.
- Persona consistency: A 5 vs B 3. A is tied for 1st with 36 other models out of 53 tested; B ranks 45 of 53. If maintaining a character or role matters, Small 4 is stronger.
- Multilingual: A 5 vs B 4. A is tied for 1st with 34 other models out of 55 tested; B ranks 36 of 55. Small 4 gives higher-quality non-English outputs in our suite.
- Strategic analysis: A 4 vs B 2. A ranks 27 of 54; B ranks 44 of 54. For nuanced tradeoff reasoning with numbers, Small 4 is more capable.

Where B wins:
- Constrained rewriting: B 4 vs A 3. B ranks 6 of 53 (25 models share this score); A ranks 31 of 53. For tight character-limited rewriting (compression), Small 3.2 24B performed better.
- Classification: B 3 vs A 2. B ranks 31 of 53; A ranks 51 of 53. B is more reliable for routing and simple categorization.

Ties (identical scores in our tests):
- Tool calling: both 4; both rank 18 of 54.
- Faithfulness: both 4; both rank 34 of 55.
- Long context: both 4; both rank 38 of 55.
- Agentic planning: both 4; both rank 16 of 54.

These ties mean both models behave similarly for function selection, retrieval over long context, and goal decomposition in our suite. In short: Small 4 trades higher per-token cost for clearer wins in structured formats, creative problem solving, multilingual output, persona consistency, and safety calibration; Small 3.2 24B is cheaper and better at constrained rewriting and basic classification.
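To make the structured-output difference concrete, here is a minimal sketch of the kind of harness our structured-output benchmark implies: request JSON matching a schema, validate it, and retry on failure. The `call_model` function is a hypothetical stand-in for whichever client you use, and the schema and retry policy are illustrative assumptions, not our exact test. A model with a higher structured-output score simply burns fewer retries in a loop like this.

```python
import json
import jsonschema  # pip install jsonschema

# Illustrative schema: the model must return a name and a list of tags.
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "tags"],
}

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your chat-completion client
    (e.g. Mistral Small 4 via your provider's SDK)."""
    raise NotImplementedError

def get_structured_output(prompt: str, max_retries: int = 3) -> dict:
    """Request JSON, validate against SCHEMA, retry on invalid output."""
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            jsonschema.validate(data, SCHEMA)
            return data  # valid JSON that satisfies the schema
        except (json.JSONDecodeError, jsonschema.ValidationError):
            # Feed the failure back and ask again.
            prompt += "\nYour last reply was not valid JSON for the schema. Try again."
    raise ValueError(f"No schema-valid JSON after {max_retries} attempts")
```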
Pricing Analysis
Based on the listed pricing, Mistral Small 4 costs $0.15/MTok input plus $0.60/MTok output, a combined $0.75 for one million input tokens and one million output tokens; Mistral Small 3.2 24B costs $0.075/MTok input plus $0.20/MTok output, a combined $0.275. At 1,000x that volume (1B input and 1B output tokens per month) that is $750 vs $275; at 10B each it's $7,500 vs $2,750; at 100B each, $75,000 vs $27,500. Overall that is roughly a 3x price ratio: Small 4 costs about three times as much at list rates. Teams with heavy output volumes (analytics, large-scale chat, or content generation) should care about the ~2.7–3x cost gap; smaller projects, or those that need the specific quality wins of Small 4, may absorb the premium.
Real-World Cost Comparison
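A quick back-of-the-envelope calculator makes the gap concrete. This is a minimal sketch using the list rates above; the model keys and the 1B-in/1B-out monthly volume are illustrative assumptions, not official identifiers or a typical workload.

```python
# Monthly cost sketch using the list rates above. Rates are dollars per
# million tokens (MTok); volumes are in MTok and purely illustrative.
RATES = {
    "mistral-small-4":       {"input": 0.150, "output": 0.600},
    "mistral-small-3.2-24b": {"input": 0.075, "output": 0.200},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month of input_mtok/output_mtok million tokens."""
    r = RATES[model]
    return input_mtok * r["input"] + output_mtok * r["output"]

# Example: 1B input tokens (1,000 MTok) and 1B output tokens per month.
for model in RATES:
    print(model, f"${monthly_cost(model, 1_000, 1_000):,.2f}")
# mistral-small-4       $750.00
# mistral-small-3.2-24b $275.00
```

Because output is priced 3x higher on Small 4 versus 2x on input, output-heavy workloads feel the gap more than input-heavy ones.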
Bottom Line
Choose Mistral Small 4 if you need reliable schema/JSON outputs, stronger creative problem solving, multilingual parity, better persona consistency, or improved safety calibration; you pay a premium (a combined ~$0.75/MTok). Choose Mistral Small 3.2 24B if your priority is cost efficiency (a combined ~$0.275/MTok) or you need the best constrained-rewriting and classification performance of these two models.
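If you run both models behind one endpoint, the per-benchmark wins above suggest a simple task router: send each request to whichever model won that category in our testing, and default to the cheaper model for ties and unknown tasks. A minimal sketch, assuming hypothetical task labels and model identifiers:

```python
# Route tasks to whichever model won that benchmark in our testing.
# Task labels and model identifiers are illustrative, not official names.
ROUTES = {
    "structured_output":        "mistral-small-4",
    "creative_problem_solving": "mistral-small-4",
    "multilingual":             "mistral-small-4",
    "persona":                  "mistral-small-4",
    "strategic_analysis":       "mistral-small-4",
    "safety_sensitive":         "mistral-small-4",
    "constrained_rewriting":    "mistral-small-3.2-24b",
    "classification":           "mistral-small-3.2-24b",
}

def pick_model(task: str, default: str = "mistral-small-3.2-24b") -> str:
    """Default to the cheaper model: the four tied categories scored
    identically in our tests, so there is no quality reason to pay more."""
    return ROUTES.get(task, default)
```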
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
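For readers who want to replicate the setup, here is a minimal sketch of a 1–5 LLM-judge scoring loop. The rubric wording and the `judge_model` client are assumptions for illustration, not our exact prompt or stack.

```python
# Minimal sketch of an LLM-judge scoring loop. The rubric text is
# illustrative; judge_model() is a hypothetical chat-completion client.
RUBRIC = (
    "Score the RESPONSE to the TASK on a 1-5 scale: "
    "1 = fails the task, 3 = partially correct, 5 = fully correct "
    "and well-formed. Reply with a single digit."
)

def judge_model(prompt: str) -> str:
    """Hypothetical stand-in for the judge model's client."""
    raise NotImplementedError

def score(task: str, response: str) -> int:
    """Ask the judge for a 1-5 score and parse the first digit it emits."""
    reply = judge_model(f"{RUBRIC}\n\nTASK:\n{task}\n\nRESPONSE:\n{response}")
    digit = next((c for c in reply if c in "12345"), None)
    if digit is None:
        raise ValueError(f"Judge reply had no 1-5 score: {reply!r}")
    return int(digit)
```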