Mistral Small 4 vs Mistral Small 3.2 24B

In our testing, Mistral Small 4 is the better generalist: it wins 6 of 12 benchmarks and is stronger at structured output, creative problem solving, multilingual output, and persona consistency. Mistral Small 3.2 24B wins on constrained rewriting and classification and is materially cheaper, so choose it when cost per token matters.


Mistral Small 4

Overall
3.83/5 (Strong)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
5/5
Tool Calling
4/5
Classification
2/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
2/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.150/MTok

Output

$0.600/MTok

Context Window: 262K

modelpicker.net


Mistral Small 3.2 24B

Overall
3.25/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
4/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.075/MTok

Output

$0.200/MTok

Context Window: 128K


Benchmark Analysis

All benchmark statements below are from our testing. Overall wins: Mistral Small 4 (A) wins 6 categories, Mistral Small 3.2 24B (B) wins 2, and 4 are ties.

Where A wins:

- Structured output: A scores 5 vs B 4; A is tied for 1st in our ranking ("tied for 1st with 24 other models out of 54 tested"). This matters for strict JSON/schema outputs: Small 4 will more reliably follow format constraints.
- Creative problem solving: A 4 vs B 2; A ranks "rank 9 of 54 (21 models share this score)" while B ranks "rank 47 of 54...". For ideation or non-obvious solutions, Small 4 produces more specific, feasible ideas.
- Safety calibration: A 2 vs B 1; A ranks "rank 12 of 55 (20 models share this score)" vs B "rank 32 of 55...". Small 4 better balances refusing harmful requests and permitting legitimate ones in our tests.
- Persona consistency: A 5 vs B 3; A is "tied for 1st with 36 other models out of 53 tested" while B is "rank 45 of 53...". If maintaining a character or role is important, Small 4 is stronger.
- Multilingual: A 5 vs B 4; A is "tied for 1st with 34 other models out of 55 tested" while B is "rank 36 of 55...". Small 4 gives higher-quality non-English outputs in our suite.
- Strategic analysis: A 4 vs B 2; A "rank 27 of 54" vs B "rank 44 of 54". For nuanced tradeoff reasoning with numbers, Small 4 is more capable.

Where B wins:

- Constrained rewriting: B 4 vs A 3; B ranks "rank 6 of 53 (25 models share this score)" compared to A "rank 31 of 53...". For tight character-limited rewriting (compression), Small 3.2 24B performed better.
- Classification: B 3 vs A 2; B is "rank 31 of 53" vs A "rank 51 of 53". B is more reliable for routing and simple categorization.

Ties (identical scores in our tests): tool calling (both 4; both show "rank 18 of 54..."), faithfulness (both 4; A "rank 34 of 55..." and B "rank 34 of 55..."), long context (both 4; both "rank 38 of 55..."), and agentic planning (both 4; both "rank 16 of 54..."). These ties mean both models behave similarly for function selection, retrieval over long context, and goal decomposition in our suite.

In short: Small 4 trades a higher per-token cost for clearer wins in structured formats, creative problem solving, multilingual output, persona consistency, and safety calibration; Small 3.2 24B is cheaper and better at constrained rewriting and basic classification.
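To illustrate what the structured-output category stresses, here is a hypothetical compliance check (for illustration only; this is not the judge used in our benchmark suite): a reply only counts as usable downstream if it parses as JSON and contains every requested top-level key.

```python
import json

def check_structured_reply(raw: str, required_keys: set[str]) -> bool:
    """Return True iff the model reply is valid JSON containing every
    requested top-level key. Hypothetical checker for illustration;
    not the actual judge from our benchmark suite."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

# A bare JSON object passes; JSON wrapped in chatty prose fails outright,
# which is the kind of failure the structured-output benchmark penalizes.
good = '{"title": "Q3 report", "sentiment": "positive"}'
bad = 'Sure! Here is the JSON: {"title": "Q3 report"}'
print(check_structured_reply(good, {"title", "sentiment"}))  # True
print(check_structured_reply(bad, {"title", "sentiment"}))   # False
```

A model that scores higher here produces fewer replies that fall into the second case, which directly reduces retry and repair logic in production pipelines.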

| Benchmark | Mistral Small 4 | Mistral Small 3.2 24B |
| --- | --- | --- |
| Faithfulness | 4/5 | 4/5 |
| Long Context | 4/5 | 4/5 |
| Multilingual | 5/5 | 4/5 |
| Tool Calling | 4/5 | 4/5 |
| Classification | 2/5 | 3/5 |
| Agentic Planning | 4/5 | 4/5 |
| Structured Output | 5/5 | 4/5 |
| Safety Calibration | 2/5 | 1/5 |
| Strategic Analysis | 4/5 | 2/5 |
| Persona Consistency | 5/5 | 3/5 |
| Constrained Rewriting | 3/5 | 4/5 |
| Creative Problem Solving | 4/5 | 2/5 |
| Summary | 6 wins | 2 wins |
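The win/tie tally in the summary row follows mechanically from the per-category scores. A minimal sketch, with the scores copied from the table above:

```python
# Scores from the comparison table: (Mistral Small 4, Mistral Small 3.2 24B),
# each on a 1-5 scale.
scores = {
    "Faithfulness": (4, 4),
    "Long Context": (4, 4),
    "Multilingual": (5, 4),
    "Tool Calling": (4, 4),
    "Classification": (2, 3),
    "Agentic Planning": (4, 4),
    "Structured Output": (5, 4),
    "Safety Calibration": (2, 1),
    "Strategic Analysis": (4, 2),
    "Persona Consistency": (5, 3),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (4, 2),
}

a_wins = sum(a > b for a, b in scores.values())
b_wins = sum(b > a for a, b in scores.values())
ties = sum(a == b for a, b in scores.values())

print(f"Small 4 wins: {a_wins}, Small 3.2 wins: {b_wins}, ties: {ties}")
# Small 4 wins: 6, Small 3.2 wins: 2, ties: 4
```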

Pricing Analysis

Per the listed pricing, Mistral Small 4 costs $0.150 input + $0.600 output = $0.750 per MTok combined; Mistral Small 3.2 24B costs $0.075 input + $0.200 output = $0.275 per MTok combined. For a workload of 1M input plus 1M output tokens per month, that is $0.75 vs $0.275. At 100M tokens each it's $75 vs $27.50, and at 1B tokens each it's $750 vs $275. Our pricing data lists a price ratio of 3 (Small 4 is roughly three times the list rate; the exact combined-rate ratio is about 2.7x). Teams with heavy output volumes (analytics, large-scale chat, or content generation) should care about the ~2.7-3x cost gap; smaller projects, or those that need the specific quality wins of Small 4, may absorb the premium.
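This arithmetic is easy to adapt to your own traffic mix; a minimal sketch using the list rates from the pricing cards above:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 input_rate: float, output_rate: float) -> float:
    """Dollar cost for a workload measured in millions of tokens (MTok),
    priced at separate input/output list rates in $/MTok."""
    return input_mtok * input_rate + output_mtok * output_rate

# List rates ($/MTok input, $/MTok output) from the pricing cards.
SMALL_4 = (0.150, 0.600)
SMALL_3_2 = (0.075, 0.200)

for mtok in (1, 100, 1000):  # 1M, 100M, 1B tokens of input and of output
    c4 = monthly_cost(mtok, mtok, *SMALL_4)
    c32 = monthly_cost(mtok, mtok, *SMALL_3_2)
    print(f"{mtok}M in + {mtok}M out: ${c4:,.2f} vs ${c32:,.2f} "
          f"(ratio {c4 / c32:.2f}x)")
```

Note that because output tokens cost 3-4x more than input tokens for both models, output-heavy workloads (e.g. long-form generation) sit closer to the upper end of the cost gap than retrieval-heavy ones.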

Real-World Cost Comparison

| Task | Mistral Small 4 | Mistral Small 3.2 24B |
| --- | --- | --- |
| Chat response | <$0.001 | <$0.001 |
| Blog post | $0.0013 | <$0.001 |
| Document batch | $0.033 | $0.011 |
| Pipeline run | $0.330 | $0.115 |

Bottom Line

Choose Mistral Small 4 if you need reliable schema/JSON outputs, stronger creative problem solving, multilingual parity, better persona consistency, or improved safety calibration; you pay a premium (about $0.75/MTok combined input + output rate). Choose Mistral Small 3.2 24B if your priority is cost efficiency (about $0.275/MTok combined) or you need the best constrained rewriting and classification performance of these two models.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions