Mistral Small 3.1 24B vs Mistral Small 3.2 24B

For most developers and high-volume deployments, Mistral Small 3.2 24B is the better pick: it wins 4 of 12 benchmarks (including tool calling) and is far cheaper per token. Mistral Small 3.1 24B still wins where long-context retrieval and strategic analysis matter (long context 5 vs 4; strategic analysis 3 vs 2).

Mistral Small 3.1 24B

Overall: 2.92/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 1/5
Classification: 3/5
Agentic Planning: 3/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 3/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.350/MTok
Output: $0.560/MTok
Context Window: 128K


Mistral Small 3.2 24B

Overall: 3.25/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 4/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.075/MTok
Output: $0.200/MTok
Context Window: 128K


Benchmark Analysis

Across our 12-test suite, Mistral Small 3.2 24B wins 4 benchmarks, Mistral Small 3.1 24B wins 2, and 6 benchmarks tie (see the table below). Detailed breakdown:

  • tool calling: 3.2 scores 4 vs 3.1's 1. Rankings reflect this gap: 3.2 ranks 18 of 54 (tied with 28 other models) while 3.1 ranks 53 of 54. This matters for function selection, argument accuracy, and call sequencing in agentic flows (see the function-calling sketch after the comparison table).
  • constrained rewriting: 3.2 scores 4 vs 3.1's 3; 3.2 ranks 6 of 53 vs 3.1's 31. If you need tight-format rewriting (hard character limits, aggressive compression), 3.2 is the better choice.
  • persona consistency: 3.2 scores 3 vs 3.1's 2; 3.2 ranks 45 of 53 vs 3.1's 51. 3.2 resists prompt injection and holds character better in our tests.
  • agentic planning: 3.2 scores 4 vs 3.1's 3; 3.2 ranks 16 of 54 vs 3.1's 42. For goal decomposition and error recovery, 3.2 is noticeably stronger.
  • long context: 3.1 scores 5 vs 3.2's 4; 3.1 is tied for 1st with 36 other models on long context (retrieval accuracy at 30K+ tokens) while 3.2 ranks 38 of 55. Choose 3.1 when you rely on large-context retrieval accuracy.
  • strategic analysis: 3.1 scores 3 vs 3.2's 2; 3.1 ranks 36 of 54 vs 3.2's 44. For nuanced tradeoff reasoning with real numbers, 3.1 has the edge.

The remaining six benchmarks tie: faithfulness (both 4/5), multilingual (both 4/5), classification (both 3/5), structured output (both 4/5), safety calibration (both 1/5), and creative problem solving (both 2/5). Equal structured-output scores mean JSON/schema adherence is similar in our tests. Note also that 3.1 carries an explicit 'no_tool calling' quirk flag in our data, which mechanistically explains its bottom-tier tool-calling score.
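To reproduce the headline tally, here is a minimal Python sketch over the per-benchmark scores, transcribed from the comparison table below (pure arithmetic, no API calls):

```python
# Per-benchmark scores transcribed from the comparison table on this page.
scores_31 = {
    "faithfulness": 4, "long_context": 5, "multilingual": 4,
    "tool_calling": 1, "classification": 3, "agentic_planning": 3,
    "structured_output": 4, "safety_calibration": 1, "strategic_analysis": 3,
    "persona_consistency": 2, "constrained_rewriting": 3,
    "creative_problem_solving": 2,
}
scores_32 = {
    "faithfulness": 4, "long_context": 4, "multilingual": 4,
    "tool_calling": 4, "classification": 3, "agentic_planning": 4,
    "structured_output": 4, "safety_calibration": 1, "strategic_analysis": 2,
    "persona_consistency": 3, "constrained_rewriting": 4,
    "creative_problem_solving": 2,
}

# Count head-to-head wins and ties across the 12 benchmarks.
wins_31 = sum(scores_31[b] > scores_32[b] for b in scores_31)
wins_32 = sum(scores_32[b] > scores_31[b] for b in scores_31)
ties = sum(scores_31[b] == scores_32[b] for b in scores_31)
print(f"3.1 wins: {wins_31}, 3.2 wins: {wins_32}, ties: {ties}")
# -> 3.1 wins: 2, 3.2 wins: 4, ties: 6
```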
Benchmark                 | Mistral Small 3.1 24B | Mistral Small 3.2 24B
Faithfulness              | 4/5                   | 4/5
Long Context              | 5/5                   | 4/5
Multilingual              | 4/5                   | 4/5
Tool Calling              | 1/5                   | 4/5
Classification            | 3/5                   | 3/5
Agentic Planning          | 3/5                   | 4/5
Structured Output         | 4/5                   | 4/5
Safety Calibration        | 1/5                   | 1/5
Strategic Analysis        | 3/5                   | 2/5
Persona Consistency       | 2/5                   | 3/5
Constrained Rewriting     | 3/5                   | 4/5
Creative Problem Solving  | 2/5                   | 2/5
Summary                   | 2 wins                | 4 wins
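To make the tool-calling gap concrete, the sketch below shows the kind of function-calling request that benchmark exercises, written against the mistralai v1 Python SDK. The weather tool, its schema, and the `mistral-small-latest` alias are illustrative assumptions, not our test harness; check Mistral's docs for current model identifiers.

```python
import json
import os

from mistralai import Mistral  # pip install mistralai (v1 SDK)

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Hypothetical tool schema. The benchmark checks whether the model picks the
# right function, fills arguments accurately, and sequences calls correctly.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.complete(
    model="mistral-small-latest",  # assumption: pin an exact version in production
    messages=[{"role": "user", "content": "What's the weather in Lyon?"}],
    tools=tools,
    tool_choice="auto",
)

# A model that scores well emits a well-formed call instead of prose.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
# Expected shape: get_weather {'city': 'Lyon'}
```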

Pricing Analysis

Per-token economics are a major differentiator. Mistral Small 3.1 24B lists at $0.35 input + $0.56 output per million tokens ($0.91/MTok combined); Mistral Small 3.2 24B lists at $0.075 input + $0.20 output per million tokens ($0.275/MTok combined). At realistic bidirectional usage (a 1:1 input:output split):

  • 1M tokens/month: 3.1 costs about $0.46; 3.2 about $0.14.
  • 10M tokens/month: 3.1 costs about $4.55; 3.2 about $1.38.
  • 100M tokens/month: 3.1 costs about $45.50; 3.2 about $13.75.

Blended at 1:1, 3.1 runs roughly 3.3x more expensive overall; the 2.8x ratio reported in our data is the output-price ratio alone (input prices differ by 4.7x). Teams with high token volumes, narrow margins, or large-scale chat/call automation should prioritize 3.2; small-scale prototypes, or cases that specifically need 3.1's long-context advantage, may accept the higher cost. The cost arithmetic is sketched below.
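A minimal sketch of that arithmetic, assuming the list prices from the cards above (the model keys are local labels for this snippet, not API identifiers):

```python
# List prices in USD per million tokens (input, output), from the cards above.
PRICES = {
    "mistral-small-3.1-24b": (0.35, 0.56),
    "mistral-small-3.2-24b": (0.075, 0.20),
}

def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """USD cost for one month of traffic at list prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# 10M tokens/month at a 1:1 input:output split (5M tokens each way):
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 5e6, 5e6):,.2f}")
# -> mistral-small-3.1-24b: $4.55
#    mistral-small-3.2-24b: $1.38
```

Swap in your own traffic mix: input-heavy workloads widen 3.2's advantage further, since its input price is 4.7x lower while its output price is 2.8x lower.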

Real-World Cost Comparison

Task           | Mistral Small 3.1 24B | Mistral Small 3.2 24B
Chat response  | <$0.001               | <$0.001
Blog post      | $0.0013               | <$0.001
Document batch | $0.035                | $0.011
Pipeline run   | $0.350                | $0.115

Bottom Line

Choose Mistral Small 3.1 24B if: you need the strongest possible long-context retrieval (long context score 5/5, tied for 1st) or slightly better strategic analysis (3 vs 2), and you can absorb roughly 3x higher token costs. Choose Mistral Small 3.2 24B if: you prioritize tool calling, constrained rewriting, persona consistency, or agentic planning (3.2 wins all four), or you run high-volume production where cost per token matters (3.2's combined list price is $0.275/MTok vs 3.1's $0.91/MTok).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
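For a mechanical picture of what "scored 1–5 by an LLM judge" means, here is a hypothetical sketch; the rubric text, judge model choice, and score parsing are illustrative assumptions, not our actual harness (see the full methodology for that).

```python
import os
import re

from mistralai import Mistral  # same v1 SDK as above; judge model is an assumption

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

RUBRIC = """You are a strict benchmark judge. Score the candidate answer from
1 (unusable) to 5 (excellent) against the task instructions. Reply with the
integer score on the first line, then one sentence of justification."""

def judge(task: str, answer: str) -> int:
    """Ask a judge model for a 1-5 score; illustrative rubric, not our exact one."""
    resp = client.chat.complete(
        model="mistral-large-latest",  # assumption: any capable judge model works
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"TASK:\n{task}\n\nANSWER:\n{answer}"},
        ],
        temperature=0,  # deterministic-ish scoring
    )
    text = resp.choices[0].message.content
    match = re.search(r"[1-5]", text)
    if match is None:
        raise ValueError(f"Judge returned no score: {text!r}")
    return int(match.group())

# Example: score = judge("Rewrite this paragraph in under 20 words ...", candidate_output)
```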

Frequently Asked Questions