Mistral Medium 3.1 vs Mistral Small 3.2 24B

In our testing, Mistral Medium 3.1 is the better choice for production use that prioritizes accuracy, long-context retrieval, multilingual output, and complex planning: it wins 9 of our 12 benchmarks. Mistral Small 3.2 24B wins none of the benchmarks here but is ~10x cheaper, making it the right pick for high-volume, cost-sensitive deployments or inexpensive prototyping.


Mistral Medium 3.1

Overall
4.25/5 (Strong)

Benchmark Scores

Faithfulness
4/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
5/5
Creative Problem Solving
3/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K

modelpicker.net


Mistral Small 3.2 24B

Overall
3.25/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
4/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.075/MTok

Output

$0.200/MTok

Context Window: 128K


Benchmark Analysis

Summary of head-to-head results in our 12-test suite (scores on a 1–5 scale; ranks use the tested pool sizes from our data):

- Strategic analysis: Medium 3.1 5 vs Small 3.2 2. Medium tied for 1st (with 25 others out of 54 tested); Small ranks 44/54. This matters for nuanced tradeoff reasoning with numbers.
- Constrained rewriting: 5 vs 4. Medium tied for 1st; Small ranks 6/53. Medium is better at tight-format compression.
- Creative problem solving: 3 vs 2. Medium ranks 30/54; Small ranks 47/54. Medium generates more feasible, non-obvious ideas.
- Classification: 4 vs 3. Medium tied for 1st (with 29 others out of 53); Small ranks 31/53. Medium is stronger for accurate routing and tagging.
- Long context: 5 vs 4. Medium tied for 1st; Small ranks 38/55. Medium is preferable for 30K+ token retrieval tasks.
- Safety calibration: 2 vs 1. Medium ranks 12/55; Small ranks 32/55. Medium is more reliable at refusing harmful requests while permitting legitimate ones.
- Persona consistency: 5 vs 3. Medium tied for 1st; Small ranks 45/53. Medium better resists prompt injection and stays in character.
- Agentic planning: 5 vs 4. Medium tied for 1st; Small ranks 16/54. Medium shows stronger goal decomposition and failure recovery.
- Multilingual: 5 vs 4. Medium tied for 1st (with 34 others); Small ranks 36/55. Medium delivers higher-quality non-English output.
- Ties (no clear winner): Structured output 4 vs 4 (both rank 26/54), so JSON/schema compliance is equal; Tool calling 4 vs 4 (both rank 18/54), with similar function selection and sequencing; Faithfulness 4 vs 4 (both rank 34/55), with both sticking to sources equally well.

Overall: Medium wins 9 tests, Small wins none, and 3 tests are ties.
For real tasks, this means Medium 3.1 consistently outperforms Small 3.2 24B on high-stakes accuracy, long-context retrieval, multilingual support, planning, and tightly constrained rewriting; Small matches Medium on function calling, structured-output format adherence, and faithfulness, but falls behind on classification and strategic reasoning.

Benchmark                   Mistral Medium 3.1   Mistral Small 3.2 24B
Faithfulness                4/5                  4/5
Long Context                5/5                  4/5
Multilingual                5/5                  4/5
Tool Calling                4/5                  4/5
Classification              4/5                  3/5
Agentic Planning            5/5                  4/5
Structured Output           4/5                  4/5
Safety Calibration          2/5                  1/5
Strategic Analysis          5/5                  2/5
Persona Consistency         5/5                  3/5
Constrained Rewriting       5/5                  4/5
Creative Problem Solving    3/5                  2/5
Summary                     9 wins               0 wins
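As a sanity check, the win/tie tally can be recomputed directly from the per-benchmark scores. A minimal sketch (scores transcribed from this page; the dict layout is illustrative, not a modelpicker API):

```python
# Per-benchmark scores (1-5) transcribed from the comparison above:
# (Mistral Medium 3.1, Mistral Small 3.2 24B)
scores = {
    "Faithfulness": (4, 4),
    "Long Context": (5, 4),
    "Multilingual": (5, 4),
    "Tool Calling": (4, 4),
    "Classification": (4, 3),
    "Agentic Planning": (5, 4),
    "Structured Output": (4, 4),
    "Safety Calibration": (2, 1),
    "Strategic Analysis": (5, 2),
    "Persona Consistency": (5, 3),
    "Constrained Rewriting": (5, 4),
    "Creative Problem Solving": (3, 2),
}

# Tally head-to-head outcomes across all 12 benchmarks.
medium_wins = sum(m > s for m, s in scores.values())
small_wins = sum(s > m for m, s in scores.values())
ties = sum(m == s for m, s in scores.values())

print(medium_wins, small_wins, ties)  # 9 0 3
```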

Pricing Analysis

Combined list prices (input + output, per MTok, where 1 MTok = 1 million tokens): Mistral Medium 3.1 = $0.40 + $2.00 = $2.40 per MTok; Mistral Small 3.2 24B = $0.075 + $0.20 = $0.275 per MTok. Using that combined rate as a rough blended cost at typical monthly volumes:

- 1M tokens/month: Medium 3.1 ≈ $2.40; Small 3.2 24B ≈ $0.28.
- 10M tokens/month: Medium ≈ $24; Small ≈ $2.75.
- 100M tokens/month: Medium ≈ $240; Small ≈ $27.50.

Medium is roughly 10x more costly per token (the exact combined-rate ratio is about 8.7x). Who should care: teams with heavy, sustained traffic (10M–100M tokens/month) will find the cost delta material and should budget accordingly; startups and cost-sensitive apps should prefer Mistral Small 3.2 24B unless Medium's accuracy advantages justify the extra spend.
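Monthly cost at a given volume follows directly from the per-MTok rates on the pricing cards. A minimal sketch using the combined input+output rate as a rough blended cost (a simplification: an actual bill depends on the input/output token split):

```python
# $/MTok (1 MTok = 1 million tokens): (input, output), from the pricing cards.
PRICES = {
    "Mistral Medium 3.1": (0.40, 2.00),
    "Mistral Small 3.2 24B": (0.075, 0.20),
}

def monthly_cost(model: str, mtok_per_month: float) -> float:
    """Blended monthly cost in dollars at the combined input+output rate."""
    in_rate, out_rate = PRICES[model]
    return mtok_per_month * (in_rate + out_rate)

for volume in (1, 10, 100):  # MTok/month
    med = monthly_cost("Mistral Medium 3.1", volume)
    small = monthly_cost("Mistral Small 3.2 24B", volume)
    print(f"{volume}M tok/mo: Medium ${med:.2f} vs Small ${small:.2f}")
```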

Real-World Cost Comparison

Task             Mistral Medium 3.1   Mistral Small 3.2 24B
Chat response    $0.0011              <$0.001
Blog post        $0.0042              <$0.001
Document batch   $0.108               $0.011
Pipeline run     $1.08                $0.115
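Per-task estimates like these come from multiplying token counts by the per-MTok rates. A minimal sketch, where the token counts (roughly 300 input + 500 output for a short chat response) are our own illustrative assumption, not figures published by modelpicker:

```python
def task_cost(in_tokens: int, out_tokens: int,
              in_price: float, out_price: float) -> float:
    """Cost in dollars for one task, given token counts and $/MTok rates."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# Mistral Medium 3.1 rates: $0.40/MTok input, $2.00/MTok output.
# Assumed (hypothetical) ~300 input + ~500 output tokens per chat response.
cost = task_cost(300, 500, 0.40, 2.00)
print(f"${cost:.4f}")  # $0.0011
```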

Bottom Line

Choose Mistral Medium 3.1 if you need higher accuracy for multilingual support, long-context retrieval, strategic reasoning, agentic planning, constrained rewriting, or production classification — our tests show Medium wins 9 of 12 benchmarks and ranks near the top in those areas. Choose Mistral Small 3.2 24B if your priority is cost-efficiency at scale or cheap experimentation: it's roughly 10x cheaper per token (≈ $0.28 vs $2.40 per 1M tokens at combined input+output rates) and ties Medium on tool calling, structured output, and faithfulness.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions