Ministral 3 3B 2512 vs Mistral Small 3.2 24B
Winner for most common use cases: Ministral 3 3B 2512. It wins 5 of our 12 benchmarks and is materially cheaper on typical mixed workloads (a flat $0.100/MTok for input and output, versus $0.075 in / $0.200 out). Mistral Small 3.2 24B wins the one benchmark it leads, agentic planning (4 vs 3), and is worth considering when goal decomposition and failure recovery are primary requirements, but expect higher output costs.
Ministral 3 3B 2512 (Mistral)
Pricing: $0.100/MTok input, $0.100/MTok output

Mistral Small 3.2 24B (Mistral)
Pricing: $0.075/MTok input, $0.200/MTok output
Benchmark Analysis
Across our 12-test suite, Ministral 3 3B 2512 wins 5 benchmarks, Mistral Small 3.2 24B wins 1, and 6 are ties. Detailed walk-through:

- Faithfulness: Ministral 3 3B 2512 scores 5 vs 4 and is tied for 1st (rank 1 of 55, tied with 32 models). This matters for tasks needing strict adherence to source material (contracts, citations).
- Constrained rewriting: 5 vs 4 for Ministral 3 3B 2512 (tied for 1st with 4 others); better for compression into hard limits (SMS, UI snippets).
- Classification: 4 vs 3 for Ministral 3 3B 2512 (tied for 1st with 29 others); more reliable routing and labeling.
- Creative problem solving: 3 vs 2 for Ministral 3 3B 2512 (rank 30 of 54 vs rank 47 for Mistral Small 3.2 24B); it generated more feasible, non-obvious ideas in our tests.
- Persona consistency: 4 vs 3 for Ministral 3 3B 2512 (rank 38 vs rank 45); it better resists injection and keeps tone and character.
- Agentic planning: Mistral Small 3.2 24B wins 4 vs 3 and ranks substantially better (16 of 54 vs 42); pick it when goal decomposition and recovery matter (agents, multi-step orchestration).
- Ties (no clear winner in our tests): structured output 4/4 (JSON/schema tasks), tool calling 4/4 (function selection and arguments), long context 4/4 (30k+ retrieval), strategic analysis 2/2 (nuanced tradeoffs), safety calibration 1/1 (refusal/permissiveness), multilingual 4/4.

Practical interpretation: Ministral 3 3B 2512 is the stronger option when you need faithfulness, constrained rewriting, classification, and creative problem solving per token, at a materially lower blended cost on output-heavy traffic. Mistral Small 3.2 24B stands out when agentic planning quality (rank 16 of 54) is decisive despite its higher output pricing.
Pricing Analysis
Per-token rates from the payload: Ministral 3 3B 2512 charges $0.100 per MTok (million tokens) for both input and output. Mistral Small 3.2 24B charges $0.075 per MTok input and $0.200 per MTok output. Using a 50/50 input/output split as an example: 1M total tokens (500k input + 500k output) costs $0.10 on Ministral 3 3B 2512 (0.5 × $0.100 + 0.5 × $0.100 = $0.05 + $0.05) and $0.1375 on Mistral Small 3.2 24B (0.5 × $0.075 + 0.5 × $0.200 = $0.0375 + $0.10). Scale: at 100M tokens/month those totals become $10.00 vs $13.75; at 1B tokens/month, $100 vs $137.50. Who should care: the absolute dollar amounts are small at these rates, so the roughly 37.5% blended premium only becomes material for very high-volume deployments (embedded products, batch inference at billions of tokens per month). Teams that generate large outputs (long generations, transcripts, batch inference) should note Mistral Small 3.2 24B's 2× output rate ($0.200/MTok); conversely, for very input-heavy traffic (above roughly an 80% input share) its lower input rate makes it the cheaper model. The payload's priceRatio (0.5) matches the output-rate ratio ($0.100 vs $0.200 per MTok) rather than a blended cost.
Real-World Cost Comparison
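The blended figures above can be reproduced in a few lines. A minimal sketch in Python, with the per-MTok rates hard-coded from the pricing cards above (the `blended_cost` helper and the choice of traffic splits are ours, for illustration only):

```python
# Per-MTok rates copied from the pricing cards above.
RATES = {
    "Ministral 3 3B 2512":   {"input": 0.100, "output": 0.100},
    "Mistral Small 3.2 24B": {"input": 0.075, "output": 0.200},
}

def blended_cost(model: str, total_mtok: float, input_share: float) -> float:
    """Dollar cost of `total_mtok` million tokens, where `input_share`
    (0.0-1.0) of them are input tokens and the rest are billed as output."""
    r = RATES[model]
    return total_mtok * (input_share * r["input"] + (1 - input_share) * r["output"])

# Compare both models at 100M tokens/month across a few traffic shapes.
for share in (0.5, 0.8, 0.9):
    a = blended_cost("Ministral 3 3B 2512", 100, share)
    b = blended_cost("Mistral Small 3.2 24B", 100, share)
    print(f"{share:.0%} input share: ${a:.2f} vs ${b:.2f}")
```

At a 50/50 split this prints $10.00 vs $13.75 (Ministral 3 3B 2512 about 27% cheaper), the two models break even near an 80% input share, and for very input-heavy traffic (long RAG prompts, short answers) Mistral Small 3.2 24B comes out ahead at $8.75 vs $10.00.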
Bottom Line
Choose Ministral 3 3B 2512 if you need a cost-efficient general-purpose model with best-in-class faithfulness and constrained rewriting; at a 50/50 input/output split it is about 27% cheaper, which adds up at high volume. Use cases: production chat assistants with tight content fidelity, classification/routing systems, SMS/UX-limited rewriting, and image-to-text tasks where long context is required (131,072-token context window). Choose Mistral Small 3.2 24B if agentic planning and multi-step orchestration are central (agentic planning 4 vs 3, rank 16 of 54) and you can absorb the higher output rate ($0.200/MTok). Use cases: agent frameworks and automated workflows that require robust goal decomposition and failure recovery.
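If both models sit behind one endpoint, the bottom line reduces to a small routing rule. A sketch under assumed names: the task-type strings and the model identifiers `ministral-3b-2512` and `mistral-small-3.2-24b` are placeholders, not confirmed API IDs; substitute the names from your provider's catalog:

```python
# Task types where Mistral Small 3.2 24B's agentic-planning edge
# (rank 16 of 54 vs 42) justifies its 2x output rate.
AGENTIC_TASKS = {"agent_loop", "goal_decomposition", "multi_step_orchestration"}

def pick_model(task_type: str) -> str:
    # Placeholder model IDs; check your provider's catalog for the real names.
    if task_type in AGENTIC_TASKS:
        return "mistral-small-3.2-24b"
    return "ministral-3b-2512"
```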
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.