Devstral Small 1.1 vs Ministral 3 3B 2512
For most production use cases, pick Ministral 3 3B 2512: it wins 5 of 12 benchmarks in our testing and costs $0.10 vs $0.30 per output MTok. Choose Devstral Small 1.1 only if safety calibration is your primary requirement, where it scores higher, and you accept the roughly 3x output cost.
Devstral Small 1.1 (mistral)
Pricing: $0.100/MTok input · $0.300/MTok output
modelpicker.net
Ministral 3 3B 2512 (mistral)
Pricing: $0.100/MTok input · $0.100/MTok output
Benchmark Analysis
We ran a 12-test suite and compared per-task scores and ranked positions; all statements below are from our testing.

Ministral 3 3B 2512 wins five tests:
- Faithfulness: B 5 vs A 4 (B tied for 1st with 32 others out of 55 tested)
- Constrained rewriting: B 5 vs A 3 (B tied for 1st with 4 others)
- Creative problem solving: B 3 vs A 2 (B ranks 30 of 54)
- Persona consistency: B 4 vs A 2 (B ranks 38 of 53; Devstral ranks 51 of 53)
- Agentic planning: B 3 vs A 2 (B ranks 42 of 54; Devstral ranks 53 of 54)

Devstral Small 1.1 wins one test:
- Safety calibration: A 2 vs B 1 (Devstral ranks 12 of 55, with many tied)

Six tests are ties: structured output (4/4, both rank 26 of 54), strategic analysis (2/2, both rank 44 of 54), tool calling (4/4, both rank 18 of 54), classification (4/4, both tied for 1st with 29 others), long context (4/4, both rank 38 of 55), and multilingual (4/4, both rank 36 of 55).

Practically: Ministral's higher faithfulness and constrained-rewriting scores mean better adherence to source material and tighter compression into strict character limits; its persona-consistency and agentic-planning advantages translate to fewer role-injection failures and stronger goal decomposition in our tests. Devstral's safety-calibration edge means it more often refuses harmful requests appropriately in our evaluation, which matters in regulated or high-risk deployments. For tool workflows, classification, long context, and structured-output work, the two models are effectively tied in our testing.
Pricing Analysis
Both models charge $0.10 per input MTok. Output cost differs: Ministral 3 3B 2512 is $0.10/MTok and Devstral Small 1.1 is $0.30/MTok (a roughly 3x price ratio). At 1B tokens/month (1,000 MTok) with a 50/50 input/output split, Ministral costs $100/month and Devstral $200/month, a $100 difference. At 10B tokens (10,000 MTok) with a 50/50 split, it is roughly $1,000 vs $2,000, a $1,000 gap; at 100B tokens, roughly $10,000 vs $20,000, a $10,000 gap. If your workload is output-heavy (e.g., 80% output), the gap grows: at 1B tokens, $100 vs $260. High-volume text generation, chat, and multi-tenant APIs should care about this delta; small-scale experiments will feel it less.
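The arithmetic above can be sketched as a small cost calculator. The per-MTok prices are the figures quoted on this page; the function name and model keys are illustrative, not an official API:

```python
# Hypothetical monthly-cost estimator using the per-MTok prices quoted above.
# Prices are USD per million tokens (MTok); model keys are illustrative.
PRICES = {
    "ministral-3-3b-2512": {"input": 0.10, "output": 0.10},
    "devstral-small-1.1": {"input": 0.10, "output": 0.30},
}

def monthly_cost(model: str, total_tokens: int, output_share: float = 0.5) -> float:
    """Estimate monthly USD cost for a token volume and output fraction."""
    p = PRICES[model]
    mtok = total_tokens / 1_000_000  # convert raw tokens to MTok
    blended = (1 - output_share) * p["input"] + output_share * p["output"]
    return mtok * blended

# e.g. 1B tokens/month at a 50/50 split: about $100 for Ministral,
# about $200 for Devstral; at 80% output, Devstral rises to about $260.
```

Because both models share the same input price, the gap is driven entirely by the output fraction of your traffic.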
Bottom Line
Choose Ministral 3 3B 2512 if you need the lowest-cost production model with stronger faithfulness, constrained rewriting, persona consistency, and planning in our testing, or if you want multimodal input (text + image → text). Choose Devstral Small 1.1 if safety calibration is a top priority and you are willing to pay roughly 3x for output tokens ($0.30 vs $0.10 per output MTok). If you depend primarily on tool calling, classification, long-context retrieval, or structured output, either model performs similarly in our benchmarks.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.