Devstral Small 1.1 vs Ministral 3 14B 2512
In our testing, Ministral 3 14B 2512 is the better all-round choice for multi-turn assistants, creative tasks, and persona-driven agents, winning five of our twelve test categories to Devstral's one. Devstral Small 1.1 wins on safety calibration (2 vs 1) and may be preferred where stricter refusal behavior matters. Cost is comparable for balanced I/O, but Devstral becomes ~1.4x more expensive on output-heavy workloads.
Pricing (per MTok):
- Devstral Small 1.1 (Mistral): $0.100 input / $0.300 output
- Ministral 3 14B 2512 (Mistral): $0.200 input / $0.200 output
Benchmark Analysis
Summary of our 12-test suite results (A = Devstral Small 1.1, B = Ministral 3 14B 2512; scores shown are from our testing):
- Ministral 3 14B 2512 wins (in our testing):
  - Strategic analysis 4 vs 2 (B rank 27 of 54 vs A rank 44 of 54) — stronger at the nuanced tradeoff reasoning useful for planning/analysis.
  - Constrained rewriting 4 vs 3 (B rank 6 of 53) — much better at tight compression and format-preserving rewrites.
  - Creative problem solving 4 vs 2 (B rank 9 of 54) — better at generating non-obvious, feasible ideas.
  - Persona consistency 5 vs 2 (B tied for 1st with 36 others; A rank 51 of 53) — far superior at maintaining character and resisting injection.
  - Agentic planning 3 vs 2 (B rank 42 vs A rank 53) — better at goal decomposition and recovery.
- Devstral Small 1.1 wins (in our testing): safety calibration 2 vs 1 (A rank 12 of 55 vs B rank 32 of 55) — Devstral is better at refusing harmful requests while permitting legitimate ones.
- Ties (same score in our testing): structured output 4/4, tool calling 4/4, faithfulness 4/4, classification 4/4 (both tied for 1st with many models), long context 4/4, multilingual 4/4. For these tasks our testing shows parity: both models handle JSON/schema output, function selection and arguments, sticking to source material, routing/classification, retrieval at 30K+ tokens, and non-English output at equivalent levels.

Contextual takeaways: if you need a persona-driven assistant, creative ideation, or tight constrained rewriting, Ministral shows measurable advantages in our tests (notably persona consistency 5 vs 2 and constrained rewriting rank 6 of 53). If safety calibration (refusal/permission behavior) is the gating concern, Devstral is the safer pick in our testing. Both models tie on many practical engineering needs like structured outputs and tool calling.
Pricing Analysis
Prices are quoted per million tokens (MTok). Input/output rates: Devstral Small 1.1 = $0.10 input / $0.30 output per MTok; Ministral 3 14B 2512 = $0.20 input / $0.20 output per MTok. Real-world examples (assumptions noted):
- Balanced workload (50% input / 50% output): both models cost a blended $0.20/MTok (Devstral = 0.10×0.5 + 0.30×0.5 = $0.20; Ministral = 0.20×0.5 + 0.20×0.5 = $0.20). That yields: 1M tokens = $0.20; 10M = $2.00; 100M = $20.00.
- Output-heavy workload (10% input / 90% output): Devstral = a blended $0.28/MTok (0.10×0.1 + 0.30×0.9); Ministral = $0.20/MTok. Costs: 1M tokens = Devstral $0.28 vs Ministral $0.20 (diff $0.08); 10M = $2.80 vs $2.00 (diff $0.80); 100M = $28.00 vs $20.00 (diff $8.00). Who should care: high-volume content generators or apps with large output ratios should prefer Ministral, which is ~29% cheaper (Devstral costs ~1.4x as much); teams with balanced conversational I/O will see no price delta. All numbers are drawn from the model price fields above and assume the stated I/O splits.
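The blended-rate arithmetic above can be sketched in a few lines of Python. Prices come from the pricing fields shown earlier; the function names are illustrative, not part of any API:

```python
def blended_rate(input_price, output_price, input_frac):
    """Blended $/MTok for a workload where input_frac of tokens are input."""
    return input_price * input_frac + output_price * (1.0 - input_frac)

def workload_cost(total_tokens, input_price, output_price, input_frac):
    """Total dollar cost for total_tokens at the given I/O split."""
    return total_tokens / 1_000_000 * blended_rate(input_price, output_price, input_frac)

# Prices in $/MTok (input, output), from the pricing section above
DEVSTRAL = (0.10, 0.30)
MINISTRAL = (0.20, 0.20)

# Balanced 50/50 split: both land at $0.20/MTok blended
print(round(blended_rate(*DEVSTRAL, 0.5), 4))   # 0.2
print(round(blended_rate(*MINISTRAL, 0.5), 4))  # 0.2

# Output-heavy 10/90 split: Devstral's blended rate is ~1.4x Ministral's
print(round(blended_rate(*DEVSTRAL, 0.1), 4))   # 0.28
print(round(workload_cost(10_000_000, *DEVSTRAL, 0.1), 2))  # 2.8
```

Plugging in your own expected I/O split is the fastest way to see whether the output-price gap matters for your workload.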
Bottom Line
Choose Devstral Small 1.1 if: you prioritize safety calibration and strict refusal behavior (safety calibration 2 vs 1 in our tests), or you want a model positioned for software-engineering agents and can accept the smaller context window (131,072 tokens). Choose Ministral 3 14B 2512 if: you need a persona-consistent assistant, stronger creative problem solving, constrained rewriting, or a larger context window and multimodal input (persona consistency 5 vs 2, creative problem solving 4 vs 2; context window 262,144 tokens; text+image→text modality). Also pick Ministral if your workload is output-heavy: it costs a blended ~$0.20/MTok vs Devstral's ~$0.28/MTok in output-dominant scenarios.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.