Devstral Small 1.1 vs Ministral 3 14B 2512

In our testing, Ministral 3 14B 2512 is the better all-round choice for multi-turn assistants, creative tasks, and persona-driven agents (wins 5 vs 1). Devstral Small 1.1 wins on safety calibration (2 vs 1) and may be preferred where stricter refusal behavior matters; cost is comparable for balanced I/O but Devstral becomes ~1.4x more expensive on output-heavy workloads.

mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net

mistral

Ministral 3 14B 2512

Overall
3.75/5 (Strong)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.200/MTok

Output

$0.200/MTok

Context Window: 262K


Benchmark Analysis

Summary of our 12-test suite results (scores shown are from our testing):

  • Ministral 3 14B 2512 wins (in our testing):
      • Strategic analysis: 4 vs 2 (B ranks 27 of 54; A ranks 44 of 54) — stronger at nuanced tradeoff reasoning useful for planning/analysis.
      • Constrained rewriting: 4 vs 3 (B ranks 6 of 53) — much better at tight compression and format-preserving rewrites.
      • Creative problem solving: 4 vs 2 (B ranks 9 of 54) — better at generating non-obvious, feasible ideas.
      • Persona consistency: 5 vs 2 (B tied for 1st with 36 others; A ranks 51 of 53) — far superior at maintaining character and resisting injection.
      • Agentic planning: 3 vs 2 (B ranks 42; A ranks 53) — better at goal decomposition and recovery.
  • Devstral Small 1.1 wins (in our testing): safety calibration 2 vs 1 (A rank 12 of 55 vs B rank 32 of 55) — Devstral is better at refusing harmful requests while permitting legitimate ones.
  • Ties (same score in our testing): structured output 4/4, tool calling 4/4, faithfulness 4/4, classification 4/4 (both tied for 1st with many models), long context 4/4, multilingual 4/4. For these tasks our testing shows parity: both models handle JSON/schema output, function selection/arguments, sticking to source material, routing/classification, retrieval at 30K+ tokens, and non-English output at equivalent levels.

Contextual takeaways: if you need a persona-driven assistant, creative ideation, or tight constrained rewriting, Ministral shows measurable advantages in our tests (notably persona consistency 5 vs 2 and constrained rewriting rank 6 of 53). If safety calibration (refusal/permission behavior) is the gating concern, Devstral is the safer pick in our testing. Both models tie on many practical engineering needs such as structured output and tool calling.
Benchmark                | Devstral Small 1.1 | Ministral 3 14B 2512
Faithfulness             | 4/5 | 4/5
Long Context             | 4/5 | 4/5
Multilingual             | 4/5 | 4/5
Tool Calling             | 4/5 | 4/5
Classification           | 4/5 | 4/5
Agentic Planning         | 2/5 | 3/5
Structured Output        | 4/5 | 4/5
Safety Calibration       | 2/5 | 1/5
Strategic Analysis       | 2/5 | 4/5
Persona Consistency      | 2/5 | 5/5
Constrained Rewriting    | 3/5 | 4/5
Creative Problem Solving | 2/5 | 4/5
Summary                  | 1 win | 5 wins

Pricing Analysis

Prices are quoted per million tokens (MTok). Input/output costs: Devstral Small 1.1 = $0.10 input / $0.30 output per MTok; Ministral 3 14B 2512 = $0.20 input / $0.20 output per MTok. Real-world examples (assumptions noted):

  • Balanced workload (50% input / 50% output): both models cost $0.20 per MTok (Devstral = 0.10 × 0.5 + 0.30 × 0.5 = $0.20; Ministral = 0.20 × 0.5 + 0.20 × 0.5 = $0.20). That yields: 1M tokens = $0.20; 10M = $2.00; 100M = $20.00.
  • Output-heavy workload (10% input / 90% output): Devstral = $0.28/MTok (0.10 × 0.1 + 0.30 × 0.9); Ministral = $0.20/MTok. Costs: 1M tokens = Devstral $0.28 vs Ministral $0.20 (diff $0.08); 10M = $2.80 vs $2.00 (diff $0.80); 100M = $28.00 vs $20.00 (diff $8.00). Who should care: high-volume content generators or apps with large output ratios should prefer Ministral, which is ~29% cheaper on this split (Devstral is ~1.4x the cost); teams with balanced conversational I/O will see no price delta. All numbers are drawn from the model price fields in the payload and assume the stated I/O splits.
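The blended-cost arithmetic above can be sketched in a few lines of Python. This is a minimal illustration: the prices come from the model cards above, while the workload splits and function names are our own assumptions.

```python
# Prices from the comparison cards above, in dollars per million tokens (MTok).
PRICES = {
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
    "Ministral 3 14B 2512": {"input": 0.20, "output": 0.20},
}

def blended_cost_per_mtok(model: str, input_frac: float) -> float:
    """Weighted-average $/MTok for a workload where `input_frac` of tokens are input."""
    p = PRICES[model]
    return p["input"] * input_frac + p["output"] * (1.0 - input_frac)

def workload_cost(model: str, total_tokens: int, input_frac: float) -> float:
    """Total dollar cost for `total_tokens` tokens at the blended rate."""
    return blended_cost_per_mtok(model, input_frac) * total_tokens / 1_000_000

# Balanced 50/50 split: both models land at $0.20/MTok.
print(blended_cost_per_mtok("Devstral Small 1.1", 0.5))   # ~0.20
# Output-heavy 10/90 split over 100M tokens: ~$28 vs ~$20.
print(workload_cost("Devstral Small 1.1", 100_000_000, 0.1))   # ~28
print(workload_cost("Ministral 3 14B 2512", 100_000_000, 0.1)) # ~20
```

Plugging in your own input fraction shows where the costs cross: Devstral's blended rate is 0.30 − 0.20 × input_frac, so the two models are equal at a 50/50 split, and Ministral is strictly cheaper whenever output exceeds half the tokens.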

Real-World Cost Comparison

Task           | Devstral Small 1.1 | Ministral 3 14B 2512
Chat response  | <$0.001 | <$0.001
Blog post      | <$0.001 | <$0.001
Document batch | $0.017  | $0.014
Pipeline run   | $0.170  | $0.140

Bottom Line

Choose Devstral Small 1.1 if: you prioritize safety calibration and strict refusal behavior (safety calibration 2 vs 1 in our tests), or you want a model positioned for software-engineering agents and can accept the smaller context window (131,072 tokens). Choose Ministral 3 14B 2512 if: you need a persona-consistent assistant, stronger creative problem solving, constrained rewriting, or a larger context window and multimodal input (Ministral scores persona consistency 5 vs 2 and creative problem solving 4 vs 2; context window 262,144 tokens; modality text+image -> text). Also pick Ministral if your workload is output-heavy: it costs ~$0.20/MTok vs Devstral's ~$0.28/MTok in output-dominant scenarios.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions