Devstral Medium vs Ministral 3 14B 2512

In our testing, Ministral 3 14B 2512 is the practical winner for most production use cases—it wins 5 of 12 benchmarks including persona_consistency and tool_calling and is far cheaper. Devstral Medium outperforms only on agentic_planning (4 vs 3) and may be worth the premium if goal decomposition and recovery are your top priority and you can absorb its much higher token costs.

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K

modelpicker.net

mistral

Ministral 3 14B 2512

Overall
3.75/5 (Strong)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.200/MTok

Output

$0.200/MTok

Context Window: 262K


Benchmark Analysis

Across our 12-test suite, Ministral 3 14B 2512 wins 5 tests, Devstral Medium wins 1, and the remaining 6 tie. Detailed per-test summary (score: Devstral → Ministral, with rank context and task meaning):

  • strategic_analysis: 2 → 4 — Ministral wins. In our testing this reflects better nuanced tradeoff reasoning (Ministral ranks 27 of 54). Expect Ministral to handle numeric tradeoffs and multi-constraint decisions more reliably.
  • constrained_rewriting: 3 → 4 — Ministral wins (rank 6 of 53). This means Ministral is substantially better at strict-length or character-limited rewriting tasks (e.g., ad copy under tight limits).
  • structured_output: 4 → 4 — tie (both rank ~26 of 54). Both models handle JSON/schema compliance similarly in our tests.
  • long_context: 4 → 4 — tie (both rank 38 of 55). Both perform comparably on retrieval/accuracy across 30k+ token contexts.
  • persona_consistency: 3 → 5 — Ministral wins (tied for 1st with 36 others). Ministral is far stronger at maintaining character and resisting injection in our testing.
  • agentic_planning: 4 → 3 — Devstral wins (Devstral rank 16 of 54 vs Ministral rank 42). Devstral is better at goal decomposition and failure recovery scenarios in our agentic planning tests.
  • creative_problem_solving: 2 → 4 — Ministral wins (rank 9 of 54). Ministral produces more non-obvious, feasible ideas in our creative tasks.
  • tool_calling: 3 → 4 — Ministral wins (rank 18 of 54). Ministral selects functions, arguments, and sequencing more accurately in our function-calling tests.
  • faithfulness: 4 → 4 — tie (both rank 34 of 55). Both models stick to source material equally in our tests.
  • classification: 4 → 4 — tie (both tied for 1st with 29 others). Both are accurate for routing and categorization tasks in our suite.
  • safety_calibration: 1 → 1 — tie (both low; rank 32 of 55). Both models show weak refusal/permissiveness calibration on harmful requests in our testing and need guardrails.
  • multilingual: 4 → 4 — tie (both rank 36 of 55). Comparable non-English performance in our tests.

Practical interpretation: Ministral 3 14B 2512 is stronger on persona, tool use, creative problem solving, and constrained rewriting — tasks common in production assistants and coding tools. Devstral Medium is narrowly better at agentic planning. For schema adherence, long-context retrieval, classification, and faithfulness the models perform similarly in our tests.

Benchmark                  Devstral Medium   Ministral 3 14B 2512
Faithfulness               4/5               4/5
Long Context               4/5               4/5
Multilingual               4/5               4/5
Tool Calling               3/5               4/5
Classification             4/5               4/5
Agentic Planning           4/5               3/5
Structured Output          4/5               4/5
Safety Calibration         1/5               1/5
Strategic Analysis         2/5               4/5
Persona Consistency        3/5               5/5
Constrained Rewriting      3/5               4/5
Creative Problem Solving   2/5               4/5
Summary                    1 win             5 wins

Pricing Analysis

Pricing gap: Devstral Medium charges $0.40/MTok for input and $2.00/MTok for output; Ministral 3 14B 2512 charges $0.20/MTok for both, a 10x gap on output pricing (6x blended at a 50/50 split). Example costs assuming a 50/50 input/output split: 1M tokens/month runs ≈ $1.20 on Devstral vs ≈ $0.20 on Ministral; 10M tokens ≈ $12 vs ≈ $2; 100M tokens ≈ $120 vs ≈ $20. If you treat all tokens as outputs (worst case), 1M tokens cost $2.00 on Devstral vs $0.20 on Ministral. Teams at high volume (≥10M tokens/month), startups on tight budgets, and cost-sensitive consumer products should prefer Ministral 3 14B 2512; R&D teams or products where agentic-planning accuracy justifies large spend may consider Devstral Medium despite a 6–10x higher bill depending on token mix.
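To sanity-check a budget against the per-MTok prices on the cards above, a minimal cost-projection sketch (the 50/50 input/output split and the monthly volumes are illustrative assumptions, not part of either provider's pricing):

```python
def monthly_cost(total_tokens, input_price_per_mtok, output_price_per_mtok,
                 input_share=0.5):
    """Return USD cost for total_tokens at the given per-million-token prices.

    input_share is the assumed fraction of tokens billed at the input rate;
    the rest are billed at the output rate.
    """
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Prices from the model cards: Devstral Medium $0.40 in / $2.00 out per MTok,
# Ministral 3 14B 2512 $0.20 in / $0.20 out per MTok.
for volume in (1_000_000, 10_000_000, 100_000_000):
    devstral = monthly_cost(volume, 0.40, 2.00)
    ministral = monthly_cost(volume, 0.20, 0.20)
    print(f"{volume:>11,} tokens/month: Devstral ${devstral:,.2f} "
          f"vs Ministral ${ministral:,.2f}")
```

Shifting `input_share` toward 0 (output-heavy workloads) pushes the gap toward the full 10x output-price ratio; input-heavy workloads narrow it toward 2x.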

Real-World Cost Comparison

Task             Devstral Medium   Ministral 3 14B 2512
Chat response    $0.0011           <$0.001
Blog post        $0.0042           <$0.001
Document batch   $0.108            $0.014
Pipeline run     $1.08             $0.140

Bottom Line

Choose Ministral 3 14B 2512 if you need the best cost-to-performance balance: it wins 5 benchmarks (persona_consistency, tool_calling, creative_problem_solving, constrained_rewriting, strategic_analysis), costs far less ($0.20 vs $2.00 per MTok on output), and is the better default for production assistants, tool-integrated agents, and high-volume deployments. Choose Devstral Medium if your primary need is higher-quality agentic planning (it scores 4 vs 3 in our agentic_planning tests and ranks higher for that task) and you can justify much higher token bills for that specific capability.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions