Devstral Small 1.1 vs Ministral 3 8B 2512

Ministral 3 8B 2512 is the stronger general-purpose choice, winning 5 benchmarks — strategic analysis, constrained rewriting, creative problem solving, persona consistency, and agentic planning — and tying the other 6, with Devstral Small 1.1 winning only on safety calibration. Devstral Small 1.1 is purpose-built for software engineering agents and carries a lower input cost ($0.10/Mtok vs $0.15/Mtok), but its output cost is double ($0.30/Mtok vs $0.15/Mtok) and its benchmark profile outside of structured tasks is weak. At roughly 2x the output cost, Devstral Small 1.1 asks you to pay more for a narrower capability set — justified only if you're running a dedicated coding agent pipeline.

Mistral

Devstral Small 1.1

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok
Context Window: 131K

modelpicker.net

Mistral

Ministral 3 8B 2512

Overall: 3.67/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 3/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 3/5
Persona Consistency: 5/5
Constrained Rewriting: 5/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.150/MTok
Output: $0.150/MTok
Context Window: 262K


Benchmark Analysis

Across our 12-test suite, Ministral 3 8B 2512 wins 5 benchmarks, Devstral Small 1.1 wins 1, and the two tie on 6.

Where Ministral 3 8B 2512 leads:

  • Persona consistency: 5 vs 2 — Ministral 3 8B 2512 ties for 1st among 53 models tested; Devstral Small 1.1 ranks 51st of 53. This is a decisive gap: Devstral Small 1.1 is among the worst models tested at maintaining character and resisting prompt injection.
  • Constrained rewriting: 5 vs 3 — Ministral 3 8B 2512 ties for 1st among 53 models; Devstral Small 1.1 ranks 31st. For compression tasks with hard character limits — think microcopy, SMS, push notifications — Ministral 3 8B 2512 is substantially stronger.
  • Creative problem solving: 3 vs 2 — Ministral 3 8B 2512 ranks 30th of 54; Devstral Small 1.1 ranks 47th. Neither model excels here relative to the field, but Devstral Small 1.1 sits in the bottom tier.
  • Agentic planning: 3 vs 2 — Ministral 3 8B 2512 ranks 42nd of 54; Devstral Small 1.1 ties for last (53rd of 54) with one other model. This is a significant finding: a model marketed for agentic software engineering performs at the floor of our goal decomposition and failure recovery test.
  • Strategic analysis: 3 vs 2 — Ministral 3 8B 2512 ranks 36th of 54; Devstral Small 1.1 ranks 44th. For nuanced tradeoff reasoning with real numbers, Ministral 3 8B 2512 holds a clear edge.

Where Devstral Small 1.1 leads:

  • Safety calibration: 2 vs 1 — Devstral Small 1.1 ranks 12th of 55 (tied with 19 others); Ministral 3 8B 2512 ranks 32nd of 55. This is Devstral Small 1.1's only outright win. Still, a score of 2 is at the median of our dataset (p50 = 2), so neither model is strong here in absolute terms.

Ties (6 benchmarks): Both models score identically on classification (4, tied for 1st among 53), tool calling (4, rank 18 of 54), structured output (4, rank 26 of 54), faithfulness (4, rank 34 of 55), long context (4, rank 38 of 55), and multilingual (4, rank 36 of 55). These are meaningful shared capabilities — both handle JSON schema compliance, function calling, source fidelity, and retrieval at 30K+ tokens at comparable levels.

One structural advantage for Ministral 3 8B 2512 not captured in our 1-5 scores: it supports image inputs (text+image->text modality) and a 262,144-token context window vs Devstral Small 1.1's text-only modality and 131,072-token context. It also supports logprobs and top_logprobs parameters, which Devstral Small 1.1 does not.

| Benchmark | Devstral Small 1.1 | Ministral 3 8B 2512 |
| --- | --- | --- |
| Faithfulness | 4/5 | 4/5 |
| Long Context | 4/5 | 4/5 |
| Multilingual | 4/5 | 4/5 |
| Tool Calling | 4/5 | 4/5 |
| Classification | 4/5 | 4/5 |
| Agentic Planning | 2/5 | 3/5 |
| Structured Output | 4/5 | 4/5 |
| Safety Calibration | 2/5 | 1/5 |
| Strategic Analysis | 2/5 | 3/5 |
| Persona Consistency | 2/5 | 5/5 |
| Constrained Rewriting | 3/5 | 5/5 |
| Creative Problem Solving | 2/5 | 3/5 |
| Summary | 1 win | 5 wins |

Pricing Analysis

Devstral Small 1.1 costs $0.10/Mtok input and $0.30/Mtok output. Ministral 3 8B 2512 costs a flat $0.15/Mtok on both input and output. The gap matters at scale: on output tokens, Devstral Small 1.1 costs exactly twice as much. At 1M output tokens/month that's $0.30 vs $0.15, negligible. At 10M output tokens it's $3.00 vs $1.50, and at 100M output tokens/month you're paying $30 vs $15 — a $15/month saving with Ministral 3 8B 2512 that compounds fast in high-volume production. For input-heavy workloads (long documents, large context retrievals), Devstral Small 1.1's $0.10/Mtok input is cheaper than Ministral 3 8B 2512's $0.15/Mtok, which can offset the output gap for read-heavy tasks. Developers running balanced I/O workloads at scale will almost always spend less with Ministral 3 8B 2512.
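These figures can be reproduced with a quick sketch (the rates are the ones quoted on this page; the token volumes are illustrative, not measurements):

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 input_rate: float, output_rate: float) -> float:
    """Monthly spend in dollars, with volumes given in millions of tokens."""
    return input_mtok * input_rate + output_mtok * output_rate

# Rates quoted above, in $/Mtok.
DEVSTRAL = dict(input_rate=0.10, output_rate=0.30)
MINISTRAL = dict(input_rate=0.15, output_rate=0.15)

# 100M output tokens/month, ignoring input:
print(f"${monthly_cost(0, 100, **DEVSTRAL):.2f}")   # $30.00
print(f"${monthly_cost(0, 100, **MINISTRAL):.2f}")  # $15.00
```

Plugging in your own monthly input and output volumes makes the crossover for any specific workload immediately visible.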

Real-World Cost Comparison

| Task | Devstral Small 1.1 | Ministral 3 8B 2512 |
| --- | --- | --- |
| Chat response | <$0.001 | <$0.001 |
| Blog post | <$0.001 | <$0.001 |
| Document batch | $0.017 | $0.010 |
| Pipeline run | $0.170 | $0.105 |

Bottom Line

Choose Devstral Small 1.1 if you are building a dedicated software engineering agent pipeline that prioritizes structured code output and JSON compliance, your codebases fit within its 131K-token context, and your team can compensate for its weak agentic planning score (last of 54 models in our testing) with orchestration scaffolding. Also consider it if your workload is input-heavy enough that its $0.10/Mtok input rate offsets its $0.30/Mtok output rate.
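As a rule of thumb under the published rates, "input-heavy enough" has a precise break-even: 0.10*I + 0.30*O < 0.15*I + 0.15*O simplifies to I > 3*O, so Devstral Small 1.1 only comes out cheaper overall when input volume exceeds three times output volume. A minimal sketch of that check (prices from this page; example volumes are illustrative):

```python
def cheaper_model(input_mtok: int, output_mtok: int) -> str:
    """Which model costs less for a given token mix (volumes in Mtok)."""
    # Prices in cents per Mtok, kept as integers to avoid float rounding.
    devstral = 10 * input_mtok + 30 * output_mtok
    ministral = 15 * (input_mtok + output_mtok)
    if devstral < ministral:
        return "Devstral Small 1.1"
    if devstral > ministral:
        return "Ministral 3 8B 2512"
    return "tie"

print(cheaper_model(4, 1))  # Devstral Small 1.1  (input > 3x output)
print(cheaper_model(3, 1))  # tie                 (exactly 3:1)
print(cheaper_model(2, 1))  # Ministral 3 8B 2512
```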

Choose Ministral 3 8B 2512 if you need a general-purpose model that handles a broad range of tasks — especially persona-consistent chatbots, constrained copywriting, strategic reasoning, or any workflow involving image inputs. Its 262K context window doubles Devstral Small 1.1's capacity for long-document workloads. At $0.15/Mtok on both input and output, it's also cheaper on output-intensive tasks, making it the more economical choice for most balanced production workloads.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions