Devstral Small 1.1 vs Ministral 3 3B 2512

For most production use cases, pick Ministral 3 3B 2512: it wins 5 of 12 benchmarks in our testing and costs $0.100 vs $0.300 per output MTok. Choose Devstral Small 1.1 only if safety calibration is your primary requirement, where it scores higher, and you accept the roughly 3x output cost.

Devstral Small 1.1 (Mistral)

Overall: 3.08/5 (Usable)

External benchmarks: SWE-bench Verified N/A; MATH Level 5 N/A; AIME 2025 N/A

Pricing: $0.100/MTok input, $0.300/MTok output

Context window: 131K

modelpicker.net

Ministral 3 3B 2512 (Mistral)

Overall: 3.58/5 (Strong)

External benchmarks: SWE-bench Verified N/A; MATH Level 5 N/A; AIME 2025 N/A

Pricing: $0.100/MTok input, $0.100/MTok output

Context window: 131K

Benchmark Analysis

We ran a 12-test suite and compared per-task scores and ranked positions; all statements below are from our testing. In the notes that follow, A = Devstral Small 1.1 and B = Ministral 3 3B 2512.

Ministral 3 3B 2512 wins five tests: faithfulness (B 5 vs A 4; B is tied for 1st with 32 others out of 55 models tested), constrained rewriting (B 5 vs A 3; B tied for 1st with 4 others), creative problem solving (B 3 vs A 2; B ranks 30 of 54), persona consistency (B 4 vs A 2; B ranks 38 of 53 while Devstral ranks 51 of 53), and agentic planning (B 3 vs A 2; B ranks 42 of 54 vs Devstral's 53 of 54). Devstral Small 1.1 wins one: safety calibration (A 2 vs B 1; Devstral ranks 12 of 55, with many tied). Six tests are ties: structured output (4/4; both rank 26 of 54), strategic analysis (2/2; both rank 44 of 54), tool calling (4/4; both rank 18 of 54), classification (4/4; both tied for 1st with 29 others), long context (4/4; both rank 38 of 55), and multilingual (4/4; both rank 36 of 55).

Practically: Ministral's higher faithfulness and constrained-rewriting scores mean better adherence to source material and tighter compression into strict character limits, and its persona-consistency and agentic-planning advantages translate to fewer role-injection failures and stronger goal decomposition in our tests. Devstral's safety-calibration edge means it more often refuses harmful requests appropriately in our evaluation, which matters in regulated or high-risk deployments. For tool workflows, classification, long-context retrieval, and structured-output work, the two models are effectively tied in our testing.

| Benchmark | Devstral Small 1.1 | Ministral 3 3B 2512 |
| --- | --- | --- |
| Faithfulness | 4/5 | 5/5 |
| Long Context | 4/5 | 4/5 |
| Multilingual | 4/5 | 4/5 |
| Tool Calling | 4/5 | 4/5 |
| Classification | 4/5 | 4/5 |
| Agentic Planning | 2/5 | 3/5 |
| Structured Output | 4/5 | 4/5 |
| Safety Calibration | 2/5 | 1/5 |
| Strategic Analysis | 2/5 | 2/5 |
| Persona Consistency | 2/5 | 4/5 |
| Constrained Rewriting | 3/5 | 5/5 |
| Creative Problem Solving | 2/5 | 3/5 |
| Summary | 1 win | 5 wins |
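The win/tie tally can be reproduced mechanically from the per-benchmark scores; a minimal Python sketch (score values are copied from the table above; the variable names are our own):

```python
# Per-benchmark scores out of 5, as (Devstral Small 1.1, Ministral 3 3B 2512).
scores = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 4),
    "Multilingual": (4, 4),
    "Tool Calling": (4, 4),
    "Classification": (4, 4),
    "Agentic Planning": (2, 3),
    "Structured Output": (4, 4),
    "Safety Calibration": (2, 1),
    "Strategic Analysis": (2, 2),
    "Persona Consistency": (2, 4),
    "Constrained Rewriting": (3, 5),
    "Creative Problem Solving": (2, 3),
}

# Count wins for each model and the ties.
devstral_wins = sum(1 for a, b in scores.values() if a > b)
ministral_wins = sum(1 for a, b in scores.values() if b > a)
ties = sum(1 for a, b in scores.values() if a == b)

print(devstral_wins, ministral_wins, ties)  # → 1 5 6
```

Running this yields 1 win for Devstral, 5 for Ministral, and 6 ties, matching the summary row.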

Pricing Analysis

Both models charge $0.100 per input MTok (million tokens). Output cost differs: Ministral 3 3B 2512 is $0.100/MTok while Devstral Small 1.1 is $0.300/MTok, a roughly 3x ratio. At 1M tokens/month with a 50/50 input/output split, Ministral costs about $0.10 (0.5 MTok × $0.100 input + 0.5 MTok × $0.100 output) and Devstral about $0.20, a $0.10 difference. At 10M tokens with the same split, Ministral is roughly $1 vs Devstral's $2; at 100M tokens, roughly $10 vs $20. If your workload is output-heavy (e.g., 80% output), the gap widens: at 1M tokens, about $0.10 vs $0.26. High-volume text generation, chat, and multi-tenant APIs will notice this delta at scale; small-scale experiments will barely feel it.
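A quick way to check these figures against your own traffic mix; a minimal sketch using the listed per-MTok prices (the function and variable names are our own, not part of any API):

```python
# Prices in dollars per million tokens (MTok), from the pricing section above.
PRICES = {
    "Devstral Small 1.1": {"input": 0.100, "output": 0.300},
    "Ministral 3 3B 2512": {"input": 0.100, "output": 0.100},
}

def monthly_cost(model: str, total_tokens: int, output_share: float = 0.5) -> float:
    """Estimated monthly cost in dollars for a token volume and output share."""
    p = PRICES[model]
    mtok = total_tokens / 1_000_000
    return mtok * ((1 - output_share) * p["input"] + output_share * p["output"])

# 10M tokens/month with a 50/50 input/output split:
print(monthly_cost("Ministral 3 3B 2512", 10_000_000))  # ≈ 1.0
print(monthly_cost("Devstral Small 1.1", 10_000_000))   # ≈ 2.0
```

Adjusting `output_share` to 0.8 reproduces the output-heavy case: about $0.10 vs $0.26 at 1M tokens.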

Real-World Cost Comparison

| Task | Devstral Small 1.1 | Ministral 3 3B 2512 |
| --- | --- | --- |
| Chat response | <$0.001 | <$0.001 |
| Blog post | <$0.001 | <$0.001 |
| Document batch | $0.017 | $0.007 |
| Pipeline run | $0.170 | $0.070 |

Bottom Line

Choose Ministral 3 3B 2512 if you need the lowest-cost production model with stronger faithfulness, constrained rewriting, persona consistency, and planning in our testing, and if you want multimodal input (text + image to text). Choose Devstral Small 1.1 if safety calibration is a top priority and you are willing to pay roughly 3x for output tokens ($0.300 vs $0.100 per output MTok). If you depend primarily on tool calling, classification, long-context retrieval, or structured output, either model performs similarly in our benchmarks.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions