Devstral Medium vs Devstral Small 1.1

In our testing, Devstral Small 1.1 is the practical default: it ties Devstral Medium on 8 of 12 benchmarks and wins tool calling (4 vs 3) and safety calibration (2 vs 1) while costing far less. Devstral Medium wins agentic planning (4 vs 2) and persona consistency (3 vs 2), so choose it when agentic reasoning or stronger persona maintenance justifies the price premium.

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 131K tokens

modelpicker.net

mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok
Context Window: 131K tokens

Benchmark Analysis

Summary of our 12-test comparison (scores are our internal 1–5 scores; ranks are each model's position among all models we have evaluated on that benchmark):

  • Devstral Medium wins (2): persona consistency 3 vs 2 (Medium rank 45/53, Small rank 51/53) — in our testing Medium keeps character and resists injection better; agentic planning 4 vs 2 (Medium rank 16/54, Small rank 53/54) — Medium is notably better at goal decomposition and recovery in agentic flows.
  • Devstral Small 1.1 wins (2): tool calling 4 vs 3 (Small rank 18/54, Medium rank 47/54) — Small selects functions and arguments more accurately in our tests; safety calibration 2 vs 1 (Small rank 12/55, Medium rank 32/55) — Small refused harmful requests more reliably while permitting legitimate ones.
  • Ties (8): structured output 4/4 (both rank 26/54); strategic analysis 2/2 (both rank 44/54); constrained rewriting 3/3 (both rank 31/53); creative problem solving 2/2 (both rank 47/54); faithfulness 4/4 (both rank 34/55); classification 4/4 (tied for 1st with 29 others); long context 4/4 (both rank 38/55); multilingual 4/4 (both rank 36/55). On these tied tasks, expect similar real-world behaviour: JSON schema adherence, long-context retrieval, classification accuracy, and multilingual output are comparable between the two models in our tests.

In short, Devstral Medium's advantages are concentrated in agentic planning and persona maintenance, while Devstral Small 1.1's advantages are in tool selection and safety. On most tasks they match, so cost becomes the primary differentiator for high-volume use.

Benchmark                  Devstral Medium   Devstral Small 1.1
Faithfulness               4/5               4/5
Long Context               4/5               4/5
Multilingual               4/5               4/5
Tool Calling               3/5               4/5
Classification             4/5               4/5
Agentic Planning           4/5               2/5
Structured Output          4/5               4/5
Safety Calibration         1/5               2/5
Strategic Analysis         2/5               2/5
Persona Consistency        3/5               2/5
Constrained Rewriting      3/5               3/5
Creative Problem Solving   2/5               2/5
Summary                    2 wins            2 wins
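
The win/tie tally above can be reproduced from the raw scores. A minimal sketch (score values taken from the table; the data structure and function names are our own):

```python
# Internal 1-5 benchmark scores as (Devstral Medium, Devstral Small 1.1) pairs.
SCORES = {
    "faithfulness": (4, 4),
    "long_context": (4, 4),
    "multilingual": (4, 4),
    "tool_calling": (3, 4),
    "classification": (4, 4),
    "agentic_planning": (4, 2),
    "structured_output": (4, 4),
    "safety_calibration": (1, 2),
    "strategic_analysis": (2, 2),
    "persona_consistency": (3, 2),
    "constrained_rewriting": (3, 3),
    "creative_problem_solving": (2, 2),
}

def tally(scores):
    """Count benchmarks won by each model, plus ties."""
    medium_wins = sum(1 for m, s in scores.values() if m > s)
    small_wins = sum(1 for m, s in scores.values() if m < s)
    ties = sum(1 for m, s in scores.values() if m == s)
    return medium_wins, small_wins, ties

print(tally(SCORES))  # -> (2, 2, 8)
```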

Pricing Analysis

Pricing is quoted per MTok (one million tokens). Assuming a 50/50 input/output split, blended costs are: Devstral Medium (input $0.40, output $2.00 per MTok): 1M tokens ≈ $1.20; 10M ≈ $12; 100M ≈ $120. Devstral Small 1.1 (input $0.10, output $0.30 per MTok): 1M tokens ≈ $0.20; 10M ≈ $2; 100M ≈ $20. The output price ratio is 6.67× ($2.00 / $0.30), and the blended ratio at a 50/50 split is 6×. Who should care: high-volume API customers, SaaS products, and startups will see meaningful savings with Devstral Small 1.1 at scale; teams that run many agentic workflows or need stronger persona consistency may accept Devstral Medium's higher cost for its wins.
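
The blended figures above follow from simple arithmetic. A sketch (rates from the pricing cards; the 50/50 split and function name are our assumptions):

```python
def blended_cost(total_tokens, input_per_mtok, output_per_mtok, input_share=0.5):
    """Blended API cost in dollars for a token volume split between input and output."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * input_per_mtok + output_tokens * output_per_mtok) / 1_000_000

# Devstral Medium: $0.40 input / $2.00 output per MTok
print(blended_cost(1_000_000, 0.40, 2.00))  # -> 1.2
# Devstral Small 1.1: $0.10 input / $0.30 output per MTok
print(blended_cost(1_000_000, 0.10, 0.30))  # -> 0.2
```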

Real-World Cost Comparison

Task             Devstral Medium   Devstral Small 1.1
Chat response    $0.0011           <$0.001
Blog post        $0.0042           <$0.001
Document batch   $0.108            $0.017
Pipeline run     $1.08             $0.170
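
Per-task costs like these come from multiplying token counts by the per-MTok rates. A sketch with illustrative token counts (our assumption, not the exact workloads behind the table):

```python
PRICES = {  # $ per million tokens, from the pricing cards above
    "Devstral Medium": (0.40, 2.00),
    "Devstral Small 1.1": (0.10, 0.30),
}

def task_cost(model, input_tokens, output_tokens):
    """Dollar cost of one task at the model's (input, output) per-MTok rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Hypothetical chat response: ~100 tokens of prompt, ~500 tokens of reply.
print(f"{task_cost('Devstral Medium', 100, 500):.4f}")     # -> 0.0010
print(f"{task_cost('Devstral Small 1.1', 100, 500):.4f}")  # -> 0.0002
```

Output tokens dominate at these rates, which is why the gap between the two models widens on generation-heavy tasks.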

Bottom Line

Choose Devstral Medium if: you need stronger agentic planning (4 vs 2) or better persona consistency (3 vs 2) for agent frameworks, autonomous chains, or situations where goal decomposition and persona fidelity are mission-critical, and you can absorb the higher cost. Choose Devstral Small 1.1 if: you want similar performance on classification, long-context, structured-output, and multilingual tasks at a fraction of the cost (1M tokens at a 50/50 split: ≈ $0.20 vs $1.20), or if tool calling and safety calibration (scores of 4 and 2, vs Medium's 3 and 1) are your priorities.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions