Devstral Small 1.1 vs Mistral Small 4

Mistral Small 4 is the stronger general-purpose model, winning 6 of 12 benchmarks in our testing — including agentic planning (4 vs 2), strategic analysis (4 vs 2), and creative problem solving (4 vs 2) — while Devstral Small 1.1 wins only on classification. Devstral Small 1.1 does have one meaningful edge: it costs half as much on output tokens ($0.30/M vs $0.60/M), making it worth considering for high-volume classification pipelines where that single strength matters most. For everything else — reasoning, planning, persona work, multilingual tasks — Mistral Small 4 is the clear choice.

Model Overview

|  | Devstral Small 1.1 | Mistral Small 4 |
|---|---|---|
| Provider | Mistral | Mistral |
| Overall | 3.08/5 (Usable) | 3.83/5 (Strong) |
| SWE-bench Verified | N/A | N/A |
| MATH Level 5 | N/A | N/A |
| AIME 2025 | N/A | N/A |
| Input price | $0.100/MTok | $0.150/MTok |
| Output price | $0.300/MTok | $0.600/MTok |
| Context window | 131K tokens | 262K tokens |

Per-benchmark scores for both models are listed in the comparison table under Benchmark Analysis below.

Benchmark Analysis

Mistral Small 4 wins 6 of 12 benchmarks in our testing; Devstral Small 1.1 wins 1; they tie on 5.

Where Mistral Small 4 wins:

  • Structured output: 5 vs 4 — Mistral Small 4 ties for 1st among 54 models tested; Devstral ranks 26th. For JSON schema compliance and format adherence, the gap matters for production API integrations (see the sketch after this list).
  • Persona consistency: 5 vs 2 — Mistral Small 4 ties for 1st among 53 models; Devstral ranks 51st of 53, nearly last. Devstral is a poor choice for any chatbot or character-maintaining application.
  • Multilingual: 5 vs 4 — Mistral Small 4 ties for 1st among 55 models; Devstral ranks 36th. The field median here is high (p50 = 5), so Devstral's 4 actually trails the median, and the gap is real for non-English tasks.
  • Strategic analysis: 4 vs 2 — Mistral Small 4 ranks 27th of 54; Devstral ranks 44th. Nuanced tradeoff reasoning with real data is substantially weaker in Devstral.
  • Creative problem solving: 4 vs 2 — Mistral Small 4 ranks 9th of 54; Devstral ranks 47th. Nearly bottom-of-field for Devstral on generating non-obvious, feasible ideas.
  • Agentic planning: 4 vs 2 — Mistral Small 4 ranks 16th of 54; Devstral ties for last (53rd of 54, shared with one other model). This is a critical weakness: Devstral should not be used for goal decomposition or autonomous agent workflows.
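
To make the structured-output point concrete, here is a minimal sketch of schema-constrained generation. It assumes the official mistralai Python SDK (v1+); the model ID, schema, and prompt are illustrative, not taken from our test suite.

```python
import json
from mistralai import Mistral  # assumes the official mistralai SDK, v1+

client = Mistral(api_key="YOUR_API_KEY")

# response_format asks the API to constrain the reply to a JSON object.
resp = client.chat.complete(
    model="mistral-small-latest",  # illustrative model ID; check provider docs
    messages=[{
        "role": "user",
        "content": 'Extract {"product": str, "priority": int} from: '
                   "'Ticket: login page down, urgent'. Reply with JSON only.",
    }],
    response_format={"type": "json_object"},
)

# A model that scores 5/5 on structured output should rarely fail this parse.
ticket = json.loads(resp.choices[0].message.content)
print(ticket)
```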

Where Devstral Small 1.1 wins:

  • Classification: 4 vs 2 — Devstral ties for 1st among 53 models; Mistral Small 4 ranks 51st of 53, nearly last. This is a complete reversal — for routing and categorization tasks, Devstral has a decisive advantage (see the routing sketch below).
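
A classification routing pipeline needs only one short label per input, which is why Devstral's edge compounds at volume. A minimal sketch, again assuming the mistralai SDK, with a hypothetical label set and an illustrative model ID:

```python
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")
LABELS = ["billing", "bug_report", "feature_request", "other"]  # hypothetical taxonomy

def route(ticket_text: str) -> str:
    """Classify one support ticket into LABELS with a single short completion."""
    resp = client.chat.complete(
        model="devstral-small-latest",  # illustrative model ID; check provider docs
        messages=[{
            "role": "user",
            "content": f"Classify this ticket as one of {LABELS}. "
                       f"Reply with the label only.\n\n{ticket_text}",
        }],
        max_tokens=8,     # labels are short, so per-call output cost stays tiny
        temperature=0.0,  # deterministic labels make routing reproducible
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in LABELS else "other"  # fall back on unexpected output

print(route("I was charged twice this month"))  # expected: billing
```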

Ties (both models identical):

  • Tool calling: both score 4, both rank 18th of 54 — solid but not elite for function calling and argument accuracy.
  • Faithfulness: both score 4, both rank 34th of 55 — adequate source adherence.
  • Long context: both score 4, both rank 38th of 55 — competent at 30K+ token retrieval but not top-tier.
  • Constrained rewriting: both score 3, both rank 31st of 53 — below the field median (p50 = 4).
  • Safety calibration: both score 2, both rank 12th of 55 (tied with 20 models) — identical, and level with the field median score (p50 = 2).

Note: Neither model has external benchmark scores (SWE-bench Verified, AIME 2025, MATH Level 5) in this dataset, so internal scores are the primary evidence for this comparison. Devstral Small 1.1's description indicates it was fine-tuned for software engineering agents, but no external coding benchmark data is available to quantify that claim here.

| Benchmark | Devstral Small 1.1 | Mistral Small 4 |
|---|---|---|
| Faithfulness | 4/5 | 4/5 |
| Long Context | 4/5 | 4/5 |
| Multilingual | 4/5 | 5/5 |
| Tool Calling | 4/5 | 4/5 |
| Classification | 4/5 | 2/5 |
| Agentic Planning | 2/5 | 4/5 |
| Structured Output | 4/5 | 5/5 |
| Safety Calibration | 2/5 | 2/5 |
| Strategic Analysis | 2/5 | 4/5 |
| Persona Consistency | 2/5 | 5/5 |
| Constrained Rewriting | 3/5 | 3/5 |
| Creative Problem Solving | 2/5 | 4/5 |
| Summary | 1 win | 6 wins |

Pricing Analysis

Devstral Small 1.1 costs $0.10/M input and $0.30/M output. Mistral Small 4 costs $0.15/M input and $0.60/M output — 50% more on input and double on output. At 1M output tokens/month, that's $0.30 vs $0.60 per month: a negligible $3.60/year difference. At 10M output tokens/month, the gap is still only $36/year ($36 vs $72), and at 100M output tokens/month it reaches $360/year ($360 vs $720). Only in the billions of output tokens per month does the saving climb into the thousands of dollars annually ($3,600/year at 1B tokens/month, as the sketch below shows). The cost gap therefore becomes meaningful only at serious scale, and it only justifies choosing Devstral Small 1.1 if your use case maps to its strengths (primarily classification). Developers running classification or routing pipelines at hundreds of millions of output tokens per month have a real reason to care about the price difference. For general workloads where Mistral Small 4 performs substantially better, paying double on output is likely worthwhile.
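
The arithmetic is easy to sanity-check. A short script using only the list prices above (output tokens; input scales the same way with a smaller gap):

```python
# Output prices in $ per million tokens, from the pricing section above.
PRICE_PER_M = {"Devstral Small 1.1": 0.30, "Mistral Small 4": 0.60}

def annual_output_cost(tokens_per_month: float, price_per_m: float) -> float:
    """Annual spend on output tokens at a steady monthly volume."""
    return tokens_per_month / 1_000_000 * price_per_m * 12

for volume in (1e6, 10e6, 100e6, 1e9):
    dev = annual_output_cost(volume, PRICE_PER_M["Devstral Small 1.1"])
    mis = annual_output_cost(volume, PRICE_PER_M["Mistral Small 4"])
    print(f"{volume / 1e6:>6.0f}M tok/month: ${dev:>8,.2f} vs ${mis:>8,.2f} "
          f"-> gap ${mis - dev:,.2f}/year")

# Output:
#      1M tok/month: $    3.60 vs $    7.20 -> gap $3.60/year
#     10M tok/month: $   36.00 vs $   72.00 -> gap $36.00/year
#    100M tok/month: $  360.00 vs $  720.00 -> gap $360.00/year
#   1000M tok/month: $3,600.00 vs $7,200.00 -> gap $3,600.00/year
```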

Real-World Cost Comparison

| Task | Devstral Small 1.1 | Mistral Small 4 |
|---|---|---|
| Chat response | <$0.001 | <$0.001 |
| Blog post | <$0.001 | $0.0013 |
| Document batch | $0.017 | $0.033 |
| Pipeline run | $0.170 | $0.330 |

Bottom Line

Choose Devstral Small 1.1 if your primary workload is classification, routing, or categorization — it scores 4 vs Mistral Small 4's 2 on our classification benchmark and ties for 1st among 53 models tested. At very high volumes (hundreds of millions of output tokens per month), the $0.30/M vs $0.60/M output price difference also starts to add up. Do not use it for agentic workflows (tied for last at 53rd of 54), persona-driven applications (51st of 53), strategic analysis (44th of 54), or creative tasks (47th of 54).

Choose Mistral Small 4 if you need a capable general-purpose model — it wins 6 of 12 benchmarks and handles agentic planning, structured output, persona consistency, multilingual tasks, strategic analysis, and creative problem solving substantially better. It also supports image input (text+image->text modality) and a 262,144-token context window versus Devstral's 131,072 tokens, giving it more flexibility for multimodal and long-document workloads. The 2× output cost premium is justified for most general use cases.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions