Devstral Small 1.1 vs Mistral Medium 3.1

For most production use cases that prioritize capability (long-context retrieval, multilingual support, agentic planning), Mistral Medium 3.1 wins the majority of our 12 tests. Devstral Small 1.1 is the budget choice: it ties on structured output, tool calling, classification, and safety calibration, but costs substantially less per token.


Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net


Mistral Medium 3.1

Overall
4.25/5 (Strong)

Benchmark Scores

Faithfulness
4/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
5/5
Creative Problem Solving
3/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K


Benchmark Analysis

Overview: In our 12-test suite, Devstral Small 1.1 (A) wins 0 tests, Mistral Medium 3.1 (B) wins 7 tests, and 5 tests tie.

- Strategic analysis: A=2 vs B=5. Mistral wins by 3 points and is tied for 1st on strategic analysis in our rankings (with 25 others), indicating it is measurably better at nuanced tradeoff reasoning.
- Constrained rewriting: A=3 vs B=5. Mistral wins and is tied for 1st in constrained rewriting, so it handles tight compression and hard limits better.
- Creative problem solving: A=2 vs B=3. Mistral wins (rank 30 of 54) for non-obvious, feasible ideas.
- Long context: A=4 vs B=5. Mistral wins and is tied for 1st on long context (with 36 others), meaning it performs best at retrieval and accuracy across 30K+ tokens.
- Persona consistency: A=2 vs B=5. Mistral wins and is tied for 1st; it is more resistant to injection and better at maintaining character.
- Agentic planning: A=2 vs B=5. Mistral wins and is tied for 1st, so it decomposes goals and recovery steps more reliably.
- Multilingual: A=4 vs B=5. Mistral wins and is tied for 1st on multilingual, so for equivalent non-English quality Mistral is the better pick.

Ties (no clear winner): structured output (4 vs 4), tool calling (4 vs 4), faithfulness (4 vs 4), classification (4 vs 4), and safety calibration (2 vs 2). In practical terms, both models perform similarly in our tests on structured JSON output, function selection, faithfulness to source text, classification routing, and safety calibration.

Devstral's strengths vs Mistral: no test shows an outright win for Devstral in our suite, but its description targets software-engineering agents, and it matches Mistral on tool calling and classification, both critical for coding assistants. Both models share a 131,072-token context window in the payload.

Benchmark | Devstral Small 1.1 | Mistral Medium 3.1
Faithfulness | 4/5 | 4/5
Long Context | 4/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 4/5 | 4/5
Classification | 4/5 | 4/5
Agentic Planning | 2/5 | 5/5
Structured Output | 4/5 | 4/5
Safety Calibration | 2/5 | 2/5
Strategic Analysis | 2/5 | 5/5
Persona Consistency | 2/5 | 5/5
Constrained Rewriting | 3/5 | 5/5
Creative Problem Solving | 2/5 | 3/5
Summary | 0 wins | 7 wins

Pricing Analysis

Per the payload, Devstral Small 1.1 costs $0.10 per MTok input and $0.30 per MTok output; Mistral Medium 3.1 costs $0.40 per MTok input and $2.00 per MTok output. Assuming a 50/50 split of input and output tokens, the blended cost per 1M tokens is about $0.20 for Devstral (0.5M × $0.10/MTok + 0.5M × $0.30/MTok) and $1.20 for Mistral Medium (0.5M × $0.40/MTok + 0.5M × $2.00/MTok). Scaling up: at 10M tokens/month expect roughly $2 vs $12; at 100M tokens/month, roughly $20 vs $120. The payload also reports priceRatio = 0.15, indicating Devstral costs a small fraction of Mistral Medium's price in our dataset. Teams with high-volume production traffic or tight budgets should care most about this gap; teams that need the highest capability across long-context and multilingual flows may justify the higher spend on Mistral Medium 3.1.
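The blended-cost arithmetic above can be sketched in a few lines of Python. This is an illustrative calculator, not an official API: the model keys and the PRICES table are our own names, with per-MTok rates taken from the pricing cards above.

```python
# Per-million-token (MTok) prices in USD, from the comparison's pricing cards.
PRICES = {
    "devstral-small-1.1": {"input": 0.10, "output": 0.30},
    "mistral-medium-3.1": {"input": 0.40, "output": 2.00},
}

def blended_cost(model: str, total_tokens: int, input_share: float = 0.5) -> float:
    """USD cost for total_tokens, split input_share input / rest output."""
    p = PRICES[model]
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    # Rates are per 1M tokens, so divide by 1_000_000.
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 1M tokens at a 50/50 split:
print(blended_cost("devstral-small-1.1", 1_000_000))   # ≈ 0.20 (USD)
print(blended_cost("mistral-medium-3.1", 1_000_000))   # ≈ 1.20 (USD)
```

At a 50/50 split the Devstral/Mistral cost ratio works out to about 1/6; shifting the split toward output tokens moves it toward the payload's 0.15 output-price ratio.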

Real-World Cost Comparison

Task | Devstral Small 1.1 | Mistral Medium 3.1
Chat response | <$0.001 | $0.0011
Blog post | <$0.001 | $0.0042
Document batch | $0.017 | $0.108
Pipeline run | $0.170 | $1.08

Bottom Line

Choose Devstral Small 1.1 if: you need a cost-efficient model for high-volume deployments, want parity on structured output, tool calling, and classification at a much lower price ($0.10 input / $0.30 output per MTok), or are building a software-engineering-focused agent (Devstral's description targets SE agents). Choose Mistral Medium 3.1 if: you require stronger multilingual performance, robust long-context retrieval, agentic planning, persona consistency, or constrained rewriting (5/5 vs 2–3/5 in these tests) and can justify the higher operating cost ($0.40 input / $2.00 output per MTok).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions