Devstral Small 1.1 vs Mistral Medium 3.1
For production use cases that prioritize capability, such as long-context retrieval, multilingual support, and agentic planning, Mistral Medium 3.1 wins 7 of our 12 tests. Devstral Small 1.1 is the budget choice: it ties on structured output, tool calling, faithfulness, classification, and safety calibration, but costs a fraction of the price per token.
Pricing at a glance
Devstral Small 1.1 (Mistral): input $0.10/MTok, output $0.30/MTok
Mistral Medium 3.1 (Mistral): input $0.40/MTok, output $2.00/MTok
Benchmark Analysis
Overview: in our 12-test suite, Devstral Small 1.1 (A) wins 0 tests, Mistral Medium 3.1 (B) wins 7, and 5 tests tie.

Where Mistral Medium wins:
- Strategic analysis: A=2 vs B=5. Mistral wins by 3 points and ranks tied for 1st in our strategic-analysis rankings (tied with 25 others), indicating it is measurably better at nuanced tradeoff reasoning.
- Constrained rewriting: A=3 vs B=5. Mistral wins and is tied for 1st, so it handles tight compression and hard limits better.
- Creative problem solving: A=2 vs B=3. Mistral wins (rank 30 of 54) at producing non-obvious, feasible ideas.
- Long context: A=4 vs B=5. Mistral wins and is tied for 1st (tied with 36 others), meaning it performs best at retrieval and accuracy across 30K+ tokens.
- Persona consistency: A=2 vs B=5. Mistral wins and is tied for 1st; it is more resistant to injection and better at maintaining character.
- Agentic planning: A=2 vs B=5. Mistral wins and is tied for 1st, so it decomposes goals and recovery steps more reliably.
- Multilingual: A=4 vs B=5. Mistral wins and is tied for 1st, so equivalent non-English quality favors Mistral.

Ties (no clear winner): structured output (4 vs 4), tool calling (4 vs 4), faithfulness (4 vs 4), classification (4 vs 4), and safety calibration (2 vs 2). In practice, both models perform similarly in our tests on structured JSON output, function selection, faithfulness to source text, classification routing, and safety calibration.

Devstral's strengths vs Mistral: none of the tests show a pure win for Devstral in our suite, but its description targets software-engineering agents, and it reaches parity on tool calling and classification, which are critical for coding assistants. Both models share a 131,072-token context window in the payload.
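For readers who want to reproduce the overview counts, here is a minimal sketch in Python that tallies wins and ties from the per-test scores listed above. The scores are copied from this comparison; the dictionary layout is our own illustration, not modelpicker.net's payload format.

# Tally per-test wins and ties from the 1-5 judge scores listed above.
scores = {
    "strategic analysis": (2, 5),
    "constrained rewriting": (3, 5),
    "creative problem solving": (2, 3),
    "long context": (4, 5),
    "persona consistency": (2, 5),
    "agentic planning": (2, 5),
    "multilingual": (4, 5),
    "structured output": (4, 4),
    "tool calling": (4, 4),
    "faithfulness": (4, 4),
    "classification": (4, 4),
    "safety calibration": (2, 2),
}

a_wins = sum(a > b for a, b in scores.values())
b_wins = sum(b > a for a, b in scores.values())
ties = sum(a == b for a, b in scores.values())
print(f"Devstral wins: {a_wins}, Mistral Medium wins: {b_wins}, ties: {ties}")
# -> Devstral wins: 0, Mistral Medium wins: 7, ties: 5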
Pricing Analysis
Per the payload, Devstral Small 1.1 costs $0.10 per MTok input and $0.30 per MTok output; Mistral Medium 3.1 costs $0.40 per MTok input and $2.00 per MTok output. Assuming a 50/50 split between input and output tokens, the blended cost per 1M tokens is about $0.20 for Devstral ($0.10 × 0.5 + $0.30 × 0.5) and $1.20 for Mistral Medium ($0.40 × 0.5 + $2.00 × 0.5). Scaling up: at 10M tokens/month expect roughly $2 vs $12; at 100M tokens/month, roughly $20 vs $120. The payload also reports priceRatio = 0.15, indicating Devstral costs a small fraction of Mistral Medium's price in our dataset. Teams with high-volume production traffic or tight budgets should care most about this gap; teams that need the highest capability across long contexts and multilingual flows may justify the higher spend on Mistral Medium 3.1.
Real-World Cost Comparison
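The sketch below, in Python, works through the blended-cost arithmetic described in the Pricing Analysis. It assumes a 50/50 split between input and output tokens; the prices are the per-MTok figures listed above, and the monthly volumes are illustrative, not measured traffic.

# Blended cost per model, assuming a configurable input/output token split.
PRICES = {
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},   # $/MTok
    "Mistral Medium 3.1": {"input": 0.40, "output": 2.00},   # $/MTok
}

def blended_cost(model: str, total_mtok: float, input_share: float = 0.5) -> float:
    """Cost in dollars for total_mtok million tokens at the given input share."""
    p = PRICES[model]
    return total_mtok * (input_share * p["input"] + (1 - input_share) * p["output"])

for volume in (1, 10, 100):  # million tokens per month
    for model in PRICES:
        print(f"{model}: {volume}M tokens/month -> ${blended_cost(model, volume):,.2f}")
# 1M:   Devstral $0.20  vs Mistral Medium $1.20
# 10M:  Devstral $2.00  vs Mistral Medium $12.00
# 100M: Devstral $20.00 vs Mistral Medium $120.00

Adjusting input_share lets you model prompt-heavy or generation-heavy workloads, where the gap shifts toward the input or output price respectively.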
Bottom Line
Choose Devstral Small 1.1 if you need a cost-efficient model for high-volume deployments, want parity on structured output, tool calling, and classification at a much lower price (input $0.10 / output $0.30 per MTok), or are building a software-engineering-focused agent (Devstral's description targets SE agents).
Choose Mistral Medium 3.1 if you require stronger multilingual performance, robust long-context retrieval, agentic planning, persona consistency, or constrained rewriting (it scores 5 on each of these tests, vs 2-4 for Devstral) and can justify the higher operating cost (input $0.40 / output $2.00 per MTok).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.