Devstral Small 1.1 vs Mistral Small 3.1 24B
There is no single majority winner: Devstral Small 1.1 is the better pick for agentic tooling, classification, and safety-sensitive pipelines; Mistral Small 3.1 24B is stronger for very long-context retrieval and strategic reasoning. Devstral also delivers a large cost advantage, while Mistral offers multimodal input and top-ranked long-context behavior.
Devstral Small 1.1 pricing: $0.100/MTok input, $0.300/MTok output
Mistral Small 3.1 24B pricing: $0.350/MTok input, $0.560/MTok output
Benchmark Analysis
We ran both models across our 12-test suite and the outcome is split. Summary (Devstral vs Mistral, scored on our 1–5 scale):
- Classification: 4 vs 3 — Devstral wins. Devstral is tied for 1st of 53 models on classification (tied with 29 others), meaning better routing and categorization in pipelines.
- Tool_calling: 4 vs 1 — Devstral wins. Devstral ranks 18 of 54; Mistral ranks 53 of 54, flagged with a no_tool_calling quirk in our suite. This matters for function selection and argument accuracy in agentic workflows.
- Safety_calibration: 2 vs 1 — Devstral wins (rank 12 of 55 vs Mistral rank 32). Devstral refuses more harmful requests and permits more legitimate ones in our tests.
- Long_context: 4 vs 5 — Mistral wins and is tied for 1st of 55 models (tied with 36 others). This indicates Mistral is stronger at retrieval and reasoning over 30K+ token documents.
- Strategic_analysis: 2 vs 3 — Mistral wins (Devstral rank 44 vs Mistral rank 36). Mistral produces comparatively better nuanced tradeoff reasoning in our tests.
- Agentic_planning: 2 vs 3 — Mistral wins (Devstral rank 53 of 54 vs Mistral rank 42). Mistral decomposes goals and recovery paths more effectively in our scenarios.
- Ties (no clear winner): structured_output 4/4 (both rank 26 of 54), constrained_rewriting 3/3 (both rank 31), creative_problem_solving 2/2 (both rank 47), faithfulness 4/4 (both rank 34), persona_consistency 2/2 (both rank 51), multilingual 4/4 (both rank 36). These ties show similar behavior on schema adherence, compression, hallucination resistance, persona, and non-English output.

Interpretation: Devstral is the pragmatic choice for AI agents that must call tools, classify inputs, and maintain conservative safety behavior at lower cost. Mistral is preferable for applications that need maximal long-context retrieval and stronger strategic/agentic planning, and it accepts multimodal input (text+image -> text).
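As a sketch, the per-task scores above could drive a simple task-based model router. The score table is taken from this comparison; the model identifiers, the `pick_model` helper, and the fallback policy are illustrative assumptions, not an API from either vendor:

```python
# Per-task scores from the benchmark summary above (1-5 scale).
SCORES = {
    "classification":     {"devstral-small-1.1": 4, "mistral-small-3.1-24b": 3},
    "tool_calling":       {"devstral-small-1.1": 4, "mistral-small-3.1-24b": 1},
    "safety_calibration": {"devstral-small-1.1": 2, "mistral-small-3.1-24b": 1},
    "long_context":       {"devstral-small-1.1": 4, "mistral-small-3.1-24b": 5},
    "strategic_analysis": {"devstral-small-1.1": 2, "mistral-small-3.1-24b": 3},
    "agentic_planning":   {"devstral-small-1.1": 2, "mistral-small-3.1-24b": 3},
}

def pick_model(task: str, default: str = "devstral-small-1.1") -> str:
    """Route a task to the higher-scoring model; fall back to the
    cheaper model (hypothetical default) for untested task types."""
    scores = SCORES.get(task)
    if not scores:
        return default
    return max(scores, key=scores.get)
```

On ties `max` keeps the first entry, which here favors the cheaper Devstral, matching the cost argument made in this comparison.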
Pricing Analysis
Costs below assume a 50/50 split of input vs output tokens. Devstral Small 1.1: input $0.10/MTok, output $0.30/MTok. At 1B total tokens (0.5B input + 0.5B output, i.e. 500 MTok each) cost = $0.10 × 500 + $0.30 × 500 = $200. At 10B = $2,000. At 100B = $20,000. Mistral Small 3.1 24B: input $0.35/MTok, output $0.56/MTok. At 1B (50/50) = $0.35 × 500 + $0.56 × 500 = $455. At 10B = $4,550. At 100B = $45,500. Savings: Devstral saves $255 per 1B tokens (50/50), $2,550 per 10B, and $25,500 per 100B. High-volume API customers and cost-sensitive production deployments should care most; for small-scale experimentation the quality differences may outweigh cost.
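The blended-cost arithmetic can be checked with a small helper. The function name and the 50/50 default split are assumptions for illustration; the per-MTok prices come from the listings above:

```python
def blended_cost_usd(total_tokens: float,
                     input_per_mtok: float,
                     output_per_mtok: float,
                     input_frac: float = 0.5) -> float:
    """Cost in USD for a workload, given per-million-token prices
    and the fraction of tokens that are input."""
    mtok = total_tokens / 1e6
    return mtok * (input_frac * input_per_mtok
                   + (1 - input_frac) * output_per_mtok)

DEVSTRAL = (0.10, 0.30)  # $/MTok: input, output
MISTRAL = (0.35, 0.56)

devstral = blended_cost_usd(1e9, *DEVSTRAL)  # ~$200 per 1B tokens
mistral = blended_cost_usd(1e9, *MISTRAL)    # ~$455 per 1B tokens
savings = mistral - devstral                 # ~$255 per 1B tokens
```

Raising `input_frac` narrows the gap slightly, since the two models' input prices differ less in absolute terms than their output prices; the ratio stays heavily in Devstral's favor either way.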
Bottom Line
Choose Devstral Small 1.1 if you: need reliable tool calling and function selection, prioritize classification accuracy and stricter safety calibration, or run high-volume API workloads and want lower costs (saves ~$255 per 1B tokens at a 50/50 split). Choose Mistral Small 3.1 24B if you: must reason over very long contexts (tied for 1st on long_context), require better strategic analysis or agentic planning in our tests, or need multimodal (text+image) input support despite higher per-token costs.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.