Devstral Small 1.1 vs Mistral Small 3.1 24B

There is no single majority winner: Devstral Small 1.1 is the better pick for agentic tooling, classification, and safety-sensitive pipelines; Mistral Small 3.1 24B is stronger for very long-context retrieval and strategic reasoning. Devstral also delivers a large cost advantage, while Mistral offers multimodal input and top-ranked long-context behavior.


Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net


Mistral Small 3.1 24B

Overall
2.92/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
5/5
Multilingual
4/5
Tool Calling
1/5
Classification
3/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
3/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.350/MTok

Output

$0.560/MTok

Context Window: 128K


Benchmark Analysis

We ran both models through our 12-test suite and got a split outcome. Summary (Devstral vs Mistral, scored on our 1–5 scale):

  • Classification: 4 vs 3 — Devstral wins. Devstral is tied for 1st of 53 models on classification (tied with 29 others), meaning better routing and categorization in pipelines.
  • Tool calling: 4 vs 1 — Devstral wins. Devstral ranks 18 of 54; Mistral ranks 53 of 54 (a no_tool_calling quirk). This matters for function selection and argument accuracy in agentic workflows.
  • Safety calibration: 2 vs 1 — Devstral wins (rank 12 of 55 vs Mistral's rank 32). Devstral refuses more harmful requests and permits more legitimate ones in our tests.
  • Long context: 4 vs 5 — Mistral wins and is tied for 1st of 55 models (tied with 36 others). This indicates Mistral is stronger at retrieval and reasoning over 30K+ token documents.
  • Strategic analysis: 2 vs 3 — Mistral wins (Devstral rank 44 vs Mistral rank 36). Mistral produces comparatively more nuanced tradeoff reasoning in our tests.
  • Agentic planning: 2 vs 3 — Mistral wins (Devstral rank 53 of 54 vs Mistral rank 42). Mistral decomposes goals and recovery paths more effectively in our scenarios.
  • Ties (no clear winner): structured output 4/4 (both rank 26 of 54), constrained rewriting 3/3 (both rank 31), creative problem solving 2/2 (both rank 47), faithfulness 4/4 (both rank 34), persona consistency 2/2 (both rank 51), multilingual 4/4 (both rank 36). These ties show similar behavior on schema adherence, compression, hallucination resistance, persona, and non-English output.

Interpretation: Devstral is the pragmatic choice for AI agents that must call tools, classify inputs, and maintain conservative safety behavior at lower cost. Mistral is preferable for applications that need maximal long-context retrieval and stronger strategic/agentic planning, and it accepts multimodal inputs (text+image to text).
| Benchmark                | Devstral Small 1.1 | Mistral Small 3.1 24B |
|--------------------------|--------------------|-----------------------|
| Faithfulness             | 4/5                | 4/5                   |
| Long Context             | 4/5                | 5/5                   |
| Multilingual             | 4/5                | 4/5                   |
| Tool Calling             | 4/5                | 1/5                   |
| Classification           | 4/5                | 3/5                   |
| Agentic Planning         | 2/5                | 3/5                   |
| Structured Output        | 4/5                | 4/5                   |
| Safety Calibration       | 2/5                | 1/5                   |
| Strategic Analysis       | 2/5                | 3/5                   |
| Persona Consistency      | 2/5                | 2/5                   |
| Constrained Rewriting    | 3/5                | 3/5                   |
| Creative Problem Solving | 2/5                | 2/5                   |
| Summary                  | 3 wins             | 3 wins                |
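The split result in the table can be reproduced by tallying the per-benchmark scores. A minimal sketch, with the values transcribed from the table above:

```python
# Per-benchmark scores (Devstral, Mistral) on the 1-5 scale, from the table.
scores = {
    "Faithfulness": (4, 4), "Long Context": (4, 5), "Multilingual": (4, 4),
    "Tool Calling": (4, 1), "Classification": (4, 3), "Agentic Planning": (2, 3),
    "Structured Output": (4, 4), "Safety Calibration": (2, 1),
    "Strategic Analysis": (2, 3), "Persona Consistency": (2, 2),
    "Constrained Rewriting": (3, 3), "Creative Problem Solving": (2, 2),
}

# Count which model scores strictly higher per benchmark.
devstral_wins = sum(d > m for d, m in scores.values())
mistral_wins = sum(m > d for d, m in scores.values())
ties = sum(d == m for d, m in scores.values())

print(devstral_wins, mistral_wins, ties)  # 3 3 6
```

Three wins each, with half of the twelve benchmarks tied.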

Pricing Analysis

Costs below assume a 50/50 split of input vs output tokens. Devstral Small 1.1: input $0.10/MTok, output $0.30/MTok. At 1M total tokens (0.5M input + 0.5M output), cost = $0.10 × 0.5 + $0.30 × 0.5 = $0.20. At 10M tokens = $2.00; at 100M = $20.00. Mistral Small 3.1 24B: input $0.35/MTok, output $0.56/MTok. At 1M tokens (50/50) = $0.35 × 0.5 + $0.56 × 0.5 = $0.455. At 10M = $4.55; at 100M = $45.50. Savings: Devstral saves about $0.26 per 1M tokens (50/50), $2.55 per 10M, and $25.50 per 100M. High-volume API customers and cost-sensitive production deployments should care most; for small-scale experimentation the quality differences may outweigh cost.
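Under the same 50/50 assumption, the blended-cost math above can be sketched as a small helper. Prices are hard-coded from the model cards; `blended_cost` is an illustrative name, not an API:

```python
# Prices in dollars per million tokens, taken from the cards above.
PRICES = {
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
    "Mistral Small 3.1 24B": {"input": 0.35, "output": 0.56},
}

def blended_cost(model: str, total_tokens: float, input_share: float = 0.5) -> float:
    """Dollar cost for total_tokens, split input_share vs (1 - input_share)."""
    p = PRICES[model]
    millions = total_tokens / 1_000_000
    return millions * (input_share * p["input"] + (1 - input_share) * p["output"])

for volume in (1e6, 10e6, 100e6):
    d = blended_cost("Devstral Small 1.1", volume)
    m = blended_cost("Mistral Small 3.1 24B", volume)
    print(f"{volume / 1e6:>5.0f}M tokens: Devstral ${d:,.2f}, Mistral ${m:,.2f}, savings ${m - d:,.2f}")
```

Changing `input_share` shows how the gap widens for output-heavy workloads, since the output-price ratio (0.30 vs 0.56) is smaller than the input-price ratio (0.10 vs 0.35).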

Real-World Cost Comparison

| Task           | Devstral Small 1.1 | Mistral Small 3.1 24B |
|----------------|--------------------|-----------------------|
| Chat response  | <$0.001            | <$0.001               |
| Blog post      | <$0.001            | $0.0013               |
| Document batch | $0.017             | $0.035                |
| Pipeline run   | $0.170             | $0.350                |

Bottom Line

Choose Devstral Small 1.1 if you need reliable tool calling and function selection, prioritize classification accuracy and stricter safety calibration, or run high-volume API workloads and want lower costs (roughly $0.26 saved per 1M tokens at a 50/50 split). Choose Mistral Small 3.1 24B if you must reason over very long contexts (tied for 1st on long context), want its stronger strategic analysis and agentic planning in our tests, or need multimodal (text+image) input support despite higher per-token costs.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions