DeepSeek V3.1 vs Devstral Small 1.1

For quality-first applications (structured outputs, long-context retrieval, faithful summaries), DeepSeek V3.1 is the better pick—it wins 7 of 12 benchmarks in our testing. Devstral Small 1.1 is the pragmatic choice when cost and function-calling/classification matter, trading lower accuracy on creative and persona tasks for ~2.5x lower output price.

Provider: DeepSeek

DeepSeek V3.1

Overall
3.92/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
4/5
Tool Calling
3/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.150/MTok

Output

$0.750/MTok

Context Window: 33K

modelpicker.net

Provider: Mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K


Benchmark Analysis

We ran both models through our 12-test suite, scoring each test from 1–5. Summary (A = DeepSeek, B = Devstral):

  • Faithfulness: A 5 vs B 4 — DeepSeek wins and is tied for 1st with 32 others out of 55 on faithfulness, meaning it sticks to source material more reliably in our tests.
  • Constrained rewriting: A 3 vs B 3 — tie; both rank 31 of 53 (22 models share this score), so neither is especially strong at tight compression limits.
  • Safety calibration: A 1 vs B 2 — Devstral wins; Devstral ranks 12 of 55 (20 models share that score) versus DeepSeek rank 32, so Devstral is more likely to refuse or permit correctly in our safety scenarios.
  • Tool calling: A 3 vs B 4 — Devstral wins and ranks 18 of 54 (tied with many), while DeepSeek ranks 47 of 54; in practice Devstral is better at function selection, arguments and sequencing in our tool-calling tests.
  • Structured output: A 5 vs B 4 — DeepSeek wins and is tied for 1st with 24 others out of 54, indicating superior JSON/schema adherence in our format-compliance tests.
  • Agentic planning: A 4 vs B 2 — DeepSeek wins (rank 16 of 54 vs Devstral rank 53), so goal decomposition and recovery behaved better in our tests for DeepSeek.
  • Multilingual: A 4 vs B 4 — tie; both rank similarly (DeepSeek rank 36/55, Devstral rank 36/55), so non-English parity is equivalent in our suite.
  • Classification: A 3 vs B 4 — Devstral wins and is tied for 1st with 29 others out of 53, making it better for routing and categorization in our tests.
  • Long-context: A 5 vs B 4 — DeepSeek wins and is tied for 1st with 36 others out of 55, despite its 33K context window versus Devstral's 131K; in our retrieval/accuracy tests DeepSeek handled long-context tasks more accurately.
  • Persona consistency: A 5 vs B 2 — DeepSeek wins and is tied for 1st with 36 others out of 53, showing stronger resistance to injection and character drift in our tests.
  • Strategic analysis: A 4 vs B 2 — DeepSeek wins (rank 27/54) and produced better nuanced tradeoff reasoning with real numbers in our scenarios.
  • Creative problem solving: A 5 vs B 2 — DeepSeek wins and is tied for 1st with 7 others out of 54, delivering more non-obvious, feasible ideas in our tasks.

Overall, DeepSeek wins 7 categories (structured output, strategic analysis, creative problem solving, faithfulness, long context, persona consistency, agentic planning); Devstral wins 3 (tool calling, classification, safety calibration); two are ties (constrained rewriting, multilingual). These differences map to concrete behaviors: choose DeepSeek when you need schema fidelity, deep reasoning, creativity, and persona retention; choose Devstral when you need cheaper inference, stronger classification, and more reliable tool selection.
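To make the structured-output score concrete: our format-compliance tests reward replies that are bare, parseable JSON matching the requested schema. The check below is a minimal, hypothetical illustration of that idea (function name and logic are ours, not the actual test harness): a reply passes only if it parses as a JSON object containing every required key.

```python
import json

def check_structured_output(reply: str, required_keys: set) -> bool:
    """Crude format-compliance check: the reply must be valid JSON
    (no prose wrapper) and contain every required key. A real schema
    validator would also check types and nesting."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

# A compliant reply, and one wrapped in chat prose (a common failure mode).
good = '{"name": "widget", "price": 9.99}'
bad = 'Sure! Here is the JSON: {"name": "widget", "price": 9.99}'
print(check_structured_output(good, {"name", "price"}))  # True
print(check_structured_output(bad, {"name", "price"}))   # False
```

Models that score 5/5 here emit the `good` shape consistently; lower scores usually come from prose wrappers, markdown fences, or missing keys.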
| Benchmark | DeepSeek V3.1 | Devstral Small 1.1 |
| --- | --- | --- |
| Faithfulness | 5/5 | 4/5 |
| Long Context | 5/5 | 4/5 |
| Multilingual | 4/5 | 4/5 |
| Tool Calling | 3/5 | 4/5 |
| Classification | 3/5 | 4/5 |
| Agentic Planning | 4/5 | 2/5 |
| Structured Output | 5/5 | 4/5 |
| Safety Calibration | 1/5 | 2/5 |
| Strategic Analysis | 4/5 | 2/5 |
| Persona Consistency | 5/5 | 2/5 |
| Constrained Rewriting | 3/5 | 3/5 |
| Creative Problem Solving | 5/5 | 2/5 |
| Summary | 7 wins | 3 wins |

Pricing Analysis

At list prices, DeepSeek V3.1 charges $0.15/MTok input and $0.75/MTok output; Devstral Small 1.1 charges $0.10/MTok input and $0.30/MTok output (MTok = 1 million tokens). Example monthly costs:

  • Balanced 50/50 input/output at 1B tokens: DeepSeek = $450 (input $75 + output $375); Devstral = $200 (input $50 + output $150). Gap = $250/month.
  • At 10B tokens (50/50): DeepSeek = $4,500; Devstral = $2,000. Gap = $2,500/month.
  • At 100B tokens (50/50): DeepSeek = $45,000; Devstral = $20,000. Gap = $25,000/month.

If usage is output-heavy (e.g., long generated responses), the output-rate difference ($0.75 vs $0.30/MTok) dominates costs: at 1B output-only tokens, DeepSeek = $750 vs Devstral = $300. Teams running high-volume production apps, chat services with long replies, or tight budgets should care about this gap; proof-of-concept work, developer experimentation, and lower-volume services will still find Devstral materially cheaper.
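These projections are easy to script and adapt to your own traffic mix; a minimal sketch, assuming the rates above are per million tokens (the `monthly_cost` helper and volumes are illustrative):

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_per_mtok: float, output_per_mtok: float) -> float:
    """Dollar cost for a month's traffic, with rates in $ per million tokens."""
    return (input_tokens / 1_000_000) * input_per_mtok \
         + (output_tokens / 1_000_000) * output_per_mtok

# (input $/MTok, output $/MTok) from the pricing cards.
DEEPSEEK = (0.15, 0.75)
DEVSTRAL = (0.10, 0.30)

# A balanced 1B-token month: 500M tokens in, 500M tokens out.
half = 500_000_000
print(monthly_cost(half, half, *DEEPSEEK))  # 450.0
print(monthly_cost(half, half, *DEVSTRAL))  # 200.0
```

Swapping in your real input/output split matters: an output-heavy workload widens the gap toward the full 2.5x output-price ratio, while an input-heavy one narrows it toward 1.5x.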

Real-World Cost Comparison

| Task | DeepSeek V3.1 | Devstral Small 1.1 |
| --- | --- | --- |
| Chat response | <$0.001 | <$0.001 |
| Blog post | $0.0016 | <$0.001 |
| Document batch | $0.041 | $0.017 |
| Pipeline run | $0.405 | $0.170 |

Bottom Line

Choose DeepSeek V3.1 if you need high-fidelity outputs, robust long-context retrieval, strict structured output (5/5), creative problem solving (5/5), persona consistency (5/5), and stronger agentic planning; in our tests it wins 7 of 12 benchmarks. Choose Devstral Small 1.1 if you need lower cost ($0.30 vs $0.75/MTok output), better tool calling (4/5 vs 3/5) and classification (4/5 vs 3/5), or are shipping high-volume production traffic where the 2.5x output-price ratio matters. If your product is cost-sensitive and depends on function calling or labeling, pick Devstral; if quality, faithfulness, and complex reasoning drive business value, pick DeepSeek.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
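The overall ratings shown on the cards (3.92/5 and 3.08/5) are consistent with a plain, unweighted mean of the twelve 1–5 benchmark scores; a minimal sketch, assuming that aggregation:

```python
# Per-benchmark 1-5 scores, in card order (faithfulness ... creative).
deepseek = [5, 5, 4, 3, 3, 4, 5, 1, 4, 5, 3, 5]
devstral = [4, 4, 4, 4, 4, 2, 4, 2, 2, 2, 3, 2]

def overall(scores: list) -> float:
    """Overall rating as the unweighted mean of the 12 scores, to 2 dp."""
    return round(sum(scores) / len(scores), 2)

print(overall(deepseek))  # 3.92
print(overall(devstral))  # 3.08
```

An unweighted mean treats every benchmark equally, so a 1/5 on safety calibration costs exactly as much as a 1/5 anywhere else; if one capability dominates your workload, reweight accordingly before comparing.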

Frequently Asked Questions