DeepSeek V3.1 vs Devstral Medium

DeepSeek V3.1 is the better pick for most applications: it wins 6 of 12 benchmarks in our testing and excels at long context (5/5), faithfulness (5/5), and structured output (5/5) while being much cheaper. Devstral Medium wins only classification (4/5) and offers a larger 131,072-token context window, but comes at substantially higher cost ($0.40/$2.00 per million tokens).

Provider: DeepSeek

DeepSeek V3.1

Overall: 3.92/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 4/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 5/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.150/MTok
Output: $0.750/MTok

Context Window: 33K


Provider: Mistral

Devstral Medium

Overall: 3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok

Context Window: 131K


Benchmark Analysis

We ran both models across our 12-test suite; scores are on a 1–5 scale and rankings reference our 52–55 model pool. Test-by-test (score A = DeepSeek V3.1, score B = Devstral Medium):

  • faithfulness: A 5 vs B 4 — DeepSeek wins; ranks tied for 1st with 32 others out of 55, indicating top-tier source fidelity (sticking to input material).
  • constrained_rewriting: A 3 vs B 3 — tie; both rank ~31/53. This suggests similar behavior when compressing within tight character limits.
  • safety_calibration: A 1 vs B 1 — tie; both low-ranked (32/55), so neither model is strong at safely refusing harmful prompts in our tests.
  • tool_calling: A 3 vs B 3 — tie; both rank 47/54, so function-selection and argument accuracy are middle-to-low compared with the field.
  • structured_output: A 5 vs B 4 — DeepSeek wins; A is tied for 1st (with 24 others of 54), meaning much stronger JSON/schema compliance in our tests (see the validation sketch after this list).
  • agentic_planning: A 4 vs B 4 — tie; both rank 16/54, indicating similar goal decomposition and recovery abilities.
  • multilingual: A 4 vs B 4 — tie; both rank 36/55, showing comparable non-English quality in our sampling.
  • classification: A 3 vs B 4 — Devstral wins; B is tied for 1st with 29 others out of 53, so Devstral is the better model for routing/categorization tasks in our suite.
  • long_context: A 5 vs B 4 — DeepSeek wins; A tied for 1st with 36 others (out of 55) despite its 33K window vs Devstral's 131K window, meaning DeepSeek performed better on retrieval/accuracy at long contexts in our tests.
  • persona_consistency: A 5 vs B 3 — DeepSeek wins; A tied for 1st with 36 others (out of 53), so it better maintains characters and resists injection in our evaluation.
  • strategic_analysis: A 4 vs B 2 — DeepSeek wins; A ranks 27/54, showing stronger nuanced tradeoff reasoning for number-driven decisions.
  • creative_problem_solving: A 5 vs B 2 — DeepSeek wins; A tied for 1st with 7 others (out of 54), meaning it consistently produced more non-obvious, feasible ideas in our tests.

Overall: DeepSeek wins 6 tests (structured_output, strategic_analysis, creative_problem_solving, faithfulness, long_context, persona_consistency), Devstral wins 1 test (classification), and 5 tests tie (constrained_rewriting, tool_calling, safety_calibration, agentic_planning, multilingual). Rankings show DeepSeek is top-tier for schema adherence, long-context behavior, and faithfulness; Devstral is strongest for classification in our benchmark set.
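
Our exact harness isn't reproduced here, but as a rough illustration of what a schema-compliance check like the structured_output test involves, here is a minimal Python sketch. The schema and helper names are hypothetical; it assumes the jsonschema package is installed.

```python
# Minimal sketch of a structured-output check: parse the model's reply as
# JSON, then validate it against the expected schema. The schema below is
# a hypothetical example, not a schema from our actual test suite.
import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["name", "priority"],
    "additionalProperties": False,
}

def check_structured_output(model_reply: str) -> bool:
    """Return True if the reply is valid JSON that conforms to SCHEMA."""
    try:
        payload = json.loads(model_reply)
        validate(instance=payload, schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(check_structured_output('{"name": "triage", "priority": 2}'))  # True
print(check_structured_output('{"name": "triage"}'))  # False: missing field
```

A 5/5 score in this category means the model passes checks of this kind consistently, including on nested schemas and under distracting instructions.
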
| Benchmark | DeepSeek V3.1 | Devstral Medium |
| --- | --- | --- |
| Faithfulness | 5/5 | 4/5 |
| Long Context | 5/5 | 4/5 |
| Multilingual | 4/5 | 4/5 |
| Tool Calling | 3/5 | 3/5 |
| Classification | 3/5 | 4/5 |
| Agentic Planning | 4/5 | 4/5 |
| Structured Output | 5/5 | 4/5 |
| Safety Calibration | 1/5 | 1/5 |
| Strategic Analysis | 4/5 | 2/5 |
| Persona Consistency | 5/5 | 3/5 |
| Constrained Rewriting | 3/5 | 3/5 |
| Creative Problem Solving | 5/5 | 2/5 |
| Summary | 6 wins | 1 win |

Pricing Analysis

Prices per million tokens: DeepSeek V3.1 input $0.15, output $0.75; Devstral Medium input $0.40, output $2.00. Assuming a 50/50 input/output token split: 1M tokens/month costs $0.45 on DeepSeek vs $1.20 on Devstral; 10M tokens, $4.50 vs $12.00; 100M tokens, $45.00 vs $120.00. DeepSeek runs at 37.5% of Devstral's cost under this split (a price ratio of 0.375), so high-volume products, cost-sensitive deployments, and SaaS apps should care about the gap. For small-scale experimentation or classification-heavy workloads, Devstral's higher price may still be acceptable; for production throughput or output-heavy use, DeepSeek is far more cost-efficient.
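
To make this arithmetic easy to reproduce, here is a small Python sketch of the cost calculation. The 50/50 input/output split is the same assumption as above, and the prices are hard-coded from the pricing cards.

```python
# Monthly cost under an assumed 50/50 input/output token split.
# Prices are USD per million tokens, taken from the pricing cards above.
PRICES = {
    "DeepSeek V3.1":   {"input": 0.15, "output": 0.75},
    "Devstral Medium": {"input": 0.40, "output": 2.00},
}

def monthly_cost(model: str, total_tokens: float, input_share: float = 0.5) -> float:
    """Cost in USD for total_tokens split between input and output."""
    p = PRICES[model]
    input_tok = total_tokens * input_share
    output_tok = total_tokens - input_tok
    return (input_tok * p["input"] + output_tok * p["output"]) / 1_000_000

for volume in (1e6, 10e6, 100e6):
    a = monthly_cost("DeepSeek V3.1", volume)
    b = monthly_cost("Devstral Medium", volume)
    print(f"{volume / 1e6:>5.0f}M tokens: ${a:,.2f} vs ${b:,.2f} (ratio {a / b:.3f})")
# Prints $0.45 vs $1.20 at 1M tokens, scaling linearly, with ratio 0.375 throughout.
```

Because both price sheets are flat per-token rates, the 0.375 ratio holds at every volume; only the absolute gap grows.
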

Real-World Cost Comparison

| Task | DeepSeek V3.1 | Devstral Medium |
| --- | --- | --- |
| Chat response | <$0.001 | $0.0011 |
| Blog post | $0.0016 | $0.0042 |
| Document batch | $0.041 | $0.108 |
| Pipeline run | $0.405 | $1.08 |

Bottom Line

Choose DeepSeek V3.1 if you need reliable long-context retrieval, strict JSON/schema output, high faithfulness, persona consistency, or creative problem solving at much lower cost. Examples: document retrieval and structured-extraction pipelines, production chatbots that must follow a schema, or high-volume generative workloads. Choose Devstral Medium if your primary need is top-tier classification/routing, you require a very large context window (131,072 tokens), and you can absorb the higher cost. Examples: specialized classifier endpoints, or low-volume experiments that need extreme context length and where classification accuracy is the priority.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
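
As a purely illustrative sketch (not our actual harness), an LLM-judge scoring step can be as simple as the function below. Here call_judge_model is a hypothetical stand-in for whichever judge API is used, and the rubric text is invented for illustration.

```python
# Illustrative LLM-as-judge scoring step. call_judge_model is a hypothetical
# placeholder, NOT a real API; plug in your own judge model's client here.
import re

RUBRIC = (
    "You are grading a model response on a 1-5 scale.\n"
    "Task: {task}\nResponse: {response}\n"
    "Reply with a single integer from 1 to 5."
)

def call_judge_model(prompt: str) -> str:
    raise NotImplementedError("plug in your judge model's API here")

def score_response(task: str, response: str) -> int:
    """Ask the judge model for a 1-5 score and parse the first digit found."""
    reply = call_judge_model(RUBRIC.format(task=task, response=response))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())
```
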

Frequently Asked Questions