DeepSeek V3.1 Terminus vs Devstral Medium

DeepSeek V3.1 Terminus is the better pick for long-context apps, structured outputs, and strategic reasoning: it wins 6 of 12 benchmarks, including 5/5 on Long Context and 5/5 on Structured Output. Devstral Medium beats DeepSeek on Classification (4/5) and Faithfulness (4/5) and is the choice when routing accuracy and fidelity matter, but it costs substantially more.

DeepSeek V3.1 Terminus (DeepSeek)

Overall
3.75/5 (Strong)

Benchmark Scores

Faithfulness: 3/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 3/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 5/5
Persona Consistency: 4/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.210/MTok
Output: $0.790/MTok
Context Window: 164K

modelpicker.net

Devstral Medium (Mistral)

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 131K


Benchmark Analysis

Across our 12-test suite, DeepSeek V3.1 Terminus wins 6 benchmarks, Devstral Medium wins 2, and 4 tests tie.

DeepSeek wins:

- Structured Output 5 vs 4 (tied for 1st with 24 others): strong JSON/schema compliance
- Strategic Analysis 5 vs 2 (tied for 1st with 25 others): better at nuanced tradeoff reasoning
- Creative Problem Solving 4 vs 2 (rank 9/54): better at non-obvious but feasible ideas
- Long Context 5 vs 4 (tied for 1st with 36 others): best for retrieval and contexts over 30K tokens
- Persona Consistency 4 vs 3: better at staying in character
- Multilingual 5 vs 4 (tied for 1st with 34 others): stronger non-English parity

Devstral wins:

- Classification 4 vs 3 (tied for 1st with 29 others): best at accurate routing and categorization
- Faithfulness 4 vs 3 (Devstral rank 34/55 vs DeepSeek rank 52/55): measurably better at sticking to source material and avoiding hallucination

Ties:

- Constrained Rewriting 3/3: equal on hard character limits
- Tool Calling 3/3 (both rank 47/54): both moderate at function selection and sequencing
- Safety Calibration 1/1 (both rank 32/55): both score poorly on refusing harmful requests
- Agentic Planning 4/4 (both rank 16/54): both competent at goal decomposition

Practically: choose DeepSeek when your app needs long documents, strict JSON outputs, multilingual coverage, or higher-level strategic reasoning; choose Devstral when you need top-tier classification and better faithfulness for production routing or content fidelity.

| Benchmark | DeepSeek V3.1 Terminus | Devstral Medium |
| --- | --- | --- |
| Faithfulness | 3/5 | 4/5 |
| Long Context | 5/5 | 4/5 |
| Multilingual | 5/5 | 4/5 |
| Tool Calling | 3/5 | 3/5 |
| Classification | 3/5 | 4/5 |
| Agentic Planning | 4/5 | 4/5 |
| Structured Output | 5/5 | 4/5 |
| Safety Calibration | 1/5 | 1/5 |
| Strategic Analysis | 5/5 | 2/5 |
| Persona Consistency | 4/5 | 3/5 |
| Constrained Rewriting | 3/5 | 3/5 |
| Creative Problem Solving | 4/5 | 2/5 |
| Summary | 6 wins | 2 wins |

Pricing Analysis

Per-million-token rates: DeepSeek V3.1 Terminus charges $0.21 (input) and $0.79 (output); Devstral Medium charges $0.40 (input) and $2.00 (output). Assuming a 50% input / 50% output split, the blended cost per 1M tokens is ~$0.50 for DeepSeek vs ~$1.20 for Devstral, so DeepSeek runs at roughly 42% of Devstral's cost. At scale the gap multiplies linearly: 10M tokens/month ≈ $5 (DeepSeek) vs $12 (Devstral); 100M ≈ $50 vs $120. Output-heavy workloads widen the gap further, since the output-rate ratio alone is 0.395 ($0.79 vs $2.00 per MTok). Teams doing high-volume inference (tens to hundreds of millions of tokens per month) should care about this difference; small experiments and low-volume development will feel it far less.
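The blended-cost arithmetic above can be sketched as a small helper. This is an illustrative sketch, not part of any vendor SDK; the rates are the published per-MTok prices and the 50/50 split is the stated assumption:

```python
def blended_cost(total_tokens: int, input_rate: float, output_rate: float,
                 output_frac: float = 0.5) -> float:
    """Dollar cost for `total_tokens` tokens at the given $/MTok rates.

    `output_frac` is the share of tokens that are output
    (0.5 models the 50/50 split assumed in the analysis above).
    """
    input_tokens = total_tokens * (1 - output_frac)
    output_tokens = total_tokens * output_frac
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 10M tokens/month at the published rates:
deepseek = blended_cost(10_000_000, 0.21, 0.79)   # ≈ $5.00
devstral = blended_cost(10_000_000, 0.40, 2.00)   # ≈ $12.00
```

Raising `output_frac` toward 1.0 pushes the cost ratio from ~0.42 down toward the output-rate ratio of 0.395.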

Real-World Cost Comparison

| Task | DeepSeek V3.1 Terminus | Devstral Medium |
| --- | --- | --- |
| Chat response | <$0.001 | $0.0011 |
| Blog post | $0.0017 | $0.0042 |
| Document batch | $0.044 | $0.108 |
| Pipeline run | $0.437 | $1.08 |

Bottom Line

Choose DeepSeek V3.1 Terminus if you need: long-context retrieval or summarization (>30K tokens), reliable JSON/schema outputs, strategic reasoning, multilingual support, and a much lower per-token cost. Choose Devstral Medium if you need: stronger classification and faithfulness (4/5 each in our tests), or you prioritize routing/categorization fidelity in production despite ~2–2.5x higher per-token cost (depending on I/O split). If you need both, consider using Devstral for critical classification/fidelity paths and DeepSeek for heavy long-context or high-volume output generation.
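The hybrid setup described above can be sketched as a simple task router. The model identifiers and task labels here are hypothetical placeholders, not official API names:

```python
# Hypothetical router: fidelity-critical work goes to Devstral Medium,
# everything else (long context, bulk generation) to the cheaper DeepSeek.
FIDELITY_CRITICAL = {"classification", "routing", "faithful_summary"}

def pick_model(task_type: str, context_tokens: int = 0) -> str:
    """Return a (placeholder) model id for the given task."""
    if task_type in FIDELITY_CRITICAL:
        return "devstral-medium"        # 4/5 classification and faithfulness
    # DeepSeek scores 5/5 on long context and is ~42% of Devstral's cost,
    # so it is the default for long documents and high-volume output.
    return "deepseek-v3.1-terminus"
```

In practice the routing condition would come from whatever task metadata your pipeline already tracks; the point is that the split falls cleanly along the benchmark results rather than requiring per-request judgment.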

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions