DeepSeek V3.1 Terminus vs Devstral Small 1.1
DeepSeek V3.1 Terminus is the stronger general-purpose model, winning 7 of 12 benchmarks in our testing, with a decisive advantage on strategic analysis (5 vs 2) and narrower wins on long context, structured output, and multilingual quality (5 vs 4 each). Devstral Small 1.1 punches back on tool calling (4 vs 3), faithfulness (4 vs 3), and classification (4 vs 3), making it a credible choice for agentic software workflows where those capabilities matter most. At $0.79 vs $0.30 per million output tokens, V3.1 Terminus costs 2.6x more, a gap that matters at scale, though Devstral's weak scores on agentic planning (2/5, rank 53 of 54) and creative problem solving (2/5, rank 47 of 54) make it a poor fit outside its coding-agent niche.
Pricing at a Glance
- DeepSeek V3.1 Terminus: $0.21/MTok input, $0.79/MTok output
- Devstral Small 1.1: $0.10/MTok input, $0.30/MTok output
Benchmark Analysis
Across our 12-test suite, DeepSeek V3.1 Terminus wins 7 benchmarks, Devstral Small 1.1 wins 4, and they tie on 1.
Where V3.1 Terminus leads:
- Strategic analysis: 5 vs 2. V3.1 Terminus ties for 1st among 54 models tested; Devstral ranks 44th. For nuanced tradeoff reasoning with real numbers, this is a decisive gap.
- Long context: 5 vs 4. V3.1 Terminus ties for 1st among 55 models; Devstral ranks 38th. At 163,840 tokens of context window vs Devstral's 131,072, V3.1 Terminus also has a structural advantage for document-heavy workflows.
- Structured output: 5 vs 4. V3.1 Terminus ties for 1st among 54 models; Devstral ranks 26th. For JSON schema compliance in production pipelines, this matters.
- Multilingual: 5 vs 4. V3.1 Terminus ties for 1st among 55 models; Devstral ranks 36th. Non-English use cases clearly favor V3.1 Terminus.
- Creative problem solving: 4 vs 2. V3.1 Terminus ranks 9th of 54; Devstral ranks 47th. For generating non-obvious, feasible ideas, Devstral struggles significantly.
- Agentic planning: 4 vs 2. V3.1 Terminus ranks 16th of 54; Devstral ranks 53rd of 54, second from last. Goal decomposition and failure recovery are near-absent in Devstral.
- Persona consistency: 4 vs 2. V3.1 Terminus ranks 38th of 53; Devstral ranks 51st. Neither excels here, but V3.1 Terminus is notably better.
Where Devstral Small 1.1 leads:
- Tool calling: 4 vs 3. Devstral ranks 18th of 54; V3.1 Terminus ranks 47th. For function selection, argument accuracy, and sequencing — the core of agentic code execution — Devstral has a real edge.
- Faithfulness: 4 vs 3. Devstral ranks 34th of 55; V3.1 Terminus ranks 52nd — near the bottom. Devstral is meaningfully more reliable at sticking to source material without hallucinating. This is a significant weakness in V3.1 Terminus.
- Classification: 4 vs 3. Devstral ties for 1st among 53 models; V3.1 Terminus ranks 31st. For routing and categorization tasks, Devstral is the clear choice.
- Safety calibration: 2 vs 1. Devstral ranks 12th of 55; V3.1 Terminus ranks 32nd. Neither model posts a strong absolute score here, but Devstral is better calibrated: less likely to over-refuse benign requests or to comply with harmful ones.
Tie:
- Constrained rewriting: Both score 3/5, ranked 31st of 53. Both sit mid-pack; neither has an edge here.
The pattern is clear: V3.1 Terminus is a broader, more capable general model. Devstral Small 1.1 is a specialized coding-agent model that excels precisely where software engineering agents need it most — tool calling and faithfulness — but underperforms badly on reasoning, planning, and creativity.
Pricing Analysis
DeepSeek V3.1 Terminus costs $0.21/M input tokens and $0.79/M output tokens. Devstral Small 1.1 costs $0.10/M input and $0.30/M output, less than half the input cost and 62% cheaper on output. At 1B output tokens/month, that's $790 vs $300, a $490 difference that most teams won't notice. At 10B tokens/month, you're paying $7,900 vs $3,000, a $4,900 monthly gap that starts to matter for budget-conscious teams. At 100B tokens/month, the territory of high-volume production pipelines, V3.1 Terminus runs $79,000 vs Devstral's $30,000, a $49,000 monthly difference that demands justification. The cost gap is real, but so is the capability gap: V3.1 Terminus scores 5/5 on strategic analysis and 4/5 on agentic planning vs Devstral's 2/5 on both. For general-purpose use, the quality premium is likely worth it until you're well past roughly 10B output tokens/month. For narrow coding-agent workflows where tool calling and faithfulness dominate, Devstral's cost advantage is harder to ignore.
Real-World Cost Comparison
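If you want to plug your own traffic into the numbers above, here is a minimal Python sketch. The per-million-token prices come from the pricing list earlier in this comparison; the example volumes and the monthly_cost helper are purely illustrative, not a billing tool from either provider.

```python
# Minimal sketch: estimate the monthly bill at the list prices quoted above.
# The traffic volumes below are hypothetical; substitute your own.

PRICES = {  # USD per million tokens: (input, output)
    "DeepSeek V3.1 Terminus": (0.21, 0.79),
    "Devstral Small 1.1": (0.10, 0.30),
}

def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Return the monthly cost in USD for the given token volumes."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

if __name__ == "__main__":
    # Hypothetical workload: 30B input tokens and 10B output tokens per month.
    for model in PRICES:
        cost = monthly_cost(model, input_tokens=30e9, output_tokens=10e9)
        print(f"{model}: ${cost:,.0f}/month")
```

At that hypothetical workload the gap is roughly $14,200 vs $6,000 per month; scale the volumes to match your own pipeline before drawing conclusions.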
Bottom Line
Choose DeepSeek V3.1 Terminus if you need a general-purpose model for analysis, long-document processing, multilingual output, or structured data extraction. Its 5/5 scores on strategic analysis, long context, structured output, and multilingual — all tied for 1st in our tests — make it the stronger default across most professional use cases. It also supports a broader parameter set including reasoning, logit_bias, min_p, and top_k, giving developers more control. Be aware that its faithfulness score of 3/5 (rank 52 of 55) means you should build verification steps into any RAG or summarization pipeline.
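Because V3.1 Terminus pairs strong structured output with weak faithfulness, a cheap verification pass after extraction is worth the effort. The sketch below assumes a pipeline that returns JSON: it validates the payload against a schema and then checks that extracted strings actually appear in the source document. The schema, field names, and verify_extraction helper are illustrative examples, not part of either model's API.

```python
import json
from jsonschema import validate  # pip install jsonschema

# Illustrative schema for a contract-extraction task; adapt to your pipeline.
SCHEMA = {
    "type": "object",
    "properties": {
        "party": {"type": "string"},
        "effective_date": {"type": "string"},
        "termination_clause": {"type": "string"},
    },
    "required": ["party", "effective_date"],
}

def verify_extraction(raw_model_output: str, source_text: str) -> dict:
    """Parse, schema-validate, and ground-check a model's JSON extraction."""
    data = json.loads(raw_model_output)      # raises on malformed JSON
    validate(instance=data, schema=SCHEMA)   # raises ValidationError on schema drift
    # Naive grounding check: every extracted string should appear in the source.
    ungrounded = [k for k, v in data.items()
                  if isinstance(v, str) and v not in source_text]
    if ungrounded:
        raise ValueError(f"Possibly hallucinated fields: {ungrounded}")
    return data
```

An exact-substring check is deliberately crude (it will flag reformatted dates, for example), but it catches the worst hallucinations cheaply before anything reaches downstream systems.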
Choose Devstral Small 1.1 if you're building software engineering agents where tool calling (4/5, rank 18 of 54) and faithfulness (4/5, rank 34 of 55) are the primary requirements, and you want to minimize cost at high output volumes. Its 24B parameter architecture and collaboration with All Hands AI make it purpose-built for agentic code workflows. However, its agentic planning score of 2/5 (rank 53 of 54) is a serious concern — it can execute tools but struggles with multi-step goal decomposition. If your agent needs to reason about failure modes and replan, Devstral Small 1.1 will disappoint.
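If you want to sanity-check tool-calling behavior yourself, both models are commonly served behind OpenAI-compatible chat endpoints. The sketch below uses the openai Python client with a single illustrative run_tests tool; the base URL and model identifier are placeholders, not confirmed values, so substitute whatever your provider documents.

```python
from openai import OpenAI  # pip install openai

# Endpoint and model id are placeholders; use your provider's actual values.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the summary.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Test directory"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="devstral-small-1.1",  # placeholder id; use the name your provider lists
    messages=[{"role": "user", "content": "The auth tests are failing; investigate."}],
    tools=tools,
    tool_choice="auto",
)

call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)  # e.g. run_tests {"path": "tests/auth"}
```

Function selection, argument accuracy, and sequencing across many such turns is exactly what our tool-calling benchmark measures, and it is where Devstral's 4/5 shows up in practice.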
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
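Our judging harness isn't reproduced here, but the general LLM-as-judge pattern is simple: send the candidate answer plus a rubric to a judge model, then parse a 1 to 5 integer out of its reply. The sketch below shows only that prompt-and-parse shape with hypothetical names; it is not our exact rubric or scoring code.

```python
import re

RUBRIC = """Score the answer from 1 (unusable) to 5 (excellent) for the task below.
Reply with a line of the form: SCORE: <integer 1-5>, then a short justification."""

def build_judge_prompt(task: str, answer: str) -> str:
    """Assemble the prompt sent to the judge model (hypothetical format)."""
    return f"{RUBRIC}\n\nTask:\n{task}\n\nCandidate answer:\n{answer}"

def parse_score(judge_reply: str) -> int:
    """Extract the 1-5 score from the judge's reply, or raise if it is missing."""
    match = re.search(r"SCORE:\s*([1-5])", judge_reply)
    if not match:
        raise ValueError("Judge reply did not contain a parsable score")
    return int(match.group(1))

# Example: parse_score("SCORE: 4 - correct tool sequence, minor argument error") -> 4
```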