DeepSeek V3.1 vs Devstral Small 1.1
For quality-first applications (structured outputs, long-context retrieval, faithful summaries), DeepSeek V3.1 is the better pick: it wins 7 of 12 benchmarks in our testing. Devstral Small 1.1 is the pragmatic choice when cost, function calling, and classification matter; it gives up accuracy on creative and persona tasks in exchange for a 2.5x lower output price ($0.30 vs $0.75/MTok).
DeepSeek V3.1 (DeepSeek)
Pricing: input $0.150/MTok, output $0.750/MTok
modelpicker.net
Devstral Small 1.1 (Mistral)
Pricing: input $0.100/MTok, output $0.300/MTok
Benchmark Analysis
We ran both models through our 12-test suite and scored each result from 1 to 5. In the summary below, A is DeepSeek V3.1 and B is Devstral Small 1.1:
- Faithfulness: A 5 vs B 4. DeepSeek wins and is tied for 1st with 32 others out of 55, so it sticks to source material more reliably in our tests.
- Constrained rewriting: A 3 vs B 3. Tie; both rank 31 of 53 (22 models share this score), so neither is especially strong at rewriting under tight length limits.
- Safety calibration: A 1 vs B 2. Devstral wins, ranking 12 of 55 (20 models share that score) versus 32 for DeepSeek, so Devstral more often refuses or permits correctly in our safety scenarios.
- Tool calling: A 3 vs B 4. Devstral wins, ranking 18 of 54 (tied with many) versus 47 of 54 for DeepSeek; in our tool-calling tests Devstral was better at function selection, argument construction, and sequencing.
- Structured output: A 5 vs B 4. DeepSeek wins and is tied for 1st with 24 others out of 54, indicating stronger JSON/schema adherence in our format-compliance tests.
- Agentic planning: A 4 vs B 2. DeepSeek wins (rank 16 of 54 vs Devstral's 53), with better goal decomposition and failure recovery in our tests.
- Multilingual: A 4 vs B 4. Tie; both rank 36 of 55, so non-English quality is equivalent in our suite.
- Classification: A 3 vs B 4. Devstral wins and is tied for 1st with 29 others out of 53, making it the better choice for routing and categorization in our tests.
- Long-context: A 5 vs B 4. DeepSeek wins and is tied for 1st with 36 others out of 55, even though its listed context window is 32K versus Devstral's 131K; in our retrieval tests DeepSeek handled long-context tasks more accurately.
- Persona consistency: A 5 vs B 2. DeepSeek wins and is tied for 1st with 36 others out of 53, showing stronger resistance to prompt injection and character drift in our tests.
- Strategic analysis: A 4 vs B 2. DeepSeek wins (rank 27 of 54) and produced more nuanced tradeoff reasoning with real numbers in our scenarios.
- Creative problem solving: A 5 vs B 2. DeepSeek wins and is tied for 1st with 7 others out of 54, delivering more non-obvious, feasible ideas in our tasks.
Overall, DeepSeek wins 7 categories (structured_output, strategic_analysis, creative_problem_solving, faithfulness, long_context, persona_consistency, agentic_planning). Devstral wins 3 (tool_calling, classification, safety_calibration). Two are ties (constrained_rewriting, multilingual). These differences map to concrete behaviors: choose DeepSeek when you need schema fidelity, deep reasoning, creativity, and persona retention; choose Devstral when you need cheaper inference, stronger classification, and more reliable tool selection.
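The win/tie tally can be checked directly from the twelve scores listed above. The following is a minimal sketch: the scores are copied from the list, while the dictionary layout and the `tally` helper are illustrative names, not part of our test harness.

```python
# Scores copied from the benchmark list: (DeepSeek V3.1, Devstral Small 1.1).
SCORES = {
    "faithfulness": (5, 4),
    "constrained_rewriting": (3, 3),
    "safety_calibration": (1, 2),
    "tool_calling": (3, 4),
    "structured_output": (5, 4),
    "agentic_planning": (4, 2),
    "multilingual": (4, 4),
    "classification": (3, 4),
    "long_context": (5, 4),
    "persona_consistency": (5, 2),
    "strategic_analysis": (4, 2),
    "creative_problem_solving": (5, 2),
}


def tally(scores):
    """Group benchmark names by which model scored higher (hypothetical helper)."""
    result = {"deepseek": [], "devstral": [], "tie": []}
    for name, (deepseek, devstral) in scores.items():
        if deepseek > devstral:
            result["deepseek"].append(name)
        elif devstral > deepseek:
            result["devstral"].append(name)
        else:
            result["tie"].append(name)
    return result


summary = tally(SCORES)
print({winner: len(names) for winner, names in summary.items()})
# → {'deepseek': 7, 'devstral': 3, 'tie': 2}
```

A plain win count treats a one-point edge (faithfulness, 5 vs 4) the same as a three-point one (persona consistency, 5 vs 2), so it is worth reading alongside the per-category margins.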
Pricing Analysis
DeepSeek V3.1 charges $0.15/MTok input and $0.75/MTok output; Devstral Small 1.1 charges $0.10/MTok input and $0.30/MTok output (MTok = 1 million tokens). Example monthly costs:
- Balanced 50/50 input/output at 1M tokens: DeepSeek = $0.45 (input $0.075 + output $0.375); Devstral = $0.20 (input $0.05 + output $0.15). Gap = $0.25/month.
- At 10M tokens (50/50): DeepSeek = $4.50; Devstral = $2.00. Gap = $2.50/month.
- At 100M tokens (50/50): DeepSeek = $45; Devstral = $20. Gap = $25/month.
If usage is output-heavy (e.g., long generated responses), the output-rate difference ($0.75 vs $0.30/MTok) dominates costs; at 1M output-only tokens DeepSeek = $0.75 vs Devstral = $0.30. The gap scales linearly with volume (2.25x on a 50/50 mix, 2.5x on output alone), so it matters most for high-volume production apps, chat services with long replies, and tight budgets; for proof-of-concept work, developer experimentation, and lower-volume services Devstral is still cheaper, but the absolute difference is small.
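To estimate costs for other volumes or input/output mixes, the arithmetic can be sketched as below. This assumes the listed rates are USD per one million tokens (the usual meaning of MTok); `PRICES` and `monthly_cost` are illustrative names, and the sketch ignores caching discounts, batch pricing, or provider-specific tiers.

```python
# Listed rates from the pricing cards, assumed to be USD per 1,000,000 tokens.
PRICES = {
    "DeepSeek V3.1": {"input": 0.15, "output": 0.75},
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
}

TOKENS_PER_MTOK = 1_000_000


def monthly_cost(model, input_tokens, output_tokens):
    """Return the USD cost for the given token counts (hypothetical helper)."""
    rates = PRICES[model]
    return (
        input_tokens * rates["input"] + output_tokens * rates["output"]
    ) / TOKENS_PER_MTOK


# 50/50 split at three monthly volumes.
for total in (1_000_000, 10_000_000, 100_000_000):
    half = total // 2
    deepseek = monthly_cost("DeepSeek V3.1", half, half)
    devstral = monthly_cost("Devstral Small 1.1", half, half)
    print(f"{total:>11,} tokens: ${deepseek:.2f} vs ${devstral:.2f} "
          f"(gap ${deepseek - devstral:.2f})")
```

Changing the split passed to `monthly_cost` shows how quickly the output rate takes over: the more of the traffic is generated text, the closer the ratio moves toward 2.5x.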
Bottom Line
Choose DeepSeek V3.1 if you need high-fidelity outputs, robust long-context retrieval, strict structured output (5/5), creative problem solving (5/5), persona consistency (5/5), and stronger agentic planning (4 vs 2); in our tests it wins 7 of 12 benchmarks. Choose Devstral Small 1.1 if you need lower cost (output $0.30 vs $0.75/MTok), better tool calling (4 vs 3) and classification (4 vs 3), or are shipping at high volume where the 2.5x output-price ratio matters. If your product is cost-sensitive and depends on function calling or labeling, pick Devstral; if quality, faithfulness, and complex reasoning drive business value, pick DeepSeek.
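One way to apply this guidance to a specific workload is to weight the benchmark scores by how much each capability matters to you. The sketch below is only an illustration: the scores come from our results, but the weights, the two example workloads, and the `pick` helper are hypothetical, and price is deliberately left out.

```python
MODELS = ("DeepSeek V3.1", "Devstral Small 1.1")

# Scores from our results, in the same order as MODELS.
SCORES = {
    "faithfulness": (5, 4), "constrained_rewriting": (3, 3),
    "safety_calibration": (1, 2), "tool_calling": (3, 4),
    "structured_output": (5, 4), "agentic_planning": (4, 2),
    "multilingual": (4, 4), "classification": (3, 4),
    "long_context": (5, 4), "persona_consistency": (5, 2),
    "strategic_analysis": (4, 2), "creative_problem_solving": (5, 2),
}


def weighted_scores(weights):
    """Weighted average score per model for the benchmarks named in weights."""
    total = sum(weights.values())
    return tuple(
        sum(SCORES[task][i] * w for task, w in weights.items()) / total
        for i in range(len(MODELS))
    )


def pick(weights):
    """Return the model with the higher weighted score, or 'tie'."""
    deepseek, devstral = weighted_scores(weights)
    if deepseek == devstral:
        return "tie"
    return MODELS[0] if deepseek > devstral else MODELS[1]


# Hypothetical workloads; the weights are made up for illustration.
labeling_bot = {"tool_calling": 2, "classification": 2, "structured_output": 1}
research_assistant = {"faithfulness": 2, "long_context": 2, "strategic_analysis": 1}

print(pick(labeling_bot))        # → Devstral Small 1.1
print(pick(research_assistant))  # → DeepSeek V3.1
```

Because the scores are coarse 1 to 5 judgments, a narrow weighted margin should be treated as a tie and settled on price or on your own evaluation data.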
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.