DeepSeek V3.1 vs Devstral Medium
DeepSeek V3.1 is the better pick for most applications: it wins 6 of 12 benchmarks in our testing and excels at long context (5/5), faithfulness (5/5), and structured output (5/5) while being much cheaper. Devstral Medium wins only classification (4/5) and offers a larger 131,072-token context window, but comes at substantially higher cost ($0.40/$2.00 per MTok).
deepseek
DeepSeek V3.1
Pricing: Input $0.150/MTok, Output $0.750/MTok
modelpicker.net
mistral
Devstral Medium
Pricing: Input $0.400/MTok, Output $2.00/MTok
Benchmark Analysis
We ran both models across our 12-test suite; scores are on a 1–5 scale, and rankings are relative to a pool of 52–55 models (pool size varies by test). Test-by-test (score A = DeepSeek V3.1, score B = Devstral Medium):
- faithfulness: A 5 vs B 4 — DeepSeek wins; it ties for 1st with 32 others out of 55, indicating top-tier source fidelity (sticking to the input material).
- constrained_rewriting: A 3 vs B 3 — tie; both rank ~31/53. This suggests similar behavior when compressing within tight character limits.
- safety_calibration: A 1 vs B 1 — tie; both low-ranked (32/55), so neither model is strong at safely refusing harmful prompts in our tests.
- tool_calling: A 3 vs B 3 — tie; both rank 47/54, so function-selection and argument accuracy are middle-to-low compared with the field.
- structured_output: A 5 vs B 4 — DeepSeek wins; A is tied for 1st (with 24 others of 54), meaning much stronger JSON/schema compliance in our tests.
- agentic_planning: A 4 vs B 4 — tie; both rank 16/54, indicating similar goal decomposition and recovery abilities.
- multilingual: A 4 vs B 4 — tie; both rank 36/55, showing comparable non-English quality in our sampling.
- classification: A 3 vs B 4 — Devstral wins; B is tied for 1st with 29 others out of 53, so Devstral is the better model for routing/categorization tasks in our suite.
- long_context: A 5 vs B 4 — DeepSeek wins; A tied for 1st with 36 others (out of 55) despite its 32K window vs Devstral's 131K window. In our tests DeepSeek performed better on retrieval and accuracy at long contexts.
- persona_consistency: A 5 vs B 3 — DeepSeek wins; A tied for 1st with 36 others (out of 53), so it better maintains characters and resists injection in our evaluation.
- strategic_analysis: A 4 vs B 2 — DeepSeek wins; A ranks 27/54, showing stronger nuanced tradeoff reasoning for number-driven decisions.
- creative_problem_solving: A 5 vs B 2 — DeepSeek wins; A tied for 1st with 7 others (out of 54), meaning it consistently produced more non-obvious, feasible ideas in our tests.

Overall: DeepSeek wins 6 tests (structured_output, strategic_analysis, creative_problem_solving, faithfulness, long_context, persona_consistency), Devstral wins 1 test (classification), and 5 tests tie (constrained_rewriting, tool_calling, safety_calibration, agentic_planning, multilingual). Rankings show DeepSeek is top-tier for schema adherence, long-context behavior, and faithfulness; Devstral is strongest for classification in our benchmark set.
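The win/tie tally above can be reproduced mechanically from the per-test scores; a minimal sketch (scores hard-coded from the results listed above):

```python
# Per-test scores (A = DeepSeek V3.1, B = Devstral Medium), from the list above.
scores = {
    "faithfulness": (5, 4),
    "constrained_rewriting": (3, 3),
    "safety_calibration": (1, 1),
    "tool_calling": (3, 3),
    "structured_output": (5, 4),
    "agentic_planning": (4, 4),
    "multilingual": (4, 4),
    "classification": (3, 4),
    "long_context": (5, 4),
    "persona_consistency": (5, 3),
    "strategic_analysis": (4, 2),
    "creative_problem_solving": (5, 2),
}

# Tally wins and ties per model.
a_wins = [t for t, (a, b) in scores.items() if a > b]
b_wins = [t for t, (a, b) in scores.items() if b > a]
ties = [t for t, (a, b) in scores.items() if a == b]

print(len(a_wins), len(b_wins), len(ties))  # 6 1 5
```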
Pricing Analysis
Prices (per MTok, i.e., per million tokens): DeepSeek V3.1 input $0.15, output $0.75; Devstral Medium input $0.40, output $2.00. Assuming a 50/50 input/output token split: for 1B tokens/month (1,000 MTok), DeepSeek costs $450 vs Devstral's $1,200; for 10B tokens, DeepSeek $4,500 vs Devstral $12,000; for 100B tokens, DeepSeek $45,000 vs Devstral $120,000. DeepSeek runs at 37.5% of Devstral's cost (price ratio 0.375) under this split, so high-volume products, cost-sensitive deployments, and SaaS apps should care deeply about the gap. If you only need small-scale experimentation or classification-heavy workloads, the higher Devstral price may still be acceptable; for production throughput or heavy output use, DeepSeek is far more cost-efficient.
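The cost figures above follow from a simple blended-rate formula; a minimal sketch, assuming a 50/50 input/output split (the helper name is ours, not part of either provider's API):

```python
def monthly_cost(total_mtok: float, input_price: float, output_price: float,
                 input_share: float = 0.5) -> float:
    """Dollar cost for a monthly volume in millions of tokens (MTok),
    given per-MTok input/output prices and the share of input tokens."""
    input_mtok = total_mtok * input_share
    output_mtok = total_mtok * (1 - input_share)
    return input_mtok * input_price + output_mtok * output_price

# 1B tokens/month = 1,000 MTok at a 50/50 split:
deepseek = monthly_cost(1000, 0.15, 0.75)   # 450.0
devstral = monthly_cost(1000, 0.40, 2.00)   # 1200.0
print(deepseek, devstral, deepseek / devstral)  # 450.0 1200.0 0.375
```

Adjusting `input_share` matters: output-heavy workloads widen the gap further, since Devstral's output rate is higher by a larger multiple than its input rate.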
Bottom Line
Choose DeepSeek V3.1 if you need reliable long-context retrieval, strict JSON/schema output, high faithfulness, persona consistency, or creative problem solving at much lower cost. Examples: document-retrieval and structured-extraction pipelines, production chatbots that must follow a schema, or high-volume generative workloads. Choose Devstral Medium if your primary need is top-tier classification/routing, you require a very large context window (131,072 tokens), and you can absorb the higher cost. Examples: specialized classifier endpoints, or low-volume experiments that need extreme context length or where classification accuracy is the priority.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.