Claude Sonnet 4.6 vs Devstral Medium
In our testing, Claude Sonnet 4.6 is the winner for most professional and agentic workflows: it wins 9 of 12 benchmarks, including safety calibration, tool calling, and long context. Devstral Medium wins none of the 12 internal benchmarks but is a clear cost-saving choice (Sonnet $3/$15 per MTok input/output vs. Devstral $0.40/$2 per MTok).
Claude Sonnet 4.6 (Anthropic)
Pricing: $3.00/MTok input, $15.00/MTok output
modelpicker.net
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output
Benchmark Analysis
We compare results across our 12-test suite, with each test scored 1–5. Claude Sonnet 4.6 wins 9 tests, Devstral Medium wins 0, and 3 tests tie. Claude Sonnet 4.6's wins (with scores and rank context):
- Safety calibration: 5 — tied for 1st of 55 ("tied for 1st with 4 other models"). This means Sonnet reliably refuses harmful requests and permits legitimate ones in our tests.
- Tool calling: 5 — tied for 1st of 54 ("tied for 1st with 16 other models"). Sonnet selects functions, arguments, and sequencing accurately in our tool-calling scenarios.
- Long context: 5 — tied for 1st of 55 ("tied for 1st with 36 other models"). Retrieval and coherence at 30K+ tokens are top-tier in our tests.
- Agentic planning: 5 — tied for 1st of 54 ("tied for 1st with 14 other models"). Sonnet decomposes goals and plans recoveries better in our agent workflows.
- Faithfulness: 5 — tied for 1st of 55 ("tied for 1st with 32 other models"). Outputs stick to source material with low hallucination in our tests.
- Persona consistency, multilingual, creative problem solving, strategic analysis: all 5s with top ranks (persona consistency tied for 1st of 53; multilingual tied for 1st of 55; creative problem solving tied for 1st of 54; strategic analysis tied for 1st of 54). These indicate strong behavioral consistency, cross-language parity, and high-quality ideation and nuanced tradeoff analysis.

Ties (both models): structured output (both 4, rank 26 of 54), constrained rewriting (both 3, rank 31 of 53), and classification (both 4, tied for 1st of 53). For tasks needing strict JSON/schema compliance or classification, the two models perform equivalently in our suite.

Devstral Medium's scores: its highest marks are 4s in classification, faithfulness, structured output, long context, and agentic planning (classification is tied for 1st), but it scores lower elsewhere; safety calibration at 1 (rank 32 of 55) and creative problem solving at 2 (rank 47 of 54) indicate weaknesses on safety-sensitive or inventive tasks in our tests.

External benchmarks: beyond our internal suite, Claude Sonnet 4.6 scores 75.2% on SWE-bench Verified (ranking 4 of 12 on that external coding benchmark) and 85.8% on AIME 2025 (ranking 10 of 23). Devstral Medium has no external SWE-bench or AIME scores in our data. We present the external numbers as reported by Epoch AI.
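The 9-wins/3-ties count above can be verified with a short tally. This is a minimal sketch using only the per-test scores quoted in this comparison; Devstral's score is not quoted for four of Sonnet's wins, so those are carried as reported outcomes rather than invented numbers:

```python
# Per-test scores (1-5) quoted in this comparison, as (sonnet, devstral) pairs.
pairs = {
    "safety calibration": (5, 1),
    "long context": (5, 4),
    "agentic planning": (5, 4),
    "faithfulness": (5, 4),
    "creative problem solving": (5, 2),
    "structured output": (4, 4),
    "constrained rewriting": (3, 3),
    "classification": (4, 4),
}

# Tests reported above as Sonnet wins, where Devstral's score isn't quoted.
reported_wins_without_scores = [
    "tool calling", "persona consistency", "multilingual", "strategic analysis",
]

wins = sum(a > b for a, b in pairs.values()) + len(reported_wins_without_scores)
ties = sum(a == b for a, b in pairs.values())
print(wins, ties)  # prints "9 3"
```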
Pricing Analysis
Costs are steeply different: Claude Sonnet 4.6 charges $3 input and $15 output per MTok (million tokens); Devstral Medium charges $0.40 input and $2 output per MTok, a 7.5× price ratio on both input and output. Translating to token volumes (the blended figure assumes a 50/50 input/output split):
- 1M tokens: Sonnet = $3 if all input / $15 if all output (50/50 blend = $9); Devstral = $0.40 / $2 (blend = $1.20).
- 10M tokens: Sonnet = $30 / $150 (blend = $90); Devstral = $4 / $20 (blend = $12).
- 100M tokens: Sonnet = $300 / $1,500 (blend = $900); Devstral = $40 / $200 (blend = $120).

Who should care: enterprises or apps running millions of tokens per month (chatbots, coding CI, large-scale inference) will see nearly an order-of-magnitude cost difference; choose Devstral under strict cost limits. Teams that need top-ranked safety, tool calling, long context, and agentic planning should budget for Sonnet despite the higher spend.
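The volume arithmetic above can be reproduced with a small script. This is a minimal sketch, assuming the standard interpretation of MTok as one million tokens and a 50/50 input/output split for the blended figure; the prices are the per-MTok rates quoted in this comparison:

```python
def blended_cost(total_tokens: int, in_price: float, out_price: float,
                 input_share: float = 0.5) -> float:
    """Dollar cost for total_tokens, split between input and output.

    in_price and out_price are dollars per million tokens (MTok).
    """
    mtok = total_tokens / 1_000_000
    return mtok * (input_share * in_price + (1 - input_share) * out_price)

for volume in (1_000_000, 10_000_000, 100_000_000):
    sonnet = blended_cost(volume, 3.00, 15.00)
    devstral = blended_cost(volume, 0.40, 2.00)
    print(f"{volume:>11,} tokens: Sonnet ${sonnet:,.2f}  "
          f"Devstral ${devstral:,.2f}  ratio {sonnet / devstral:.1f}x")
```

At every volume the blended ratio is 7.5×, matching the per-MTok price ratio, because both input and output rates differ by the same factor.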
Bottom Line
Choose Claude Sonnet 4.6 if you need best-in-class safety, tool calling, long-context retrieval, agentic planning, or high-fidelity multilingual and strategic outputs: our testing shows Sonnet wins 9 of 12 benchmarks and posts 75.2% on SWE-bench Verified (Epoch AI). Budget for the higher cost of $3 input / $15 output per MTok. Choose Devstral Medium if your priority is cost at scale and you need competitive classification and structured-output performance at a fraction of the price ($0.40 input / $2 output per MTok), or if you run very high token volumes where the 7.5× price gap dominates the decision.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.