Claude Sonnet 4.6 vs Devstral Medium

In our testing, Claude Sonnet 4.6 is the winner for most professional and agentic workflows: it wins 9 of 12 benchmarks, including safety calibration, tool calling, and long context. Devstral Medium wins none of the 12 internal benchmarks but is a clear cost-saving choice (Sonnet $3/$15 per MTok input/output vs. Devstral $0.40/$2 per MTok).

Anthropic

Claude Sonnet 4.6

Overall
4.67/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
75.2%
MATH Level 5
N/A
AIME 2025
85.8%

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window: 1,000K

modelpicker.net

Mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K


Benchmark Analysis

We compare results from our 12-test suite (each test scored 1–5). Claude Sonnet 4.6 wins 9 tests, Devstral Medium wins none, and 3 tests are ties. Wins for Claude Sonnet 4.6, with scores and rank context:

  • Safety calibration: 5/5, tied for 1st of 55 models (with 4 others). Sonnet reliably refuses harmful requests and permits legitimate ones in our tests.
  • Tool calling: 5/5, tied for 1st of 54 (with 16 others). Sonnet selects functions, arguments, and call sequencing accurately in our tool-calling scenarios.
  • Long context: 5/5, tied for 1st of 55 (with 36 others). Retrieval and coherence at 30K+ tokens are top-tier in our tests.
  • Agentic planning: 5/5, tied for 1st of 54 (with 14 others). Sonnet decomposes goals and plans recoveries better in our agent workflows.
  • Faithfulness: 5/5, tied for 1st of 55 (with 32 others). Outputs stick to source material with little hallucination.
  • Persona consistency, multilingual, creative problem solving, strategic analysis: all 5/5 with top ranks (persona consistency tied for 1st of 53; multilingual tied for 1st of 55; creative problem solving tied for 1st of 54; strategic analysis tied for 1st of 54). These indicate strong behavioral consistency, multi-language parity, and high-quality ideation and tradeoff analysis.

Ties (both models): structured output (both 4/5, rank 26/54), constrained rewriting (both 3/5, rank 31/53), and classification (both 4/5, tied for 1st of 53). For tasks needing strict JSON/schema compliance or classification, the two models perform equivalently in our suite.

Devstral Medium's highest marks are 4/5 in classification, faithfulness, structured output, long context, and agentic planning (classification is tied for 1st), but it scores lower elsewhere: safety calibration 1/5 (rank 32/55) and creative problem solving 2/5 (rank 47/54) indicate weaknesses on safety-sensitive or inventive tasks in our tests.

External benchmarks: Beyond our internal suite, Claude Sonnet 4.6 scores 75.2% on SWE-bench Verified, ranking 4 of 12 on that external coding benchmark, and 85.8% on AIME 2025, ranking 10 of 23. Devstral Medium has no external SWE-bench or AIME scores available. We present the external numbers as reported by Epoch AI.
Benchmark                  Claude Sonnet 4.6   Devstral Medium
Faithfulness               5/5                 4/5
Long Context               5/5                 4/5
Multilingual               5/5                 4/5
Tool Calling               5/5                 3/5
Classification             4/5                 4/5
Agentic Planning           5/5                 4/5
Structured Output          4/5                 4/5
Safety Calibration         5/5                 1/5
Strategic Analysis         5/5                 2/5
Persona Consistency        5/5                 3/5
Constrained Rewriting      3/5                 3/5
Creative Problem Solving   5/5                 2/5
Summary                    9 wins              0 wins
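The win/tie tally in the summary row follows directly from the scores. A minimal sketch (variable names are ours, not part of the benchmark suite):

```python
# (Sonnet score, Devstral score) per benchmark, from the comparison table.
scores = {
    "Faithfulness": (5, 4), "Long Context": (5, 4), "Multilingual": (5, 4),
    "Tool Calling": (5, 3), "Classification": (4, 4), "Agentic Planning": (5, 4),
    "Structured Output": (4, 4), "Safety Calibration": (5, 1),
    "Strategic Analysis": (5, 2), "Persona Consistency": (5, 3),
    "Constrained Rewriting": (3, 3), "Creative Problem Solving": (5, 2),
}

sonnet_wins = sum(s > d for s, d in scores.values())
devstral_wins = sum(d > s for s, d in scores.values())
ties = sum(s == d for s, d in scores.values())
print(sonnet_wins, devstral_wins, ties)  # 9 0 3
```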

Pricing Analysis

Costs are steeply different: Claude Sonnet 4.6 charges $3 input and $15 output per MTok (1 MTok = 1 million tokens); Devstral Medium charges $0.40 input and $2 output per MTok, a 7.5× price ratio on both input and output. At typical token volumes:

  • 1M tokens: Sonnet = $3 all-input / $15 all-output (50/50 split = $9); Devstral = $0.40 / $2 (50/50 = $1.20).
  • 10M tokens: Sonnet = $30 / $150 (50/50 = $90); Devstral = $4 / $20 (50/50 = $12).
  • 100M tokens: Sonnet = $300 / $1,500 (50/50 = $900); Devstral = $40 / $200 (50/50 = $120).

Who should care: enterprises or apps running millions of tokens per month (chatbots, coding CI, large-scale inference) will see the 7.5× gap compound into substantial absolute savings; choose Devstral under strict cost limits. Teams that need top-ranked safety, tool calling, long context, and agentic planning should budget for Sonnet despite the higher spend.
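The arithmetic above can be sketched as a small helper. Prices are the published per-MTok rates; the 50/50 input/output split is the same simplifying assumption used in the figures above, and real workloads should substitute their own ratio:

```python
def blended_cost(total_tokens: int, input_price: float, output_price: float,
                 input_frac: float = 0.5) -> float:
    """Dollar cost for a token volume at per-million-token (MTok) prices."""
    input_tokens = total_tokens * input_frac
    output_tokens = total_tokens - input_tokens
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 1M tokens at a 50/50 input/output split
sonnet = blended_cost(1_000_000, 3.00, 15.00)    # $9.00
devstral = blended_cost(1_000_000, 0.40, 2.00)   # $1.20
print(f"Sonnet ${sonnet:.2f} vs Devstral ${devstral:.2f} ({sonnet / devstral:.1f}x)")
```

Because both prices differ by the same 7.5× factor, the blended ratio stays 7.5× regardless of the input/output split.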

Real-World Cost Comparison

Task             Claude Sonnet 4.6   Devstral Medium
Chat response    $0.0081             $0.0011
Blog post        $0.032              $0.0042
Document batch   $0.810              $0.108
Pipeline run     $8.10               $1.08
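The per-task figures above depend on assumed token counts, which the table does not publish. As an illustrative sketch: at the listed prices, a chat response of roughly 700 input and 400 output tokens (our assumption, not a published figure) happens to reproduce the first row:

```python
def task_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Dollar cost of one task at per-MTok prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Illustrative assumption: ~700 input and ~400 output tokens per chat response.
print(round(task_cost(700, 400, 3.00, 15.00), 4))  # Sonnet:   0.0081
print(round(task_cost(700, 400, 0.40, 2.00), 4))   # Devstral: 0.0011
```

To budget your own workload, substitute measured token counts from your logs for the assumed values.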

Bottom Line

Choose Claude Sonnet 4.6 if you need best-in-class safety calibration, tool calling, long-context retrieval, agentic planning, or high-fidelity multilingual and strategic outputs; our testing shows Sonnet winning 9 of 12 benchmarks and posting 75.2% on SWE-bench Verified (Epoch AI). Budget for the higher cost: $3 input / $15 output per MTok. Choose Devstral Medium if your priority is cost at scale and you need competitive classification and structured-output performance at a fraction of the price ($0.40 input / $2 output per MTok), or if you run very high token volumes where the 7.5× price gap dominates the decision.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions