Claude Opus 4.7 vs Devstral Medium

Claude Opus 4.7 is the stronger general-purpose model by a wide margin, winning 9 of 12 benchmarks in our testing — including dominant leads on tool calling, agentic planning, strategic analysis, and creative problem solving. Devstral Medium's sole benchmark win is classification, and it costs 12.5x less on output tokens ($2 vs $25 per million), making it a real contender for high-volume, classification-heavy pipelines where the capability gap doesn't matter. For most professional and agentic workloads, Opus 4.7 justifies its premium; for cost-sensitive, narrowly scoped tasks, Devstral Medium earns its place.

Claude Opus 4.7 (Anthropic)

Overall: 4.42/5 (Strong)

Benchmark scores: Faithfulness 5/5 · Long Context 5/5 · Multilingual 4/5 · Tool Calling 5/5 · Classification 3/5 · Agentic Planning 5/5 · Structured Output 4/5 · Safety Calibration 3/5 · Strategic Analysis 5/5 · Persona Consistency 5/5 · Constrained Rewriting 4/5 · Creative Problem Solving 5/5

External benchmarks: SWE-bench Verified N/A · MATH Level 5 N/A · AIME 2025 N/A

Pricing: $5.00/MTok input · $25.00/MTok output

Context window: 1,000K tokens

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark scores: Faithfulness 4/5 · Long Context 4/5 · Multilingual 4/5 · Tool Calling 3/5 · Classification 4/5 · Agentic Planning 4/5 · Structured Output 4/5 · Safety Calibration 1/5 · Strategic Analysis 2/5 · Persona Consistency 3/5 · Constrained Rewriting 3/5 · Creative Problem Solving 2/5

External benchmarks: SWE-bench Verified N/A · MATH Level 5 N/A · AIME 2025 N/A

Pricing: $0.40/MTok input · $2.00/MTok output

Context window: 131K tokens

Benchmark Analysis

Across our 12-test benchmark suite, Claude Opus 4.7 wins 9 categories outright, Devstral Medium wins 1, and they tie on 2.

Where Opus 4.7 dominates:

Tool calling is the starkest gap: Opus 4.7 scores 5/5 (tied for 1st among 55 models) versus Devstral Medium's 3/5 (ranked 48th of 55). For agentic workflows that depend on accurate function selection and argument passing, this is a decisive difference. A score of 3 on tool calling means models in this tier frequently mismatch arguments or missequence calls — a meaningful failure mode in production agents.
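
To make that failure mode concrete, here is a minimal, hypothetical sketch of the argument validation an agent harness typically ends up adding around a weaker tool-calling model. The tool names and schema below are invented for illustration and are not part of our benchmark suite.

```python
# Hypothetical illustration of the failure mode described above: a small tool
# schema plus a guard that rejects model-proposed calls whose arguments do not
# match it. Tool names and fields are invented for this sketch.

TOOLS = {
    "get_invoice": {"required": {"invoice_id": str}},
    "refund_invoice": {"required": {"invoice_id": str, "amount_cents": int}},
}

def validate_call(name, args):
    """Return a list of problems with a proposed tool call (empty list = OK)."""
    spec = TOOLS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    problems = []
    for field, expected_type in spec["required"].items():
        if field not in args:
            problems.append(f"missing argument: {field}")
        elif not isinstance(args[field], expected_type):
            problems.append(f"wrong type for {field}: got {type(args[field]).__name__}")
    return problems

# A weaker tool-calling model might pass the refund amount as a string, or skip
# the lookup step entirely; validating before execution catches the first case.
print(validate_call("refund_invoice", {"invoice_id": "INV-42", "amount_cents": "10.00"}))
```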

Agentic planning follows the same pattern: Opus 4.7 scores 5/5 (tied for 1st among 55 models) versus Devstral Medium's 4/5 (ranked 17th of 55). The gap is smaller here — a 4 is a reasonable score — but Opus 4.7's consistency across both planning and tool execution makes it the clear choice for multi-step agentic systems.

Strategic analysis shows the widest qualitative gap: 5/5 for Opus 4.7 (tied for 1st, 55 models) versus 2/5 for Devstral Medium (ranked 45th of 55). A score of 2 on nuanced tradeoff reasoning indicates the model struggles with complex analytical tasks — a real limitation for research, business analysis, or decision-support use cases.

Creative problem solving mirrors this: 5/5 for Opus 4.7 (tied for 1st, 55 models) versus 2/5 for Devstral Medium (ranked 48th of 55). When tasks require non-obvious, specific, feasible ideas, Devstral Medium is near the bottom of the field.

Faithfulness: Opus 4.7 scores 5/5 (tied for 1st, 56 models) versus Devstral Medium's 4/5 (ranked 35th). Both are acceptable, but Opus 4.7 is more reliable for tasks where sticking to source material without hallucinating is critical.

Safety calibration: Opus 4.7 scores 3/5 (ranked 10th of 56 — one of only 3 models at this score) versus Devstral Medium's 1/5 (ranked 33rd of 56). Devstral Medium's score here suggests it either over-refuses or under-refuses harmful requests at a rate that would concern teams building safety-sensitive applications.

Persona consistency: 5/5 for Opus 4.7 (tied for 1st, 55 models) versus 3/5 for Devstral Medium (ranked 47th of 55). Relevant for chatbot and roleplay applications.

Long context: 5/5 for Opus 4.7 (tied for 1st, 56 models) versus 4/5 for Devstral Medium (ranked 39th). Opus 4.7 also supports a 1,000,000-token context window versus Devstral Medium's 131,072 tokens — a massive practical difference for document analysis at scale.

Constrained rewriting: 4/5 for Opus 4.7 (ranked 6th of 55) versus 3/5 for Devstral Medium (ranked 32nd of 55).

Where Devstral Medium wins:

Classification is Devstral Medium's only benchmark win: 4/5 (tied for 1st among 54 models) versus Opus 4.7's 3/5 (ranked 31st of 54). For routing, tagging, and categorization tasks, Devstral Medium is genuinely competitive with the best models in our testing — making it a legitimate choice for classification-heavy pipelines.
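
For a sense of what a classification-heavy pipeline looks like in practice, here is a minimal, model-agnostic sketch. The labels, routing rules, and the keyword stub standing in for the model call are all hypothetical; in production, the body of classify() would be a single prompt to whichever model you pick.

```python
# Minimal sketch of a classification-heavy routing pipeline, the kind of workload
# where a cheaper model is competitive. Labels, queues, and the keyword stub that
# stands in for the model call are all hypothetical.

LABELS = ["billing", "bug_report", "feature_request", "other"]

def classify(text: str) -> str:
    # Placeholder for the model call: in practice you would send `text` plus the
    # label set to your provider's API and return its single-label answer.
    lowered = text.lower()
    if "invoice" in lowered or "charge" in lowered:
        return "billing"
    if "crash" in lowered or "error" in lowered:
        return "bug_report"
    return "other"

def route(ticket: str) -> str:
    label = classify(ticket)
    # Anything outside the fixed label set falls back to a human queue, which
    # keeps an imperfect classifier safe to run unattended.
    return label if label in LABELS else "human_review"

print(route("The app crashes with an error when I open settings"))  # -> bug_report
```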

Ties:

Structured output and multilingual both land at 4/5, with the two models sharing rank 26 of 55 on structured output and rank 36 of 56 on multilingual. No meaningful difference here.

Benchmark                  Claude Opus 4.7   Devstral Medium
Faithfulness               5/5               4/5
Long Context               5/5               4/5
Multilingual               4/5               4/5
Tool Calling               5/5               3/5
Classification             3/5               4/5
Agentic Planning           5/5               4/5
Structured Output          4/5               4/5
Safety Calibration         3/5               1/5
Strategic Analysis         5/5               2/5
Persona Consistency        5/5               3/5
Constrained Rewriting      4/5               3/5
Creative Problem Solving   5/5               2/5
Summary                    9 wins            1 win

Pricing Analysis

The cost gap here is substantial. Claude Opus 4.7 runs at $5 per million input tokens and $25 per million output tokens. Devstral Medium runs at $0.40 per million input tokens and $2 per million output tokens — a 12.5x difference on output, which is where most costs accumulate in real workloads.

At 1 million output tokens per month, Opus 4.7 costs $25 versus Devstral Medium's $2 — a $23 difference that's negligible for most teams. At 10 million output tokens, that gap becomes $230 per month, still manageable. At 100 million output tokens — the scale of a production API serving thousands of users — you're looking at $2,500 versus $200 per month, a $2,300 monthly difference that demands justification.
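
The arithmetic is easy to check directly. A short script using only the output-token prices listed above (input costs are ignored here, which slightly understates both totals):

```python
# Reproduces the monthly-cost arithmetic above from the listed output-token prices.
OUTPUT_PRICE_PER_MTOK = {
    "Claude Opus 4.7": 25.00,
    "Devstral Medium": 2.00,
}

for monthly_output_tokens in (1_000_000, 10_000_000, 100_000_000):
    costs = {
        name: price * monthly_output_tokens / 1_000_000
        for name, price in OUTPUT_PRICE_PER_MTOK.items()
    }
    gap = costs["Claude Opus 4.7"] - costs["Devstral Medium"]
    print(f"{monthly_output_tokens:>11,} output tokens/month: "
          f"${costs['Claude Opus 4.7']:,.0f} vs ${costs['Devstral Medium']:,.0f} "
          f"(gap ${gap:,.0f})")
```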

Developers building internal tools or low-volume prototypes should choose on capability alone; Opus 4.7 wins that argument. Teams running high-throughput classification pipelines, document routing systems, or cost-sensitive inference at scale should take Devstral Medium seriously — especially given it actually outperforms Opus 4.7 on classification in our testing. The break-even question is whether the capability gap costs you more in rework, errors, or engineering time than the $2,300/month you'd save.

Real-World Cost Comparison

Task             Claude Opus 4.7   Devstral Medium
Chat response    $0.014            $0.0011
Blog post        $0.053            $0.0042
Document batch   $1.35             $0.108
Pipeline run     $13.50            $1.08
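
These per-task figures follow from the same per-token prices. The token counts in the sketch below are illustrative assumptions chosen to roughly reproduce the table; they are not the table's published inputs.

```python
# Sketch of how per-task costs follow from per-token prices. The token counts are
# illustrative assumptions chosen to roughly reproduce the table above; they are
# not the table's published inputs.
PRICES = {  # (input, output) in USD per million tokens
    "Claude Opus 4.7": (5.00, 25.00),
    "Devstral Medium": (0.40, 2.00),
}
TASKS = {  # assumed (input_tokens, output_tokens) per task
    "Chat response": (300, 500),
    "Blog post": (500, 2_000),
    "Document batch": (50_000, 44_000),
    "Pipeline run": (500_000, 440_000),
}

for task, (tok_in, tok_out) in TASKS.items():
    row = ", ".join(
        f"{model}: ${(tok_in * p_in + tok_out * p_out) / 1_000_000:,.4f}"
        for model, (p_in, p_out) in PRICES.items()
    )
    print(f"{task:<15} {row}")
```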

Bottom Line

Choose Claude Opus 4.7 if:

  • You're building agentic systems or AI workflows that rely on tool calling and multi-step planning — its 5/5 scores in both categories, versus Devstral Medium's 3/5 and 4/5, translate directly to fewer broken agent runs.
  • Your application involves strategic analysis, research synthesis, or complex reasoning — Devstral Medium's 2/5 on strategic analysis makes it genuinely unsuitable for these tasks.
  • You need a context window beyond 131,072 tokens — Opus 4.7's 1,000,000-token window is in a different class for large-document work.
  • Safety calibration matters — Opus 4.7 scores 3/5 versus Devstral Medium's 1/5, placing it significantly higher in our safety testing.
  • Volume is moderate (under ~50M output tokens/month) and capability is the primary concern.

Choose Devstral Medium if:

  • Your core use case is document classification, content routing, or tagging — it tied for 1st on classification in our testing, outperforming Opus 4.7.
  • You're running a high-throughput production system where output volume exceeds tens of millions of tokens monthly and the $2 vs $25 per million token difference creates real budget pressure.
  • Your application is narrowly scoped to tasks Devstral Medium handles well (classification, structured output, basic agentic planning) and you don't need strong strategic reasoning or creative problem solving.
  • You want explicit control over generation parameters — Devstral Medium exposes temperature, top_p, seed, frequency and presence penalties, and more through supported API parameters; a brief example follows this list.
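
As an illustration of that last point, here is a minimal sketch of a single chat-completion request that sets those sampling parameters explicitly. The endpoint, parameter names (note random_seed rather than seed), and the model identifier are assumptions based on Mistral's public API conventions, so verify them against the current documentation before relying on them.

```python
import os
import requests

# Minimal sketch: one chat-completion request with explicit sampling controls.
# The endpoint, parameter names, and model identifier are assumptions based on
# Mistral's public API conventions; verify against the current documentation.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "devstral-medium-latest",  # assumed identifier; check the model list
        "messages": [{"role": "user", "content": "Label this ticket: 'card was charged twice'"}],
        "temperature": 0.2,        # low temperature suits classification-style tasks
        "top_p": 0.9,
        "random_seed": 42,         # Mistral's name for the sampling seed
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0,
        "max_tokens": 16,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```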

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions