Claude Opus 4.7 vs Devstral Medium
Claude Opus 4.7 is the stronger general-purpose model by a wide margin, winning 9 of 12 benchmarks in our testing — including dominant leads on tool calling, agentic planning, strategic analysis, and creative problem solving. Devstral Medium's sole benchmark win is classification, and it costs 12.5x less on output tokens ($2 vs $25 per million), making it a real contender for high-volume, classification-heavy pipelines where the capability gap doesn't matter. For most professional and agentic workloads, Opus 4.7 justifies its premium; for cost-sensitive, narrowly scoped tasks, Devstral Medium earns its place.
Pricing at a glance: Claude Opus 4.7 (Anthropic) runs $5.00/MTok input and $25.00/MTok output; Devstral Medium (Mistral) runs $0.400/MTok input and $2.00/MTok output.
Benchmark Analysis
Across our 12-test benchmark suite, Claude Opus 4.7 wins 9 categories outright, Devstral Medium wins 1, and they tie on 2.
Where Opus 4.7 dominates:
Tool calling is the starkest gap: Opus 4.7 scores 5/5 (tied for 1st among 55 models) versus Devstral Medium's 3/5 (ranked 48th of 55). For agentic workflows that depend on accurate function selection and argument passing, this is a decisive difference. A score of 3 on tool calling means models in this tier frequently mismatch arguments or missequence calls — a meaningful failure mode in production agents.
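To make that failure mode concrete, here is a minimal sketch (in Python, not tied to either model's API) of checking a model-emitted tool call against its declared parameter schema; the get_weather tool, its schema, and the sample call are all hypothetical.

```python
# Minimal sketch: validating a model-emitted tool call against its declared schema.
# The tool name, schema, and the sample call below are hypothetical.
TOOL_SCHEMA = {
    "name": "get_weather",
    "required": {"city": str, "unit": str},
}

def validate_tool_call(call: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    problems = []
    if call.get("name") != schema["name"]:
        problems.append(f"unexpected tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    for arg, expected_type in schema["required"].items():
        if arg not in args:
            problems.append(f"missing argument: {arg}")  # argument omitted or misnamed
        elif not isinstance(args[arg], expected_type):
            problems.append(f"wrong type for {arg}: {type(args[arg]).__name__}")
    return problems

# The kind of mismatched call a 3/5 tool-calling model emits more often:
bad_call = {"name": "get_weather", "arguments": {"city": 94110, "units": "celsius"}}
print(validate_tool_call(bad_call, TOOL_SCHEMA))
# ['wrong type for city: int', 'missing argument: unit']
```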
Agentic planning follows the same pattern: Opus 4.7 scores 5/5 (tied for 1st among 55 models) versus Devstral Medium's 4/5 (ranked 17th of 55). The gap is smaller here — a 4 is a reasonable score — but Opus 4.7's consistency across both planning and tool execution makes it the clear choice for multi-step agentic systems.
Strategic analysis shows the widest qualitative gap: 5/5 for Opus 4.7 (tied for 1st, 55 models) versus 2/5 for Devstral Medium (ranked 45th of 55). A score of 2 on nuanced tradeoff reasoning indicates the model struggles with complex analytical tasks — a real limitation for research, business analysis, or decision-support use cases.
Creative problem solving mirrors this: 5/5 for Opus 4.7 (tied for 1st, 55 models) versus 2/5 for Devstral Medium (ranked 48th of 55). When tasks require non-obvious, specific, feasible ideas, Devstral Medium is near the bottom of the field.
Faithfulness: Opus 4.7 scores 5/5 (tied for 1st, 56 models) versus Devstral Medium's 4/5 (ranked 35th). Both are acceptable, but Opus 4.7 is more reliable for tasks where sticking to source material without hallucinating is critical.
Safety calibration: Opus 4.7 scores 3/5 (ranked 10th of 56 — one of only 3 models at this score) versus Devstral Medium's 1/5 (ranked 33rd of 56). Devstral Medium's score here suggests it either over-refuses or under-refuses harmful requests at a rate that would concern teams building safety-sensitive applications.
Persona consistency: 5/5 for Opus 4.7 (tied for 1st, 55 models) versus 3/5 for Devstral Medium (ranked 47th of 55). Relevant for chatbot and roleplay applications.
Long context: 5/5 for Opus 4.7 (tied for 1st, 56 models) versus 4/5 for Devstral Medium (ranked 39th). Opus 4.7 also supports a 1,000,000-token context window versus Devstral Medium's 131,072 tokens — a massive practical difference for document analysis at scale.
Constrained rewriting: 4/5 for Opus 4.7 (ranked 6th of 55) versus 3/5 for Devstral Medium (ranked 32nd of 55).
Where Devstral Medium wins:
Classification is Devstral Medium's only benchmark win: 4/5 (tied for 1st among 54 models) versus Opus 4.7's 3/5 (ranked 31st of 54). For routing, tagging, and categorization tasks, Devstral Medium is genuinely competitive with the best models in our testing — making it a legitimate choice for classification-heavy pipelines.
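To illustrate the kind of pipeline where that win matters, here is a minimal routing sketch against an OpenAI-compatible chat endpoint; the endpoint URL, model identifier, and label set are illustrative assumptions rather than values from our testing.

```python
# Minimal classification/routing sketch against an OpenAI-compatible chat endpoint.
# The URL, model identifier, and label set are illustrative assumptions.
import os
import requests

LABELS = ["billing", "technical_support", "sales", "other"]

def classify_ticket(text: str) -> str:
    resp = requests.post(
        "https://api.example.com/v1/chat/completions",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={
            "model": "devstral-medium",  # placeholder model id
            "temperature": 0,            # keep routing as deterministic as possible
            "messages": [
                {"role": "system",
                 "content": "Classify the ticket into exactly one of: "
                            f"{', '.join(LABELS)}. Reply with the label only."},
                {"role": "user", "content": text},
            ],
        },
        timeout=30,
    )
    resp.raise_for_status()
    label = resp.json()["choices"][0]["message"]["content"].strip()
    return label if label in LABELS else "other"  # fall back on unexpected output

print(classify_ticket("I was charged twice for my subscription last month."))
```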
Ties:
Structured output and multilingual land at 4/5 for both models, and the two share rank 26 of 55 on structured output and rank 36 of 56 on multilingual. No meaningful difference here.
Pricing Analysis
The cost gap here is substantial. Claude Opus 4.7 runs at $5 per million input tokens and $25 per million output tokens. Devstral Medium runs at $0.40 per million input tokens and $2 per million output tokens — a 12.5x difference on output, which is where most costs accumulate in real workloads.
At 1 million output tokens per month, Opus 4.7 costs $25 versus Devstral Medium's $2 — a $23 difference that's negligible for most teams. At 10 million output tokens, that gap becomes $230 per month, still manageable. At 100 million output tokens — the scale of a production API serving thousands of users — you're looking at $2,500 versus $200 per month, a $2,300 monthly difference that demands justification.
Developers building internal tools or low-volume prototypes should choose on capability alone; Opus 4.7 wins that argument. Teams running high-throughput classification pipelines, document routing systems, or cost-sensitive inference at scale should take Devstral Medium seriously — especially given it actually outperforms Opus 4.7 on classification in our testing. The break-even question is whether the capability gap costs you more in rework, errors, or engineering time than the $2,300/month you'd save.
Real-World Cost Comparison
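The arithmetic above is easy to sanity-check for your own volumes. Below is a minimal sketch that reproduces the numbers from the pricing analysis, counting output-token list prices only and ignoring input-side costs.

```python
# Monthly output-token cost at the list prices quoted above ($ per million tokens).
PRICES = {"Claude Opus 4.7": 25.00, "Devstral Medium": 2.00}

def monthly_cost(output_tokens: int, price_per_mtok: float) -> float:
    return output_tokens / 1_000_000 * price_per_mtok

for volume in (1_000_000, 10_000_000, 100_000_000):
    costs = {m: monthly_cost(volume, p) for m, p in PRICES.items()}
    gap = costs["Claude Opus 4.7"] - costs["Devstral Medium"]
    print(f"{volume:>11,} output tokens/mo: "
          + ", ".join(f"{m} ${c:,.0f}" for m, c in costs.items())
          + f" (gap ${gap:,.0f}/mo)")
# 1M: $25 vs $2; 10M: $250 vs $20; 100M: $2,500 vs $200
```

Swap in your own input/output split to see where the gap stops being negligible for your team.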
Bottom Line
Choose Claude Opus 4.7 if:
- You're building agentic systems or AI workflows that rely on tool calling and multi-step planning — its 5/5 scores in both categories, versus Devstral Medium's 3/5 and 4/5, translate directly to fewer broken agent runs.
- Your application involves strategic analysis, research synthesis, or complex reasoning — Devstral Medium's 2/5 on strategic analysis makes it genuinely unsuitable for these tasks.
- You need a context window beyond 131,072 tokens — Opus 4.7's 1,000,000-token window is in a different class for large-document work.
- Safety calibration matters — Opus 4.7 scores 3/5 versus Devstral Medium's 1/5, placing it significantly higher in our safety testing.
- Volume is moderate (under ~50M output tokens/month) and capability is the primary concern.
Choose Devstral Medium if:
- Your core use case is document classification, content routing, or tagging — it tied for 1st on classification in our testing, outperforming Opus 4.7.
- You're running a high-throughput production system where output volume exceeds tens of millions of tokens monthly and the $2 vs $25 per million token difference creates real budget pressure.
- Your application is narrowly scoped to tasks Devstral Medium handles well (classification, structured output, basic agentic planning) and you don't need strong strategic reasoning or creative problem solving.
- You want explicit control over generation parameters: Devstral Medium exposes temperature, top_p, seed, frequency and presence penalties, and more through supported API parameters (see the sketch after this list).
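As a rough illustration of that last point, here is what those controls look like in an OpenAI-compatible request payload. Parameter names follow the common OpenAI-style convention and may differ by provider (some APIs use random_seed instead of seed, for example), so treat this as a sketch rather than API documentation.

```python
# Sketch: explicit generation controls in an OpenAI-compatible request payload.
# Names follow the OpenAI-style convention; check your provider's docs for variants.
payload = {
    "model": "devstral-medium",  # placeholder model id
    "messages": [{"role": "user", "content": "Summarize this ticket in one line: ..."}],
    "temperature": 0.2,          # low randomness for repeatable outputs
    "top_p": 0.9,                # nucleus sampling cutoff
    "seed": 42,                  # best-effort reproducibility
    "frequency_penalty": 0.1,    # discourage verbatim repetition
    "presence_penalty": 0.0,     # no extra push toward new topics
    "max_tokens": 256,
}
```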
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.