Devstral Medium vs Llama 4 Maverick
Llama 4 Maverick is the better default choice for most teams: its output tokens cost one-third as much as Devstral Medium's ($0.60/M vs $2.00/M, a 3.3× gap), and it matches Devstral Medium on six of eleven benchmarks while outperforming it on safety calibration and persona consistency. Devstral Medium earns its premium specifically for agentic and tool-calling workloads, where it scores 4/5 to Maverick's 3/5 on agentic planning and posts the only verified tool-calling result (3/5; Maverick's test was rate-limited). If your pipeline doesn't depend heavily on autonomous agent loops or function orchestration, the cost gap is hard to justify.
Devstral Medium (Mistral)
- Input: $0.40/MTok
- Output: $2.00/MTok

Llama 4 Maverick (Meta)
- Input: $0.15/MTok
- Output: $0.60/MTok
Benchmark Analysis
Across 11 comparable benchmarks in our testing, Devstral Medium wins 2, Llama 4 Maverick wins 3, and 6 are tied; tool calling could not be cleanly compared and is excluded from the tally (see the note below).
Devstral Medium wins:
- Tool calling (3 vs rate-limited): Devstral Medium scored 3/5 on function selection, argument accuracy, and sequencing. Maverick's tool-calling test hit a 429 rate limit on the test date, so no clean comparison is possible; treat Devstral Medium's result as the only verified data point. It is excluded from the win/loss tally above.
- Classification (4 vs 3): Devstral Medium scored 4/5, tied for 1st with 29 other models out of 53 tested. Maverick scored 3/5, ranking 31st of 53. For routing, intent detection, and categorization tasks, Devstral Medium has a meaningful edge.
- Agentic planning (4 vs 3): Devstral Medium scored 4/5, ranking 16th of 54. Maverick scored 3/5, ranking 42nd of 54. This is the most practically significant gap: goal decomposition and failure recovery are core to any agentic workflow, and Devstral Medium sits in the upper third of the field while Maverick sits near the bottom third. A minimal sketch of the loop these benchmarks exercise follows this list.
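To make these two workloads concrete, here is a minimal, framework-agnostic sketch of the loop that tool-calling and agentic-planning benchmarks exercise: the model must select the right function, supply valid arguments, sequence calls toward a goal, and recover from failures. The tool registry, the canned `plan`, and the retry policy are illustrative assumptions, not our actual test harness.

```python
import json
from typing import Any, Callable

# Hypothetical tool registry: name -> (callable, required argument names).
# In a real harness the model proposes these calls; here a canned plan stands in.
TOOLS: dict[str, tuple[Callable[..., Any], set[str]]] = {
    "search_docs": (lambda query: f"results for {query!r}", {"query"}),
    "create_ticket": (lambda title, body: {"id": 42, "title": title}, {"title", "body"}),
}

def execute_step(step: dict, max_retries: int = 1) -> Any:
    """Validate and run one proposed tool call, retrying once on failure.

    This mirrors what a tool-calling benchmark grades: function selection,
    argument accuracy, and sequencing/failure recovery.
    """
    name, args = step["tool"], step.get("args", {})
    if name not in TOOLS:                      # wrong function selected
        raise ValueError(f"unknown tool: {name}")
    fn, required = TOOLS[name]
    missing = required - args.keys()
    if missing:                                # inaccurate or incomplete arguments
        raise ValueError(f"{name} missing args: {sorted(missing)}")
    for attempt in range(max_retries + 1):
        try:
            return fn(**args)                  # correct call: execute and return
        except Exception as exc:               # failure recovery: retry, then surface
            if attempt == max_retries:
                raise RuntimeError(f"{name} failed after retries: {exc}") from exc

# A canned two-step plan standing in for model output (goal decomposition).
plan = json.loads("""[
  {"tool": "search_docs", "args": {"query": "rate limits"}},
  {"tool": "create_ticket", "args": {"title": "429s on /v1", "body": "see search results"}}
]""")

for step in plan:
    print(execute_step(step))
```

In production the plan comes from the model itself, turn by turn, which is where goal decomposition and failure recovery are actually stressed.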
Llama 4 Maverick wins:
- Persona consistency (5 vs 3): Maverick scored 5/5, tied for 1st with 36 other models out of 53. Devstral Medium scored 3/5, ranking 45th of 53. If you're building a chatbot, roleplay system, or any product requiring stable character under adversarial prompting, Maverick is clearly superior.
- Safety calibration (2 vs 1): Maverick scored 2/5, ranking 12th of 55. Devstral Medium scored 1/5, ranking 32nd of 55. A score of 1/5 is the lowest tier on our rubric for refusing harmful requests while permitting legitimate ones, a real concern for customer-facing deployments.
- Creative problem solving (3 vs 2): Maverick scored 3/5, ranking 30th of 54. Devstral Medium scored 2/5, ranking 47th of 54. For generating non-obvious, feasible ideas, Maverick is demonstrably better.
Ties on six benchmarks (both models score equally):
- Structured output: both 4/5 (rank 26 of 54)
- Strategic analysis: both 2/5 (rank 44 of 54) — both models rank near the bottom on nuanced tradeoff reasoning
- Constrained rewriting: both 3/5 (rank 31 of 53)
- Faithfulness: both 4/5 (rank 34 of 55)
- Long context: both 4/5 (rank 38 of 55)
- Multilingual: both 4/5 (rank 36 of 55)
Note: Llama 4 Maverick was not tested on tool calling in our suite due to a rate limit event (429 error, noted as likely transient). That result is excluded from the win/loss tally but flagged for transparency.
Neither model has external benchmark scores (SWE-bench Verified, AIME 2025, MATH Level 5) on record in our data, so no third-party supplementary figures are available for this comparison.
Pricing Analysis
Devstral Medium costs $0.40/M input and $2.00/M output. Llama 4 Maverick costs $0.15/M input and $0.60/M output. The output cost gap drives most of the math in practice.
- At 1M output tokens/month: Devstral Medium costs $2.00; Maverick costs $0.60 — a $1.40 difference that's trivial.
- At 10M output tokens/month: Devstral Medium costs $20.00; Maverick costs $6.00 — a $14 gap that starts to matter for startups.
- At 100M output tokens/month: Devstral Medium costs $200; Maverick costs $60 — a $140/month difference that becomes a real budget line item for high-volume production workloads (see the estimator sketch after this list).
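To run the same arithmetic against your own traffic, the sketch below computes a blended monthly bill from the listed rates. The 300M-input/100M-output mix in the example is an illustrative assumption, not a measured workload.

```python
# Published rates in $ per million tokens (from the pricing above).
RATES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "Llama 4 Maverick": {"input": 0.15, "output": 0.60},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Blended monthly cost for a volume given in millions of tokens."""
    r = RATES[model]
    return input_mtok * r["input"] + output_mtok * r["output"]

# Example: 300M input / 100M output tokens per month (assumed mix).
for model in RATES:
    print(f"{model}: ${monthly_cost(model, 300, 100):,.2f}/month")
# Devstral Medium: $320.00/month vs Llama 4 Maverick: $105.00/month;
# even with 3x more input than output, the output rate drives most of the gap.
```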
Who should care: Any team running batch jobs, code review pipelines, or high-throughput document processing will feel the 3.3× output cost ratio quickly. The savings case for Maverick is strongest wherever its benchmark parity with Devstral Medium holds — which is the majority of tasks tested. Teams that specifically need agentic loop performance and can demonstrate a quality difference in production should consider the Devstral Medium premium justified.
Bottom Line
Choose Devstral Medium if:
- Your application runs agentic loops where goal decomposition and failure recovery matter — it scores 4/5 vs Maverick's 3/5 on agentic planning (rank 16 vs rank 42 of 54).
- You need reliable classification or routing logic — it scores 4/5 vs Maverick's 3/5, placing it among the top 30 models tested.
- Tool calling is central to your pipeline and you want the only verified result in this head-to-head (Maverick's test was rate-limited).
- Your volume is low enough that the 3.3× output cost premium ($2.00 vs $0.60/M tokens) doesn't compound into a budget problem.
Choose Llama 4 Maverick if:
- You're building a consumer-facing product — its 5/5 persona consistency (tied for 1st of 53) makes it far more reliable for chatbots and character-driven interfaces.
- Safety is non-negotiable — Devstral Medium's 1/5 safety calibration score is the lowest tier in our testing, while Maverick's 2/5 ranks 12th of 55.
- Your workload spans creative ideation or brainstorming — Maverick scores 3/5 vs 2/5 on creative problem solving.
- You process high token volumes — at 100M output tokens/month, Maverick saves $140 vs Devstral Medium with equivalent results on six benchmarks.
- You need multimodal input — Maverick accepts image input (text+image→text); Devstral Medium is text-only.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
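The ranks quoted throughout (for example, "tied for 1st with 36 other models") are consistent with standard competition ranking over those 1–5 judge scores, where a model's rank is one plus the number of models that scored strictly higher. The sketch below shows that computation; the scoring field is a toy example, and the ranking rule is our reading of the published numbers rather than a documented formula.

```python
def competition_ranks(scores: dict[str, int]) -> dict[str, int]:
    """Competition ranking: rank = 1 + number of models scoring strictly higher.

    Ties share a rank, which is how a 5/5 can be tied for 1st with 36
    other models while a 3/5 lands at 45th of 53.
    """
    values = list(scores.values())
    return {m: 1 + sum(v > s for v in values) for m, s in scores.items()}

# Toy field: three models at 5, one at 4, one at 3 (illustrative scores).
field = {"A": 5, "B": 5, "C": 5, "D": 4, "E": 3}
print(competition_ranks(field))  # {'A': 1, 'B': 1, 'C': 1, 'D': 4, 'E': 5}
```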