Devstral Medium vs Llama 4 Maverick

Llama 4 Maverick is the better default choice for most teams: it costs a third as much per output token ($0.60/M vs $2.00/M) while matching Devstral Medium on six of eleven comparable benchmarks and beating it on safety calibration, persona consistency, and creative problem solving. Devstral Medium earns its premium specifically for agentic and tool-calling workloads: it scores 4/5 to Maverick's 3/5 on agentic planning and holds the only verified tool-calling result in this head-to-head (3/5; Maverick's run was rate-limited). If your pipeline doesn't depend heavily on autonomous agent loops or function orchestration, the cost gap is hard to justify.

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark scores: Faithfulness 4/5 · Long Context 4/5 · Multilingual 4/5 · Tool Calling 3/5 · Classification 4/5 · Agentic Planning 4/5 · Structured Output 4/5 · Safety Calibration 1/5 · Strategic Analysis 2/5 · Persona Consistency 3/5 · Constrained Rewriting 3/5 · Creative Problem Solving 2/5

External benchmarks: SWE-bench Verified N/A · MATH Level 5 N/A · AIME 2025 N/A

Pricing: $0.40/MTok input · $2.00/MTok output

Context window: 131K

Llama 4 Maverick (Meta)

Overall: 3.36/5 (Usable)

Benchmark scores: Faithfulness 4/5 · Long Context 4/5 · Multilingual 4/5 · Classification 3/5 · Agentic Planning 3/5 · Structured Output 4/5 · Safety Calibration 2/5 · Strategic Analysis 2/5 · Persona Consistency 5/5 · Constrained Rewriting 3/5 · Creative Problem Solving 3/5 (Tool Calling not tested; see note under Benchmark Analysis)

External benchmarks: SWE-bench Verified N/A · MATH Level 5 N/A · AIME 2025 N/A

Pricing: $0.15/MTok input · $0.60/MTok output

Context window: 1,049K

Benchmark Analysis

Our suite covers 12 benchmarks, but tool calling is excluded from the win/loss tally because Maverick's run was rate-limited (see the note below). Across the remaining 11 comparable benchmarks, Devstral Medium wins 2, Llama 4 Maverick wins 3, and 6 are tied.

Devstral Medium leads:

  • Tool calling (3 vs rate-limited; excluded from the tally): Devstral Medium scored 3/5 on function selection, argument accuracy, and sequencing (a sketch of what those three criteria measure follows this list). Maverick's tool-calling test hit a 429 rate limit on the test date, so no clean comparison is possible here; treat Devstral Medium's result as the only verified data point.
  • Classification (4 vs 3): Devstral Medium scored 4/5, tied for 1st with 29 other models out of 53 tested. Maverick scored 3/5, ranking 31st of 53. For routing, intent detection, and categorization tasks, Devstral Medium has a meaningful edge.
  • Agentic planning (4 vs 3): Devstral Medium scored 4/5, ranking 16th of 54. Maverick scored 3/5, ranking 42nd of 54. This is the most practically significant gap — goal decomposition and failure recovery are core to any agentic workflow, and Devstral Medium sits in the upper third of the field while Maverick sits near the bottom third.
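To make those three tool-calling criteria concrete, here is a minimal grading sketch. It is illustrative only, not the modelpicker.net harness; ToolCall, grade, and the flight-booking tools are hypothetical names, and a real harness would typically allow partial credit.

```python
# Illustrative grader for the three tool-calling criteria named above.
# NOT the actual test harness; every name here is hypothetical.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str   # which function the model chose to call
    args: dict  # the arguments it passed

def grade(expected: list[ToolCall], actual: list[ToolCall]) -> dict:
    """Score one transcript on selection, argument accuracy, and sequencing."""
    same_length = len(actual) == len(expected)
    return {
        # Selection: did the model call the right set of functions?
        "function_selection": {c.name for c in actual} == {c.name for c in expected},
        # Argument accuracy: do the arguments of each call match exactly?
        "argument_accuracy": same_length and all(
            a.args == e.args for a, e in zip(actual, expected)
        ),
        # Sequencing: were the calls emitted in the expected order?
        "sequencing": [c.name for c in actual] == [c.name for c in expected],
    }

expected = [ToolCall("search_flights", {"origin": "SFO", "dest": "JFK"}),
            ToolCall("book_flight", {"flight_id": "UA123"})]
actual = [ToolCall("search_flights", {"origin": "SFO", "dest": "JFK"}),
          ToolCall("book_flight", {"flight_id": "UA123"})]
print(grade(expected, actual))  # all three criteria pass
```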

Llama 4 Maverick wins:

  • Persona consistency (5 vs 3): Maverick scored 5/5, tied for 1st with 36 other models out of 53. Devstral Medium scored 3/5, ranking 45th of 53. If you're building a chatbot, roleplay system, or any product requiring stable character under adversarial prompting, Maverick is clearly superior.
  • Safety calibration (2 vs 1): Maverick scored 2/5, ranking 12th of 55. Devstral Medium scored 1/5, ranking 32nd of 55; a 1/5 is the lowest score tier in our testing for refusing harmful requests while permitting legitimate ones, a real concern for customer-facing deployments.
  • Creative problem solving (3 vs 2): Maverick scored 3/5, ranking 30th of 54. Devstral Medium scored 2/5, ranking 47th of 54. For generating non-obvious, feasible ideas, Maverick has a measurable edge.

Ties (both models score equally on six benchmarks):

  • Structured output: both 4/5 (rank 26 of 54)
  • Strategic analysis: both 2/5 (rank 44 of 54) — both models rank near the bottom on nuanced tradeoff reasoning
  • Constrained rewriting: both 3/5 (rank 31 of 53)
  • Faithfulness: both 4/5 (rank 34 of 55)
  • Long context: both 4/5 (rank 38 of 55)
  • Multilingual: both 4/5 (rank 36 of 55)

Note: Llama 4 Maverick was not tested on tool calling in our suite due to a rate limit event (429 error, noted as likely transient). That result is excluded from the win/loss tally but flagged for transparency.

Neither model has external benchmark scores (SWE-bench Verified, AIME 2025, MATH Level 5) on record, so no third-party supplementary data is available for this comparison.

Benchmark                   Devstral Medium   Llama 4 Maverick
Faithfulness                4/5               4/5
Long Context                4/5               4/5
Multilingual                4/5               4/5
Tool Calling                3/5               N/A (rate-limited)
Classification              4/5               3/5
Agentic Planning            4/5               3/5
Structured Output           4/5               4/5
Safety Calibration          1/5               2/5
Strategic Analysis          2/5               2/5
Persona Consistency         3/5               5/5
Constrained Rewriting       3/5               3/5
Creative Problem Solving    2/5               3/5
Summary                     2 wins            3 wins (6 ties)

Pricing Analysis

Devstral Medium costs $0.40/M input and $2.00/M output. Llama 4 Maverick costs $0.15/M input and $0.60/M output. The output cost gap drives most of the math in practice.

At 1M output tokens/month: Devstral Medium costs $2.00; Maverick costs $0.60 — a $1.40 difference that's trivial.

At 10M output tokens/month: Devstral Medium costs $20.00; Maverick costs $6.00 — a $14 gap that starts to matter for startups.

At 100M output tokens/month: Devstral Medium costs $200; Maverick costs $60 — a $140/month difference that becomes a real budget line item for high-volume production workloads.
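To project the same math onto your own traffic mix, here is a minimal cost sketch using the per-million-token rates quoted above. The 40M-input/10M-output example volume is hypothetical; substitute your own numbers.

```python
# Monthly cost sketch from the rates quoted in this article (USD per 1M tokens).
RATES = {
    "Devstral Medium": (0.40, 2.00),   # (input, output)
    "Llama 4 Maverick": (0.15, 0.60),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Cost in USD for a month's volume, given in millions of tokens."""
    in_rate, out_rate = RATES[model]
    return input_mtok * in_rate + output_mtok * out_rate

# Hypothetical volume: 40M input + 10M output tokens per month.
for model in RATES:
    print(f"{model}: ${monthly_cost(model, 40, 10):.2f}/month")
# -> Devstral Medium: $36.00/month, Llama 4 Maverick: $12.00/month
# (a 3x overall gap once input tokens are included)
```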

Who should care: Any team running batch jobs, code review pipelines, or high-throughput document processing will feel the 3.3× output cost ratio quickly. The savings case for Maverick is strongest wherever its benchmark parity with Devstral Medium holds — which is the majority of tasks tested. Teams that specifically need agentic loop performance and can demonstrate a quality difference in production should consider the Devstral Medium premium justified.

Real-World Cost Comparison

Task             Devstral Medium   Llama 4 Maverick
Chat response    $0.0011           <$0.001
Blog post        $0.0042           $0.0013
Document batch   $0.108            $0.033
Pipeline run     $1.08             $0.330

Bottom Line

Choose Devstral Medium if:

  • Your application runs agentic loops where goal decomposition and failure recovery matter — it scores 4/5 vs Maverick's 3/5 on agentic planning (rank 16 vs rank 42 of 54).
  • You need reliable classification or routing logic — it scores 4/5 vs Maverick's 3/5, placing it among the top 30 models tested.
  • Tool calling is central to your pipeline and you want the only verified result in this head-to-head (Maverick's test was rate-limited).
  • Your volume is low enough that the 3.3× output cost premium ($2.00 vs $0.60/M tokens) doesn't compound into a budget problem.

Choose Llama 4 Maverick if:

  • You're building a consumer-facing product — its 5/5 persona consistency (tied for 1st of 53) makes it far more reliable for chatbots and character-driven interfaces.
  • Safety is non-negotiable — Devstral Medium's 1/5 safety calibration score is the lowest tier in our testing, while Maverick's 2/5 ranks 12th of 55.
  • Your workload spans creative ideation or brainstorming — Maverick scores 3/5 vs 2/5 on creative problem solving.
  • You process high token volumes — at 100M output tokens/month, Maverick saves $140 vs Devstral Medium with equivalent results on six benchmarks.
  • You need multimodal input — Maverick accepts image input (text+image→text); Devstral Medium is text-only.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
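As a sanity check on the overall scores shown on the cards above: each one matches the unweighted mean of that model's 1–5 benchmark scores, with untested benchmarks excluded. This is a reconstruction, not the published formula.

```python
# Reconstruction (not the published formula): each card's overall score
# equals the mean of its 1-5 benchmark scores, rounded to two decimals.
# Maverick's rate-limited tool-calling test is excluded from its list.
devstral = [4, 4, 4, 3, 4, 4, 4, 1, 2, 3, 3, 2]  # all 12 benchmarks
maverick = [4, 4, 4, 3, 3, 4, 2, 2, 5, 3, 3]     # 11 (no tool calling)

def overall(scores: list[int]) -> float:
    return round(sum(scores) / len(scores), 2)

print(overall(devstral))  # 3.17 -> matches the Devstral Medium card
print(overall(maverick))  # 3.36 -> matches the Llama 4 Maverick card
```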

Frequently Asked Questions