Devstral Medium vs Llama 4 Scout
Llama 4 Scout wins more benchmarks in our testing (4 vs 1) and costs significantly less — $0.08/$0.30 per million tokens input/output versus Devstral Medium's $0.40/$2.00 — making it the stronger general-purpose choice for most workloads. Devstral Medium's sole outright win is agentic planning (4 vs 2), which matters for multi-step autonomous workflows where goal decomposition is critical. If agentic pipelines are your primary use case and budget is secondary, Devstral Medium earns its premium; otherwise, Llama 4 Scout delivers more capability per dollar.
Pricing at a glance:
- Devstral Medium (mistral): $0.40/MTok input, $2.00/MTok output
- Llama 4 Scout (meta-llama): $0.08/MTok input, $0.30/MTok output
Benchmark Analysis
Neither model has a composite average score from our full 12-test internal benchmark suite, so this comparison is built from per-test scores. Across the 12 tests, Llama 4 Scout wins 4, Devstral Medium wins 1, and 7 are ties.
Where Llama 4 Scout wins:
- Tool calling: 4 vs 3. Scout ranks 18th of 54 (tied with 28 others); Devstral Medium ranks 47th of 54. For function selection, argument accuracy, and sequencing in API-driven applications, this is a meaningful gap. Tool calling is the backbone of agentic and integration use cases, which makes Devstral Medium's weakness here notable given its positioning as an agentic model.
- Long context: 5 vs 4. Scout ties for 1st of 55 models (with 36 others); Devstral Medium ranks 38th of 55. Scout's 327,680-token context window dwarfs Devstral Medium's 131,072 tokens, and the benchmark performance matches — Scout hits the top tier on retrieval accuracy at 30K+ tokens while Devstral Medium lands in the bottom half.
- Safety calibration: 2 vs 1. Scout ranks 12th of 55 (tied with 19 others); Devstral Medium ranks 32nd of 55. Scout's score of 2 sits at the field's 75th percentile (p75 = 2), while Devstral Medium's 1 is the floor: the 25th percentile is also 1, so no model scores lower. For applications that need reliable refusal of harmful requests while permitting legitimate ones, this is a real consideration.
- Creative problem solving: 3 vs 2. Scout ranks 30th of 54; Devstral Medium ranks 47th of 54. Scout generates more non-obvious, feasible ideas in our testing. Devstral Medium's score of 2 is well below the field median of 4.
Where Devstral Medium wins:
- Agentic planning: 4 vs 2. Devstral Medium ranks 16th of 54 (tied with 25 others); Llama 4 Scout ranks 53rd of 54 — second to last. This is the starkest gap in the comparison. Goal decomposition and failure recovery are areas where Devstral Medium dramatically outperforms Scout. Scout's score of 2 is near the floor (field median is 4), making it a poor choice for autonomous multi-step workflows.
Where they tie (7 tests):
- Structured output (4/4), strategic analysis (2/2), constrained rewriting (3/3), faithfulness (4/4), classification (4/4), persona consistency (3/3), and multilingual (4/4). On classification they share the top score of 4 (tied for 1st of 53, with 29 others). Both rank in the bottom half on strategic analysis (rank 44 of 54) and persona consistency (rank 45 of 53). The ties on faithfulness (rank 34 of 55) and structured output (rank 26 of 54) mean neither has an edge in RAG applications or JSON compliance.
The most operationally significant divergences: if your pipeline relies on tool (function) calling, Scout is the clear pick; if it requires multi-step planning and autonomous goal execution, Devstral Medium's 4 versus Scout's 2 on agentic planning represents a qualitative difference.
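To make the head-to-head tally concrete, here is a minimal sketch in Python that reproduces the 4-1-7 split from the per-test scores quoted in this section. The scores are transcribed from the bullets above; the dictionary layout is ours for illustration, not modelpicker.net's data format.

```python
# Per-test scores (1-5) transcribed from the analysis above:
# (Llama 4 Scout, Devstral Medium) for each of the 12 tests.
SCORES = {
    "tool_calling":             (4, 3),
    "long_context":             (5, 4),
    "safety_calibration":       (2, 1),
    "creative_problem_solving": (3, 2),
    "agentic_planning":         (2, 4),
    "structured_output":        (4, 4),
    "strategic_analysis":       (2, 2),
    "constrained_rewriting":    (3, 3),
    "faithfulness":             (4, 4),
    "classification":           (4, 4),
    "persona_consistency":      (3, 3),
    "multilingual":             (4, 4),
}

scout_wins = sum(s > d for s, d in SCORES.values())
devstral_wins = sum(d > s for s, d in SCORES.values())
ties = sum(s == d for s, d in SCORES.values())

print(f"Scout wins: {scout_wins}, Devstral wins: {devstral_wins}, ties: {ties}")
# -> Scout wins: 4, Devstral wins: 1, ties: 7
```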
Pricing Analysis
Devstral Medium costs $0.40/M input and $2.00/M output. Llama 4 Scout costs $0.08/M input and $0.30/M output — a 5x gap on input and a 6.7x gap on output. At 1M output tokens/month, Devstral Medium runs $2.00 versus Llama 4 Scout's $0.30, a difference of $1.70. That gap scales fast: at 10M output tokens/month you're paying $20 vs $3, and at 100M, $200 vs $30, or $170/month in savings. For high-throughput applications — document processing, classification pipelines, chat products with real user volume — that difference is a meaningful line in the budget. Llama 4 Scout's pricing puts it near the floor of the market ($0.30/M output against a market range of $0.10–$25.00), while Devstral Medium sits in the lower-mid tier. Devstral Medium's premium is justifiable only if your workload is heavily agentic and the planning gap (4 vs 2 in our tests) directly affects output quality.
Real-World Cost Comparison
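The figures above count output tokens only; real workloads pay for both sides. As a rough sketch, the calculator below extends the same arithmetic to blended input+output costs. The per-million-token prices are the published rates from this page; the 30M/10M monthly volume (a 3:1 input-to-output ratio) is an assumption for illustration, not a measured workload profile.

```python
# Published per-million-token prices from this comparison (USD).
PRICES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "Llama 4 Scout":   {"input": 0.08, "output": 0.30},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Blended monthly cost in USD for a token volume given in millions."""
    p = PRICES[model]
    return input_mtok * p["input"] + output_mtok * p["output"]

# Illustrative workload: 30M input / 10M output tokens per month
# (the 3:1 input:output ratio is an assumption, not a measured profile).
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 30, 10):,.2f}/month")
# Devstral Medium: $32.00/month  (30 * 0.40 + 10 * 2.00)
# Llama 4 Scout:   $5.40/month   (30 * 0.08 + 10 * 0.30)
```

At that illustrative volume the blended gap is roughly 6x, consistent with the per-token ratios above; the absolute dollar difference grows linearly with traffic.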
Bottom Line
Choose Devstral Medium if your primary workload is agentic: autonomous agents that decompose goals, recover from failures, and chain steps over multiple turns. Its score of 4 on agentic planning (rank 16 of 54) versus Llama 4 Scout's 2 (rank 53 of 54) is the largest gap in this comparison, and it's the use case Devstral Medium is explicitly built for. Accept the 6.7x output cost premium only if this capability is genuinely load-bearing in your system.
Choose Llama 4 Scout if you need better tool calling (4 vs 3), stronger long-context retrieval (5 vs 4, top tier vs bottom half), or higher safety calibration (2 vs 1) — and especially if cost matters at any meaningful scale. At $0.30/M output tokens, Scout is one of the most affordable models in the market. It also supports image input (text+image → text) and a 327,680-token context window, versus Devstral Medium's text-only, 131,072-token limit. For general-purpose use, classification, document-heavy workloads, or multimodal tasks, Scout delivers more per dollar.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
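As context for how a 1-5 LLM-judge rubric can be wired up, here is a minimal sketch. It is not our actual harness: `run_judge` is a hypothetical callable standing in for whatever model API a harness would invoke, and the prompt wording is illustrative only.

```python
import re

JUDGE_PROMPT = """You are grading a model's answer to a benchmark task.
Task: {task}
Answer: {answer}
Score the answer from 1 (fails the task) to 5 (fully correct and complete).
Reply with only the integer score."""

def score_answer(task: str, answer: str, run_judge) -> int:
    """Ask a judge model for a 1-5 score and parse the integer reply.

    `run_judge` is a hypothetical callable (prompt text -> reply text)
    standing in for a real model API; it is not part of any published harness.
    """
    reply = run_judge(JUDGE_PROMPT.format(task=task, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())
```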