Codestral 2508 vs Llama 4 Maverick
Codestral 2508 is the stronger choice for API and agentic development work, winning on tool calling (5 vs. unscored), structured output (5 vs. 4), faithfulness (5 vs. 4), long context (5 vs. 4), and agentic planning (4 vs. 3) in our testing. Llama 4 Maverick counters with better persona consistency (5 vs. 3), safety calibration (2 vs. 1), and creative problem solving (3 vs. 2), plus native image input support that Codestral 2508 lacks entirely. At $0.30/$0.90 per MTok input/output vs. Llama 4 Maverick's $0.15/$0.60, Codestral 2508 costs twice as much on input and 1.5× on output — a premium that's justified for coding-focused agentic pipelines, but hard to defend for general-purpose or multimodal use.
Codestral 2508 (Mistral)
Pricing: $0.30/MTok input, $0.90/MTok output

Llama 4 Maverick (Meta)
Pricing: $0.15/MTok input, $0.60/MTok output
Benchmark Analysis
Across the 11 tests where both models were scored in our testing (Llama 4 Maverick's tool calling result was excluded due to a rate limit hit on OpenRouter on 2026-04-13), Codestral 2508 wins 4, Llama 4 Maverick wins 3, and 4 are tied.
Where Codestral 2508 wins:
- Faithfulness (5 vs. 4): Codestral 2508 tied for 1st of 55 models; Llama 4 Maverick ranks 34th. In RAG pipelines or summarization tasks, Codestral 2508 is meaningfully less likely to hallucinate beyond its source material.
- Structured output (5 vs. 4): Codestral 2508 tied for 1st of 54; Llama 4 Maverick ranks 26th. For applications requiring reliable JSON schema compliance — function results, API integrations, data pipelines — Codestral 2508 is the safer choice.
- Tool calling (5 vs. unscored): Codestral 2508 tied for 1st of 54 models on function selection, argument accuracy, and sequencing. Llama 4 Maverick's result is missing due to a transient rate limit; treat this as a data gap, not a confirmed weakness, but Codestral 2508's top-tier result is strong independent evidence.
- Long context (5 vs. 4): Codestral 2508 tied for 1st of 55; Llama 4 Maverick ranks 38th. On raw specs, Llama 4 Maverick actually holds the edge (a 1M token context window vs. Codestral 2508's 256K), but Codestral 2508's retrieval accuracy at 30K+ tokens is demonstrably higher in our tests.
- Agentic planning (4 vs. 3): Codestral 2508 ranks 16th of 54; Llama 4 Maverick ranks 42nd. Goal decomposition and failure recovery are materially better with Codestral 2508 — important for multi-step autonomous agents.
Where Llama 4 Maverick wins:
- Persona consistency (5 vs. 3): Llama 4 Maverick tied for 1st of 53; Codestral 2508 ranks 45th. For chatbot personas, character-driven applications, or any system that needs to resist prompt injection and stay in character, Llama 4 Maverick is clearly superior.
- Safety calibration (2 vs. 1): Llama 4 Maverick ranks 12th of 55; Codestral 2508 ranks 32nd. Both scores sit at or below the field median of 2, but Llama 4 Maverick handles the balance of refusing harmful requests while permitting legitimate ones noticeably better. This matters for consumer-facing deployments.
- Creative problem solving (3 vs. 2): Llama 4 Maverick ranks 30th of 54; Codestral 2508 ranks 47th. Neither model is strong here relative to the field, but Llama 4 Maverick generates more non-obvious, feasible ideas.
Tied tests (both score identically):
- Strategic analysis: both score 2/5, tied at rank 44 of 54 — a shared weakness.
- Constrained rewriting: both score 3/5, tied at rank 31 of 53.
- Classification: both score 3/5, tied at rank 31 of 53.
- Multilingual: both score 4/5, tied at rank 36 of 55.
Neither model has external benchmark data (SWE-bench Verified, AIME 2025, MATH Level 5) available, so no third-party coding or math comparisons are possible.
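The structured-output gap above is concrete in practice: downstream code usually parses the model's JSON reply directly, and a less schema-compliant model fails that parse more often, forcing retries. A minimal sketch (the schema and helper name are hypothetical, not from either model's API) of the kind of validation step where this fidelity difference shows up:

```python
import json

# Hypothetical schema: the fields our pipeline expects in a tool-call reply.
REQUIRED_FIELDS = {"function": str, "arguments": dict}

def parse_tool_call(raw: str) -> dict:
    """Parse a model's tool-call reply, rejecting malformed output.

    A model with weaker structured-output fidelity fails here at a
    higher rate, and each failure means a retry (extra latency + cost).
    """
    call = json.loads(raw)  # raises ValueError on non-JSON text
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(call.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field!r}")
    return call

# A well-formed reply passes; prose or partial JSON raises ValueError.
call = parse_tool_call('{"function": "search", "arguments": {"q": "llama"}}')
```

The stricter this gate, the more a top-tier structured-output score translates into fewer retries per request.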
Pricing Analysis
Codestral 2508 costs $0.30 per million input tokens and $0.90 per million output tokens. Llama 4 Maverick costs $0.15 input and $0.60 output — exactly half the input price and one-third less on output. In practice: at 1M output tokens/month, Codestral 2508 costs $0.90 vs. $0.60 for Llama 4 Maverick — a $0.30 gap that's negligible. At 10M output tokens/month, the gap widens to $3.00 ($9.00 vs. $6.00). At 100M output tokens/month — a realistic volume for a production code assistant or autocomplete service — you're paying $90 vs. $60, a $30/month difference. That's modest in absolute terms, but teams running high-frequency fill-in-the-middle (FIM) workloads at massive scale will feel it. Developers who can use Llama 4 Maverick's multimodal capabilities (image + text) get more capability per dollar. Teams with pure-text coding pipelines where tool calling and structured output fidelity matter most will find Codestral 2508's premium reasonable.
Real-World Cost Comparison
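The arithmetic above can be sketched as a small helper (prices are the ones quoted in this comparison; token volumes are illustrative, and input-side costs are set to zero to match the output-only figures above):

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_price: float, out_price: float) -> float:
    """Dollar cost for a month of usage; prices are per million tokens."""
    return input_mtok * in_price + output_mtok * out_price

CODESTRAL = (0.30, 0.90)  # $/MTok input, $/MTok output
MAVERICK = (0.15, 0.60)

# 100M output tokens/month, input volume ignored as in the prose figures:
c = monthly_cost(0, 100, *CODESTRAL)  # $90/month
m = monthly_cost(0, 100, *MAVERICK)   # $60/month
```

At real input:output ratios (code assistants often send far more input than they receive), the 2× input-price gap widens the difference further, so it's worth plugging in your own traffic mix.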
Bottom Line
Choose Codestral 2508 if you're building agentic coding pipelines, IDE autocomplete (FIM), code correction tools, or any system that depends on reliable tool calling, structured JSON output, and faithful source adherence. Its top-tier scores on those dimensions — tied for 1st of 54 on tool calling and structured output in our tests — make it a purpose-built fit. The 256K context window is ample for most codebases. Pay the $0.30/$0.90 per MTok rate only if these capabilities are core to your workflow.
Choose Llama 4 Maverick if you need multimodal input (image + text), stronger safety calibration for consumer-facing products, or persona-consistent chat applications. At $0.15/$0.60 per MTok, it's half the input cost and handles creative tasks and character consistency better. Its 1M token context window also makes it the only option here for truly massive document analysis. Teams that don't specifically need Codestral 2508's coding specialization will get more flexibility per dollar from Llama 4 Maverick.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.