Codestral 2508 vs Devstral 2 2512
Devstral 2 2512 is the better pick on most of the benchmarks where the two models differ in our testing: it wins 5 of 12 benchmarks to Codestral's 2 (the other 5 are ties), notably constrained_rewriting (5 vs 3) and creative_problem_solving (4 vs 2). Codestral 2508 wins on tool_calling and faithfulness and is substantially cheaper (about 45% of Devstral's per-MTok output cost), so choose it when throughput and cost matter.
Codestral 2508 (Mistral)
Pricing: $0.300/MTok input, $0.900/MTok output
modelpicker.net
Devstral 2 2512 (Mistral)
Pricing: $0.400/MTok input, $2.00/MTok output
Benchmark Analysis
Below are the 12 benchmark comparisons from our testing, with scores, ranking context, and what each difference means in practice:
1) tool_calling: Codestral 2508 scores 5 (tied for 1st of 54) vs Devstral 2 2512 at 4 (rank 18 of 54). Practical: Codestral is stronger at function selection and argument accuracy for automated tool or API calls.
2) faithfulness: Codestral 5 (tied for 1st of 55) vs Devstral 4 (rank 34 of 55). Practical: Codestral sticks to source material more reliably in our tests, reducing hallucinated outputs.
3) constrained_rewriting: Codestral 3 (rank 31 of 53) vs Devstral 5 (tied for 1st). Practical: Devstral is substantially better at squeezing content into tight character/format limits (e.g., microcopy, SMS).
4) creative_problem_solving: Codestral 2 (rank 47 of 54) vs Devstral 4 (rank 9 of 54). Practical: Devstral generates more non-obvious, feasible ideas in brainstorming and design tasks.
5) strategic_analysis: Codestral 2 (rank 44 of 54) vs Devstral 4 (rank 27 of 54). Practical: Devstral is stronger at nuanced tradeoff reasoning and multi-step numeric analysis.
6) persona_consistency: Codestral 3 (rank 45 of 53) vs Devstral 4 (rank 38 of 53). Practical: Devstral holds character and resists injection better in multi-turn persona-driven flows.
7) multilingual: Codestral 4 (rank 36 of 55) vs Devstral 5 (tied for 1st). Practical: Devstral produces more consistent quality across non-English outputs in our tests.
8) structured_output: both 5 (tied for 1st). Practical: Both models adhere to JSON/schema constraints reliably.
9) classification: both 3 (tie; rank 31 of 53). Practical: Neither has a decisive edge on routing/categorization in our suite.
10) long_context: both 5 (tied for 1st). Practical: Both handle 30K+ token retrieval tasks effectively per our tests.
11) safety_calibration: both 1 (tie; rank 32 of 55). Practical: Both models receive the lowest score on safety calibration in our benchmarks, so neither has an edge here.
12) agentic_planning: both 4 (tie; rank 16 of 54). Practical: Both decompose goals and handle recovery similarly.
Summary: Devstral wins five tests (strategic_analysis, constrained_rewriting, creative_problem_solving, persona_consistency, multilingual); Codestral wins two (tool_calling, faithfulness); five tests tie. These results are from our 12-test suite, and the ranking positions above show where the differences matter for real tasks.
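The win/tie summary can be re-derived from the scores listed above. The following is a minimal Python sketch, not part of our test harness: the `SCORES` dictionary and the `tally` helper are illustrative names, and the values are copied by hand from the 12 comparisons.

```python
# Scores copied from the 12 comparisons above, as (Codestral 2508, Devstral 2 2512).
# SCORES and tally() are hypothetical names used only for this illustration.
SCORES = {
    "tool_calling": (5, 4),
    "faithfulness": (5, 4),
    "constrained_rewriting": (3, 5),
    "creative_problem_solving": (2, 4),
    "strategic_analysis": (2, 4),
    "persona_consistency": (3, 4),
    "multilingual": (4, 5),
    "structured_output": (5, 5),
    "classification": (3, 3),
    "long_context": (5, 5),
    "safety_calibration": (1, 1),
    "agentic_planning": (4, 4),
}


def tally(scores):
    """Split benchmark names into Codestral wins, Devstral wins, and ties."""
    codestral_wins, devstral_wins, ties = [], [], []
    for name, (codestral, devstral) in scores.items():
        if codestral > devstral:
            codestral_wins.append(name)
        elif devstral > codestral:
            devstral_wins.append(name)
        else:
            ties.append(name)
    return codestral_wins, devstral_wins, ties


if __name__ == "__main__":
    c_wins, d_wins, tied = tally(SCORES)
    print(f"Codestral wins {len(c_wins)}: {', '.join(c_wins)}")
    print(f"Devstral wins {len(d_wins)}: {', '.join(d_wins)}")
    print(f"Ties {len(tied)}: {', '.join(tied)}")
```

A raw win count treats every benchmark as equally important; if your workload leans on only a few of these capabilities, comparing those rows directly is likely more informative than the overall tally.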
Pricing Analysis
Pricing in the payload is expressed per MTok (1 MTok = 1,000,000 tokens). Using the provided rates, Codestral 2508 charges $0.30 input / $0.90 output per MTok; Devstral 2 2512 charges $0.40 input / $2.00 output per MTok. If you assume a 50/50 split of input vs output tokens, the cost per 1M total tokens is: Codestral = $0.60 (0.30 × 0.5 + 0.90 × 0.5 = $0.15 + $0.45), Devstral = $1.20 (0.40 × 0.5 + 2.00 × 0.5 = $0.20 + $1.00). Costs scale linearly: for 10M tokens/month (50/50) Codestral = $6 vs Devstral = $12; for 100M tokens/month Codestral = $60 vs Devstral = $120. If your workload is output-heavy (more generated tokens), Devstral's $2.00/MTok output rate widens the gap: per 1M output tokens alone, Codestral = $0.90 vs Devstral = $2.00 (a difference of $1.10). The payload's priceRatio (0.45) aligns with this: Codestral costs about 45% of Devstral on output unit pricing. At a 50/50 split Codestral's bill is half of Devstral's, so the absolute savings reach thousands of dollars per month only at very high volumes (roughly $6,000 vs $12,000 at 10 billion tokens/month). Teams with high-volume, latency-sensitive code generation should prefer Codestral; teams that need the extra reasoning/creative strengths of Devstral may accept the higher bill.
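To check the arithmetic for your own traffic mix, here is a minimal Python sketch. It assumes the listed rates are per million tokens (MTok), as shown in the pricing figures at the top of the page; `PRICES` and `blended_cost` are illustrative names, not part of any provider API.

```python
# Listed rates in USD per MTok, as (input, output).
# Assumption: 1 MTok = 1,000,000 tokens.
PRICES = {
    "Codestral 2508": (0.30, 0.90),
    "Devstral 2 2512": (0.40, 2.00),
}

TOKENS_PER_MTOK = 1_000_000


def blended_cost(model, total_tokens, output_share=0.5):
    """Estimate the USD cost of total_tokens, given the fraction that is output."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    input_rate, output_rate = PRICES[model]
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - output_tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / TOKENS_PER_MTOK


if __name__ == "__main__":
    # Monthly volumes to compare, all at a 50/50 input/output split.
    for volume in (1_000_000, 10_000_000, 100_000_000):
        for model in PRICES:
            cost = blended_cost(model, volume)
            print(f"{model}: {volume:,} tokens -> ${cost:,.2f}")
```

Changing `output_share` shows how the gap moves with workload shape: the closer it gets to 1.0, the closer Codestral's relative cost gets to the 0.45 output price ratio.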
Bottom Line
Choose Codestral 2508 if you need the best tool calling and strict faithfulness at high throughput and lower cost: it is tied for 1st on tool_calling and faithfulness and costs about 45% of Devstral per output MTok. Choose Devstral 2 2512 if your priority is creative problem solving, strategic analysis, constrained rewriting (tight-character work), or multilingual output: it wins 5 of 12 benchmarks and is tied for 1st on constrained_rewriting and multilingual in our tests. If budget is the primary constraint and you generate many output tokens, Codestral is the pragmatic choice; if capability for hard reasoning or cross-language quality is essential, invest in Devstral.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.