Codestral 2508 vs Devstral 2 2512

Devstral 2 2512 is the better pick for the majority of benchmarked tasks in our testing — it wins 5 of 12 benchmarks, notably constrained_rewriting (5 vs 3) and creative_problem_solving (4 vs 2). Codestral 2508 wins on tool_calling and faithfulness and is substantially cheaper (about 45% of Devstral's output price per MTok), so choose it when throughput and cost matter.

Codestral 2508 (Mistral)

Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 5/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.300/MTok
Output: $0.900/MTok
Context Window: 256K

modelpicker.net

Devstral 2 2512 (Mistral)

Overall: 4.00/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 4/5
Persona Consistency: 4/5
Constrained Rewriting: 5/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 262K


Benchmark Analysis

Below are the 12 benchmark comparisons from our testing, with scores, ranking context, and what each difference means in practice:

1) tool_calling — Codestral 2508: 5 (tied for 1st of 54) vs Devstral 2 2512: 4 (rank 18 of 54). Practical: Codestral is stronger at function selection and argument accuracy for automated tool or API calls.
2) faithfulness — Codestral: 5 (tied for 1st of 55) vs Devstral: 4 (rank 34 of 55). Practical: Codestral sticks to source material more reliably in our tests, reducing hallucinated outputs.
3) constrained_rewriting — Codestral: 3 (rank 31 of 53) vs Devstral: 5 (tied for 1st). Practical: Devstral is substantially better at squeezing content into tight character/format limits (e.g., microcopy, SMS).
4) creative_problem_solving — Codestral: 2 (rank 47 of 54) vs Devstral: 4 (rank 9 of 54). Practical: Devstral generates more non-obvious, feasible ideas in brainstorming and design tasks.
5) strategic_analysis — Codestral: 2 (rank 44 of 54) vs Devstral: 4 (rank 27 of 54). Practical: Devstral is stronger at nuanced tradeoff reasoning and multi-step numeric analysis.
6) persona_consistency — Codestral: 3 (rank 45 of 53) vs Devstral: 4 (rank 38 of 53). Practical: Devstral holds character and resists injection better in multi-turn persona-driven flows.
7) multilingual — Codestral: 4 (rank 36 of 55) vs Devstral: 5 (tied for 1st). Practical: Devstral produces higher parity across non-English outputs in our tests.
8) structured_output — both: 5 (tied for 1st). Practical: both models adhere to JSON/schema constraints reliably.
9) classification — both: 3 (tie; rank 31 of 53). Practical: neither has a decisive edge on routing/categorization in our suite.
10) long_context — both: 5 (tied for 1st). Practical: both handle 30K+ token retrieval tasks effectively per our tests.
11) safety_calibration — both: 1 (tie; rank 32 of 55). Practical: both models are conservative on safety calibration in our benchmarks.
12) agentic_planning — both: 4 (tie; rank 16 of 54). Practical: both decompose goals and handle recovery similarly.

Summary: Devstral wins five tests (strategic_analysis, constrained_rewriting, creative_problem_solving, persona_consistency, multilingual); Codestral wins two (tool_calling, faithfulness); five tests tie. These results come from our 12-test suite; the ranking positions above show where the differences matter for real tasks.

Benchmark | Codestral 2508 | Devstral 2 2512
Faithfulness | 5/5 | 4/5
Long Context | 5/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 5/5 | 4/5
Classification | 3/5 | 3/5
Agentic Planning | 4/5 | 4/5
Structured Output | 5/5 | 5/5
Safety Calibration | 1/5 | 1/5
Strategic Analysis | 2/5 | 4/5
Persona Consistency | 3/5 | 4/5
Constrained Rewriting | 3/5 | 5/5
Creative Problem Solving | 2/5 | 4/5
Summary | 2 wins | 5 wins
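The Summary row can be cross-checked mechanically from the score pairs; a quick sketch (scores copied from the table, left value Codestral, right value Devstral):

```python
# Tally head-to-head results from the 12-benchmark comparison table.
scores = {
    "faithfulness": (5, 4),
    "long_context": (5, 5),
    "multilingual": (4, 5),
    "tool_calling": (5, 4),
    "classification": (3, 3),
    "agentic_planning": (4, 4),
    "structured_output": (5, 5),
    "safety_calibration": (1, 1),
    "strategic_analysis": (2, 4),
    "persona_consistency": (3, 4),
    "constrained_rewriting": (3, 5),
    "creative_problem_solving": (2, 4),
}
codestral_wins = sum(c > d for c, d in scores.values())
devstral_wins = sum(d > c for c, d in scores.values())
ties = sum(c == d for c, d in scores.values())
print(codestral_wins, devstral_wins, ties)  # 2 5 5
```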

Pricing Analysis

Pricing is expressed per MTok (1 MTok = 1 million tokens): Codestral 2508 charges $0.30 input / $0.90 output per MTok; Devstral 2 2512 charges $0.40 input / $2.00 output per MTok. Assuming a 50/50 split of input vs output tokens, cost per 1M total tokens: Codestral = $0.60 ($0.30 × 0.5 + $0.90 × 0.5 = $0.15 + $0.45), Devstral = $1.20 ($0.40 × 0.5 + $2.00 × 0.5 = $0.20 + $1.00). Scale linearly: for 10M tokens/month (50/50) Codestral = $6 vs Devstral = $12; for 100M tokens/month Codestral = $60 vs Devstral = $120; at 1B tokens/month the bill is $600 vs $1,200. If your workload is output-heavy (more generated tokens than prompt tokens), Devstral's $2.00/MTok output rate drives larger gaps: per 1M output tokens alone, Codestral = $0.90 vs Devstral = $2.00 (a $1.10 difference). The output price ratio ($0.90 / $2.00 = 0.45) is where the ~45% figure comes from: Codestral costs about 45% of Devstral on output unit pricing. Teams with high-volume, latency-sensitive code generation should prefer Codestral; at billion-token scale the gap reaches hundreds of dollars per month. Teams that need Devstral's extra reasoning and creative strengths may accept the higher bill.
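The volume math reduces to a one-line formula; a minimal sketch, assuming MTok denotes one million tokens (the standard convention, and the one the per-task cost table below implies), with `monthly_cost` a hypothetical helper name rather than any published calculator:

```python
def monthly_cost(total_tokens, input_share, input_rate, output_rate):
    """Dollar cost for a month of traffic; rates are $ per MTok (1M tokens)."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# 10M tokens/month at a 50/50 input/output split:
codestral = monthly_cost(10_000_000, 0.5, 0.30, 0.90)  # ≈ $6.00
devstral = monthly_cost(10_000_000, 0.5, 0.40, 2.00)   # ≈ $12.00
```

At an output-heavy 20/80 split, the same 10M tokens cost ≈ $7.80 on Codestral vs ≈ $16.80 on Devstral, which is why the output rate dominates at scale.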

Real-World Cost Comparison

Task | Codestral 2508 | Devstral 2 2512
Chat response | <$0.001 | $0.0011
Blog post | $0.0020 | $0.0042
Document batch | $0.051 | $0.108
Pipeline run | $0.510 | $1.08
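Per-task costs follow from the same published rates once token counts are fixed; a sketch where the task size (500 prompt tokens in, 2,000 generated tokens out for a blog post) is our own illustrative assumption, not the workload definition behind the table above:

```python
PRICES = {  # ($ per MTok input, $ per MTok output), from the pricing sections above
    "Codestral 2508": (0.30, 0.90),
    "Devstral 2 2512": (0.40, 2.00),
}

def task_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single request at the published per-MTok rates."""
    rate_in, rate_out = PRICES[model]
    return input_tokens / 1e6 * rate_in + output_tokens / 1e6 * rate_out

# Hypothetical blog-post job: 500 prompt tokens, 2,000 generated tokens
task_cost("Codestral 2508", 500, 2000)   # ≈ $0.0020
task_cost("Devstral 2 2512", 500, 2000)  # ≈ $0.0042
```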

Bottom Line

Choose Codestral 2508 if you need the best tool calling and strict faithfulness at high throughput and lower cost — it's tied for 1st on tool_calling and faithfulness and costs ~45% of Devstral's output price per MTok. Choose Devstral 2 2512 if your priority is creative problem solving, strategic analysis, constrained rewriting (tight-character work), or multilingual output — it wins 5 of 12 benchmarks and is tied for 1st on constrained_rewriting and multilingual in our tests. If budget is the primary constraint and you generate many output tokens, Codestral is the pragmatic choice; if capability for hard reasoning or cross-language quality is essential, invest in Devstral.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
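The overall ratings shown above (3.50/5 and 4.00/5) are consistent with an unweighted mean of the twelve per-benchmark scores; this equal-weighting is an assumption on our part, since the judge-based methodology may aggregate differently:

```python
# Per-benchmark scores in scorecard order, from the two model cards above.
codestral = [5, 5, 4, 5, 3, 4, 5, 1, 2, 3, 3, 2]
devstral = [4, 5, 5, 4, 3, 4, 5, 1, 4, 4, 5, 4]

overall_codestral = sum(codestral) / len(codestral)  # 3.5
overall_devstral = sum(devstral) / len(devstral)     # 4.0
```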

Frequently Asked Questions