Codestral 2508 vs Devstral 2 2512

Devstral 2 2512 is the better pick for the majority of benchmarked tasks in our testing — it wins 5 of 12 benchmarks, notably constrained_rewriting (5 vs 3) and creative_problem_solving (4 vs 2). Codestral 2508 wins on tool_calling and faithfulness and is substantially cheaper (about 45% of Devstral's output price per MTok), so choose it when throughput and cost matter.

Codestral 2508 (Mistral)

Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 5/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.300/MTok
Output: $0.900/MTok
Context Window: 256K

modelpicker.net

Devstral 2 2512 (Mistral)

Overall: 4.00/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 4/5
Persona Consistency: 4/5
Constrained Rewriting: 5/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 262K


Benchmark Analysis

Below are the 12 benchmark comparisons from our testing, with scores, ranking context, and what each difference means in practice:

1) tool_calling — Codestral 2508: 5 (tied for 1st of 54) vs Devstral 2 2512: 4 (rank 18 of 54). Practical: Codestral is stronger at function selection and argument accuracy for automated tool or API calls.
2) faithfulness — Codestral: 5 (tied for 1st of 55) vs Devstral: 4 (rank 34 of 55). Practical: Codestral sticks to source material more reliably in our tests, reducing hallucinated outputs.
3) constrained_rewriting — Codestral: 3 (rank 31 of 53) vs Devstral: 5 (tied for 1st). Practical: Devstral is substantially better at squeezing content into tight character/format limits (e.g., microcopy, SMS).
4) creative_problem_solving — Codestral: 2 (rank 47 of 54) vs Devstral: 4 (rank 9 of 54). Practical: Devstral generates more non-obvious, feasible ideas in brainstorming and design tasks.
5) strategic_analysis — Codestral: 2 (rank 44 of 54) vs Devstral: 4 (rank 27 of 54). Practical: Devstral is stronger at nuanced tradeoff reasoning and multi-step numeric analysis.
6) persona_consistency — Codestral: 3 (rank 45 of 53) vs Devstral: 4 (rank 38 of 53). Practical: Devstral holds character and resists injection better in multi-turn persona-driven flows.
7) multilingual — Codestral: 4 (rank 36 of 55) vs Devstral: 5 (tied for 1st). Practical: Devstral produces higher parity across non-English outputs in our tests.
8) structured_output — both: 5 (tied for 1st). Practical: both models adhere to JSON/schema constraints reliably.
9) classification — both: 3 (tie; rank 31 of 53). Practical: neither has a decisive edge on routing/categorization in our suite.
10) long_context — both: 5 (tied for 1st). Practical: both handle 30K+ token retrieval tasks effectively per our tests.
11) safety_calibration — both: 1 (tie; rank 32 of 55). Practical: both models are conservative on safety calibration in our benchmarks.
12) agentic_planning — both: 4 (tie; rank 16 of 54). Practical: both decompose goals and handle recovery similarly.

Summary: Devstral wins five tests (strategic_analysis, constrained_rewriting, creative_problem_solving, persona_consistency, multilingual); Codestral wins two (tool_calling, faithfulness); five tests tie. These results come from our 12-test suite; the ranking positions above show where the differences matter for real tasks.

Benchmark | Codestral 2508 | Devstral 2 2512
Faithfulness | 5/5 | 4/5
Long Context | 5/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 5/5 | 4/5
Classification | 3/5 | 3/5
Agentic Planning | 4/5 | 4/5
Structured Output | 5/5 | 5/5
Safety Calibration | 1/5 | 1/5
Strategic Analysis | 2/5 | 4/5
Persona Consistency | 3/5 | 4/5
Constrained Rewriting | 3/5 | 5/5
Creative Problem Solving | 2/5 | 4/5
Summary | 2 wins | 5 wins
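The Summary row can be cross-checked mechanically from the score pairs; a quick sketch (scores copied from the table, left value Codestral, right value Devstral):

```python
# Tally head-to-head results from the 12-benchmark comparison table.
scores = {
    "faithfulness": (5, 4),
    "long_context": (5, 5),
    "multilingual": (4, 5),
    "tool_calling": (5, 4),
    "classification": (3, 3),
    "agentic_planning": (4, 4),
    "structured_output": (5, 5),
    "safety_calibration": (1, 1),
    "strategic_analysis": (2, 4),
    "persona_consistency": (3, 4),
    "constrained_rewriting": (3, 5),
    "creative_problem_solving": (2, 4),
}
codestral_wins = sum(c > d for c, d in scores.values())
devstral_wins = sum(d > c for c, d in scores.values())
ties = sum(c == d for c, d in scores.values())
print(codestral_wins, devstral_wins, ties)  # 2 5 5
```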

Pricing Analysis

Pricing is expressed per MTok (1 MTok = 1 million tokens): Codestral 2508 charges $0.30 input / $0.90 output per MTok; Devstral 2 2512 charges $0.40 input / $2.00 output per MTok. Assuming a 50/50 split of input vs output tokens, cost per 1M total tokens: Codestral = $0.60 ($0.30 × 0.5 + $0.90 × 0.5 = $0.15 + $0.45), Devstral = $1.20 ($0.40 × 0.5 + $2.00 × 0.5 = $0.20 + $1.00). Scale linearly: for 10M tokens/month (50/50) Codestral = $6 vs Devstral = $12; for 100M tokens/month Codestral = $60 vs Devstral = $120; at 1B tokens/month the bill is $600 vs $1,200. If your workload is output-heavy (more generated tokens than prompt tokens), Devstral's $2.00/MTok output rate drives larger gaps: per 1M output tokens alone, Codestral = $0.90 vs Devstral = $2.00 (a $1.10 difference). The output price ratio ($0.90 / $2.00 = 0.45) is where the ~45% figure comes from: Codestral costs about 45% of Devstral on output unit pricing. Teams with high-volume, latency-sensitive code generation should prefer Codestral; at billion-token scale the gap reaches hundreds of dollars per month. Teams that need Devstral's extra reasoning and creative strengths may accept the higher bill.
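The volume math reduces to a one-line formula; a minimal sketch, assuming MTok denotes one million tokens (the standard convention, and the one the per-task cost table below implies), with `monthly_cost` a hypothetical helper name rather than any published calculator:

```python
def monthly_cost(total_tokens, input_share, input_rate, output_rate):
    """Dollar cost for a month of traffic; rates are $ per MTok (1M tokens)."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# 10M tokens/month at a 50/50 input/output split:
codestral = monthly_cost(10_000_000, 0.5, 0.30, 0.90)  # ≈ $6.00
devstral = monthly_cost(10_000_000, 0.5, 0.40, 2.00)   # ≈ $12.00
```

At an output-heavy 20/80 split, the same 10M tokens cost ≈ $7.80 on Codestral vs ≈ $16.80 on Devstral, which is why the output rate dominates at scale.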

Real-World Cost Comparison

Task | Codestral 2508 | Devstral 2 2512
Chat response | <$0.001 | $0.0011
Blog post | $0.0020 | $0.0042
Document batch | $0.051 | $0.108
Pipeline run | $0.510 | $1.08
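Per-task costs follow from the same published rates once token counts are fixed; a sketch where the task size (500 prompt tokens in, 2,000 generated tokens out for a blog post) is our own illustrative assumption, not the workload definition behind the table above:

```python
PRICES = {  # ($ per MTok input, $ per MTok output), from the pricing sections above
    "Codestral 2508": (0.30, 0.90),
    "Devstral 2 2512": (0.40, 2.00),
}

def task_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single request at the published per-MTok rates."""
    rate_in, rate_out = PRICES[model]
    return input_tokens / 1e6 * rate_in + output_tokens / 1e6 * rate_out

# Hypothetical blog-post job: 500 prompt tokens, 2,000 generated tokens
task_cost("Codestral 2508", 500, 2000)   # ≈ $0.0020
task_cost("Devstral 2 2512", 500, 2000)  # ≈ $0.0042
```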

Bottom Line

Choose Codestral 2508 if you need the best tool calling and strict faithfulness at high throughput and lower cost — it's tied for 1st on tool_calling and faithfulness and costs ~45% of Devstral's output price per MTok. Choose Devstral 2 2512 if your priority is creative problem solving, strategic analysis, constrained rewriting (tight-character work), or multilingual output — it wins 5 of 12 benchmarks and is tied for 1st on constrained_rewriting and multilingual in our tests. If budget is the primary constraint and you generate many output tokens, Codestral is the pragmatic choice; if capability for hard reasoning or cross-language quality is essential, invest in Devstral.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
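The overall ratings shown above (3.50/5 and 4.00/5) are consistent with an unweighted mean of the twelve per-benchmark scores; this equal-weighting is an assumption on our part, since the judge-based methodology may aggregate differently:

```python
# Per-benchmark scores in scorecard order, from the two model cards above.
codestral = [5, 5, 4, 5, 3, 4, 5, 1, 2, 3, 3, 2]
devstral = [4, 5, 5, 4, 3, 4, 5, 1, 4, 4, 5, 4]

overall_codestral = sum(codestral) / len(codestral)  # 3.5
overall_devstral = sum(devstral) / len(devstral)     # 4.0
```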

Frequently Asked Questions