Codestral 2508 vs Devstral Small 1.1
In our testing, Codestral 2508 is the better pick for developer workflows and code-heavy agent use: it wins 6 of 12 benchmarks (tool calling, faithfulness, structured output, long context, agentic planning, persona consistency). Devstral Small 1.1 is a strong cost-saving alternative that wins classification and safety_calibration and is priced at one-third the input and output cost of Codestral.
Codestral 2508 (mistral)
Pricing: $0.300/MTok input, $0.900/MTok output

Devstral Small 1.1 (mistral)
Pricing: $0.100/MTok input, $0.300/MTok output

Source: modelpicker.net
Benchmark Analysis
We ran both models across our 12-test suite and report wins, losses, and ties below; all claims come from our testing.

Codestral wins: structured_output (5 vs 4), tool_calling (5 vs 4), faithfulness (5 vs 4), long_context (5 vs 4), agentic_planning (4 vs 2), persona_consistency (3 vs 2). For context: structured_output measures JSON/schema compliance (Codestral tied for 1st with 24 others out of 54); tool_calling tests function selection and sequencing (tied for 1st with 16 others out of 54); faithfulness checks hallucination risk (tied for 1st with 32 others out of 55). In practice, Codestral is the more reliable choice when you need precise structured outputs, long-context retrieval at 30K+ tokens, accurate function arguments, and conservative adherence to source material, all of which matter for fill-in-the-middle (FIM) completion, test generation, and multi-step code agents.

Devstral wins: classification (4 vs 3) and safety_calibration (2 vs 1). Classification is Devstral's strength (tied for 1st with 29 others out of 53), which matters for routing, intent detection, and label-heavy automation; safety_calibration (Devstral ranks 12 of 55) means it refused or permitted content more appropriately in our tests.

Ties (no clear winner): strategic_analysis (2/2), constrained_rewriting (3/3), creative_problem_solving (2/2), multilingual (4/4).

Practical takeaway: pick Codestral when you need higher-fidelity code and tool workflows and long contexts; pick Devstral when classification accuracy and slightly better safety behavior matter and budget is a constraint.
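To make "JSON/schema compliance" concrete, here is a minimal sketch of the kind of check a structured-output test implies: the model's raw reply must parse as JSON and contain the expected typed fields. The field names and sample replies are illustrative, not taken from our benchmark harness.

```python
import json

# Illustrative required fields for a hypothetical code-review reply schema.
REQUIRED = {"file": str, "line": int, "severity": str}

def is_schema_compliant(reply: str) -> bool:
    """Return True if `reply` is valid JSON with all required typed fields."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # Every required key must be present with the expected type.
    return all(isinstance(obj.get(key), typ) for key, typ in REQUIRED.items())

# A compliant reply passes; prose-wrapped or incomplete JSON fails.
print(is_schema_compliant('{"file": "app.py", "line": 42, "severity": "warn"}'))
print(is_schema_compliant('Sure! Here is the JSON: {"file": "app.py"}'))
```

Models that score lower on this benchmark tend to fail in the second way, wrapping otherwise-valid JSON in conversational text or dropping required fields.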
Pricing Analysis
Per million tokens, Codestral 2508 costs $0.30 input / $0.90 output; Devstral Small 1.1 costs $0.10 input / $0.30 output, a 3x price ratio on both sides. Assuming a 50/50 split of input and output tokens, 1M total tokens costs $0.60 on Codestral vs $0.20 on Devstral. At 10M tokens/month that is $6.00 vs $2.00; at 100M tokens/month, $60.00 vs $20.00. The gap scales linearly, so teams running high-volume agents, CI test generation, or large-context pipelines should budget an extra $40/month per 100M tokens if they choose Codestral; small teams, prototypes, and cost-sensitive production routes get the same throughput at one-third the cost with Devstral.
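The blended-cost arithmetic above can be sketched as a small calculator; the prices come from the comparison, while the 50/50 input/output split is an assumption you should replace with your workload's actual ratio.

```python
# Prices in $/MTok, as listed in the comparison above.
PRICES = {
    "Codestral 2508": {"input": 0.30, "output": 0.90},
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
}

def cost(model: str, total_tokens: int, input_share: float = 0.5) -> float:
    """Dollar cost for `total_tokens` tokens at the given input/output split."""
    p = PRICES[model]
    millions = total_tokens / 1_000_000
    return millions * (input_share * p["input"] + (1 - input_share) * p["output"])

for model in PRICES:
    print(f"{model}: ${cost(model, 100_000_000):.2f} per 100M tokens/month")
```

A workload that is mostly input (e.g. long-context retrieval with short answers) shifts the blend toward the cheaper input rate, so the absolute gap shrinks even though the 3x ratio holds.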
Bottom Line
Choose Codestral 2508 if you build code-generation agents, need reliable tool calling or strict JSON/structured outputs, or depend on long-context (30K+) retrieval: it scored 5 in tool_calling, structured_output, faithfulness, and long_context, and tied for 1st on several of those tests in our testing. Choose Devstral Small 1.1 if you must minimize runtime costs or prioritize classification and safety calibration (it scores 4 in classification and 2 in safety_calibration): it costs roughly one-third as much per token and is the better fit for high-volume, budget-sensitive classification pipelines.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.