Claude Sonnet 4.6 vs Codestral 2508

Claude Sonnet 4.6 is the better choice for high-stakes, multilingual, safety-sensitive, and creative workflows: it wins 7 of 12 tests, including safety_calibration (5 vs 1) and creative_problem_solving (5 vs 2). Codestral 2508 wins on structured_output (5 vs 4) and is the cost-efficient pick for high-volume, schema-focused tasks given its much lower pricing ($0.30/$0.90 vs $3.00/$15.00 per million tokens).

Anthropic

Claude Sonnet 4.6

Overall
4.67/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
75.2%
MATH Level 5
N/A
AIME 2025
85.8%

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window: 1,000K tokens


Mistral

Codestral 2508

Overall
3.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
4/5
Tool Calling
5/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.300/MTok

Output

$0.900/MTok

Context Window: 256K tokens


Benchmark Analysis

Head-to-head by test (scores from our 12-test suite):

  • strategic_analysis: Claude Sonnet 4.6 5 vs Codestral 2508 2 — Sonnet wins; Sonnet ranks 1 of 54 (tied with 25 others) while Codestral ranks 44 of 54. This matters for nuanced tradeoff reasoning and numeric decision-making.
  • creative_problem_solving: 5 vs 2 — Sonnet wins and ranks tied 1st (7 others); Codestral ranks 47 of 54. Expect Sonnet to generate more non-obvious, feasible ideas.
  • classification: 4 vs 3 — Sonnet wins, tied for 1st (29 others); Codestral is mid-table (rank 31 of 53). Sonnet is better for routing and accurate labeling.
  • safety_calibration: 5 vs 1 — Sonnet decisively wins, tied for 1st; Codestral ranks 32 of 55. For refusal/allow decisions and reducing harmful outputs, Sonnet is strongly superior.
  • persona_consistency: 5 vs 3 — Sonnet wins, tied for 1st; Codestral is low (rank 45). Sonnet better resists prompt injection and keeps a consistent character.
  • agentic_planning: 5 vs 4 — Sonnet wins (tied 1st); Codestral is solid (rank 16). Sonnet is preferable for goal decomposition and failure recovery.
  • multilingual: 5 vs 4 — Sonnet wins and is tied for 1st; Codestral is mid-ranked (36 of 55). For non-English output Sonnet offers higher parity.
  • structured_output: 4 vs 5 — Codestral wins and is tied for 1st (24 others); Sonnet is mid (rank 26). If strict JSON/schema compliance is the priority, Codestral holds the edge.
  • constrained_rewriting: tie 3 vs 3 — both rank 31; neither is advantaged on tight compression tasks.
  • tool_calling: tie 5 vs 5 — both tied for 1st; both models select functions and arguments well in our tests.
  • faithfulness: tie 5 vs 5 — both tied for 1st; both stick to source material in our suite.
  • long_context: tie 5 vs 5 — both tied for 1st; both maintain retrieval accuracy at 30K+ tokens.
Additional external results: Claude Sonnet 4.6 scores 75.2% on SWE-bench Verified and 85.8% on AIME 2025 (Epoch AI), supplementary measures that align with Sonnet's coding and math strengths. Codestral 2508 has no published SWE-bench or AIME entries in our data. Overall, Claude Sonnet 4.6 wins 7 tests to Codestral 2508's 1, with 4 ties.
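For readers who want to re-derive that 7-1-4 summary, here is a minimal Python sketch that tallies wins and ties directly from the per-test score pairs listed above; the scores are transcribed from our suite, and the snake_case keys mirror the test names used in the bullets.

```python
# Tally head-to-head wins and ties from the 12-test scores above.
SCORES = {  # test: (Claude Sonnet 4.6, Codestral 2508)
    "faithfulness": (5, 5), "long_context": (5, 5), "multilingual": (5, 4),
    "tool_calling": (5, 5), "classification": (4, 3), "agentic_planning": (5, 4),
    "structured_output": (4, 5), "safety_calibration": (5, 1),
    "strategic_analysis": (5, 2), "persona_consistency": (5, 3),
    "constrained_rewriting": (3, 3), "creative_problem_solving": (5, 2),
}

sonnet_wins = sum(a > b for a, b in SCORES.values())     # tests Sonnet wins
codestral_wins = sum(b > a for a, b in SCORES.values())  # tests Codestral wins
ties = sum(a == b for a, b in SCORES.values())

# Prints: Sonnet 7, Codestral 1, ties 4
print(f"Sonnet {sonnet_wins}, Codestral {codestral_wins}, ties {ties}")
```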
Benchmark                  Claude Sonnet 4.6   Codestral 2508
Faithfulness               5/5                 5/5
Long Context               5/5                 5/5
Multilingual               5/5                 4/5
Tool Calling               5/5                 5/5
Classification             4/5                 3/5
Agentic Planning           5/5                 4/5
Structured Output          4/5                 5/5
Safety Calibration         5/5                 1/5
Strategic Analysis         5/5                 2/5
Persona Consistency        5/5                 3/5
Constrained Rewriting      3/5                 3/5
Creative Problem Solving   5/5                 2/5
Summary                    7 wins              1 win

Pricing Analysis

Per the listed pricing, Claude Sonnet 4.6 charges $3.00 input / $15.00 output per million tokens (MTok); Codestral 2508 charges $0.30 input / $0.90 output per MTok. Using a 50/50 input/output split as an example: 1M tokens costs $9.00 on Claude (0.5 MTok × $3 + 0.5 MTok × $15) vs $0.60 on Codestral (0.5 × $0.30 + 0.5 × $0.90). At 10M tokens/month that is $90 vs $6; at 100M tokens/month, $900 vs $60. On output tokens Sonnet is ~16.7× more expensive ($15 vs $0.90); on input the ratio is 10× ($3 vs $0.30), which blends to 15× at a 50/50 split. Teams with tight budgets or very high throughput (bots, logging, automated test generation at scale) should prefer Codestral 2508; teams that need top-tier safety, multilingual support, creative outputs, or agentic planning should weigh Sonnet 4.6's higher cost against its benchmark advantages.
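To make the arithmetic concrete, the sketch below reproduces the volume estimates above from the listed per-MTok prices. The 50/50 input/output split is the same illustrative assumption used in the paragraph, not a measured usage profile; adjust input_share for your own workload.

```python
# Cost sketch using the listed per-MTok prices and an assumed 50/50
# input/output token split (illustrative, not a usage measurement).

PRICES = {  # model: (input $/MTok, output $/MTok)
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Codestral 2508": (0.30, 0.90),
}

def blended_cost(total_tokens: int, in_price: float, out_price: float,
                 input_share: float = 0.5) -> float:
    """Dollar cost for total_tokens split between input and output."""
    mtok = total_tokens / 1_000_000
    return mtok * (input_share * in_price + (1 - input_share) * out_price)

for volume in (1_000_000, 10_000_000, 100_000_000):
    for model, (p_in, p_out) in PRICES.items():
        cost = blended_cost(volume, p_in, p_out)
        print(f"{volume:>11,} tokens | {model:<18} ${cost:,.2f}")
```

Running it prints $9.00 vs $0.60 at 1M tokens, $90 vs $6 at 10M, and $900 vs $60 at 100M, matching the figures above.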

Real-World Cost Comparison

Task             Claude Sonnet 4.6   Codestral 2508
Chat response    $0.0081             <$0.001
Blog post        $0.032              $0.0020
Document batch   $0.810              $0.051
Pipeline run     $8.10               $0.510

Bottom Line

Choose Claude Sonnet 4.6 if: you need top safety calibration, multilingual parity, creative problem solving, strategic analysis, or agentic planning; Sonnet scores 5/5 in each of these areas and wins 7 of 12 benchmarks. The tradeoff is higher spend ($3.00 input / $15.00 output per MTok).
Choose Codestral 2508 if: you require strict structured outputs and schema compliance (Codestral scores 5/5 and is tied for 1st), are optimizing for low latency or very high token volumes, or need a dramatically cheaper model ($0.30 input / $0.90 output per MTok). Codestral is the pragmatic choice for high-frequency coding, fill-in-the-middle (FIM), and schema-first workloads; a sketch of a schema-constrained request follows.
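As an illustration of the schema-first workload Codestral targets, here is a minimal sketch of a JSON-constrained request against Mistral's chat completions endpoint. The endpoint URL and json_object response format come from Mistral's documented API; the model ID string, the prompt, and the label/confidence schema are illustrative assumptions, and we have not verified that every Codestral deployment honors json_object.

```python
# Minimal sketch: schema-first JSON request to Mistral's chat completions API.
# Assumptions: MISTRAL_API_KEY is set, "codestral-2508" is available on your
# account, and the model honors the json_object response format.
import json
import os

import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "codestral-2508",
        "response_format": {"type": "json_object"},  # request strict JSON output
        "messages": [
            {"role": "system", "content": "Reply with JSON matching "
             '{"label": string, "confidence": number}.'},
            {"role": "user", "content": "Classify: 'please refund my order'"},
        ],
    },
    timeout=30,
)
resp.raise_for_status()

# Parse the model's JSON reply and read the (assumed) schema fields.
reply = json.loads(resp.json()["choices"][0]["message"]["content"])
print(reply["label"], reply["confidence"])
```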

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions