Claude Haiku 4.5 vs Codestral 2508 for Coding

Winner: Codestral 2508. The Coding task in our suite is measured primarily by structured_output (JSON/schema adherence) and tool_calling; Codestral scores 5 on structured_output to Haiku's 4 and ties 5–5 on tool_calling. SWE-bench Verified (Epoch AI) is present in the dataset but neither model has a score, so this verdict rests on our internal test items and supporting proxies. Codestral also has a much lower output cost ($0.90 vs $5.00 per MTok), making it the practical choice when schema fidelity and price matter. Claude Haiku 4.5 remains stronger at strategic analysis (5 vs 2), creative problem solving (4 vs 2), and agentic planning (5 vs 4) in our tests, so it is preferable for complex algorithm design, multi-step decomposition, or when stricter safety calibration (2 vs 1) is needed.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Codestral 2508

Overall
3.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
4/5
Tool Calling
5/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.300/MTok

Output

$0.900/MTok

Context Window: 256K


Task Analysis

What Coding demands: correctness of generated code, strict adherence to machine-readable formats (structured_output), correct function/tool selection and argument sequencing (tool_calling), faithfulness to given code and tests, and long-context recall for large codebases. Our task tests (structured_output and tool_calling) are the primary signals for Coding here. SWE-bench Verified (Epoch AI) is listed in the dataset but provides no scores for these two models, so we rely on our internal benchmarks.

In our testing: Codestral 2508 scores 5/5 on structured_output and 5/5 on tool_calling (tied for 1st), while Claude Haiku 4.5 scores 4/5 on structured_output and 5/5 on tool_calling.

Supporting metrics: both models score 5/5 on faithfulness and long_context, but Haiku outperforms on strategic_analysis (5 vs 2), creative_problem_solving (4 vs 2), agentic_planning (5 vs 4), and persona_consistency (5 vs 3). Safety_calibration also favors Haiku (2 vs 1).

Cost and context: Codestral has a larger context window (256K vs 200K) and much lower costs ($0.30 vs $1.00 input, $0.90 vs $5.00 output per MTok), which matters for high-volume code generation and CI automation.
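To make the pricing gap concrete, here is a quick back-of-the-envelope sketch. The workload numbers (runs per month, tokens per run) are hypothetical illustrations, not figures from our benchmarks; only the per-MTok prices come from the cards above.

```python
# Hypothetical monthly CI workload: 2,000 runs, each consuming
# ~3K input tokens and producing ~1K output tokens (illustrative).
RUNS = 2_000
IN_TOK = 3_000
OUT_TOK = 1_000

def monthly_cost(in_per_mtok: float, out_per_mtok: float) -> float:
    """Total dollar cost for the workload above at the given rates."""
    total_in = RUNS * IN_TOK / 1_000_000    # input volume in MTok
    total_out = RUNS * OUT_TOK / 1_000_000  # output volume in MTok
    return total_in * in_per_mtok + total_out * out_per_mtok

haiku = monthly_cost(1.00, 5.00)      # Claude Haiku 4.5: $1.00 in / $5.00 out
codestral = monthly_cost(0.30, 0.90)  # Codestral 2508: $0.30 in / $0.90 out
print(f"Haiku: ${haiku:.2f}/mo, Codestral: ${codestral:.2f}/mo")
# Haiku: $16.00/mo, Codestral: $3.60/mo
```

At this (made-up) volume the gap is roughly 4.4×, and it scales linearly with run count, which is why output price dominates for generation-heavy pipelines.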

Practical Examples

When Codestral 2508 shines (based on scores and costs):

  • Generating CI test reports or machine-validated JSON outputs: structured_output 5 vs 4 means fewer schema fixes and faster pipeline integration. In our testing, Codestral produced fully compliant structures more often than Haiku.
  • High-volume code churn (auto-fix, FIM, test generation): identical tool_calling 5/5 but much lower output cost ($0.90 vs $5.00 per MTok) makes Codestral far cheaper for repeated runs.
  • Fill-in-the-middle and deterministic format-sensitive tasks: Codestral's 5/5 structured_output score supports tighter format adherence.

When Claude Haiku 4.5 shines:
  • Complex algorithm design, tradeoff reasoning, and multi-step refactors: strategic_analysis 5 vs 2 and agentic_planning 5 vs 4 indicate Haiku gives stronger decomposition and rationale in our tests.
  • Creative or non-obvious solutions and code recommendations: creative_problem_solving 4 vs 2 means Haiku is better at proposing feasible, less obvious approaches in our suite.
  • Safety-sensitive environments or stricter refusal behavior: safety_calibration 2 vs 1 favors Haiku in our testing.

Common ground:
  • Both score 5/5 on tool_calling and long_context, so both are reliable at selecting and sequencing function calls and handling large contexts in our tests.
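The schema-fidelity gap matters most where output is machine-validated. Below is a minimal sketch of the kind of check a CI pipeline might run on a model-generated report; the schema and field names are illustrative assumptions, not part of our test suite, and it uses only the Python standard library.

```python
import json

# Illustrative schema: a CI test report the model is asked to emit.
REQUIRED = {"suite": str, "passed": int, "failed": int, "cases": list}

def validate_report(raw: str) -> list[str]:
    """Return a list of schema violations; an empty list means compliant."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for key, typ in REQUIRED.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], typ):
            errors.append(f"wrong type for {key}: expected {typ.__name__}")
    return errors

ok = '{"suite": "unit", "passed": 12, "failed": 0, "cases": []}'
bad = '{"suite": "unit", "passed": "12"}'
print(validate_report(ok))   # []
print(validate_report(bad))  # wrong type for passed, plus two missing keys
```

A model with higher structured-output fidelity means fewer non-empty results from a gate like this, and therefore fewer retries or manual fixes per pipeline run.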

Bottom Line

For Coding, choose Claude Haiku 4.5 if you need stronger strategic analysis, creative problem solving, multi-step plan decomposition, or slightly better safety calibration (Haiku: strategic_analysis 5, creative_problem_solving 4, safety_calibration 2). Choose Codestral 2508 if you prioritize schema/format fidelity and cost efficiency (Codestral: structured_output 5 vs 4, tool_calling tied 5–5, output cost $0.90 vs $5.00 per MTok), or if you run large-scale automated code generation and CI pipelines.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

For coding tasks, we supplement our benchmark suite with SWE-bench scores from Epoch AI, an independent research organization.

Frequently Asked Questions