Claude Haiku 4.5 vs Codestral 2508 for Tool Calling

Winner: Codestral 2508. In our testing both models score 5/5 on the Tool Calling task, but Codestral 2508 earns the practical edge: it scores 5 vs 4 on structured_output and has a much lower output cost ($0.90 vs $5.00/MTok). Those two differences make Codestral the better choice when you need precise argument formatting and cost-efficient production use. Claude Haiku 4.5 remains preferable for multi-step orchestration and for its stronger agentic planning and safety calibration (agentic_planning 5 vs 4; safety_calibration 2 vs 1), but on the narrow Tool Calling metric Codestral is the recommended winner.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window

200K

modelpicker.net

mistral

Codestral 2508

Overall
3.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
4/5
Tool Calling
5/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.300/MTok

Output

$0.900/MTok

Context Window

256K


Task Analysis

Tool Calling demands correct function selection, precise argument formatting, and correct sequencing of calls. The capabilities that matter most are structured output (JSON/schema compliance), tool_choice/tools support, sequencing and agentic planning, and safety calibration to avoid unsafe tool use.

In our testing both Claude Haiku 4.5 and Codestral 2508 score 5/5 on tool_calling, so the core ability to pick and invoke functions is equivalent. Supporting signals explain the differences: Codestral scores 5 on structured_output vs Claude's 4 (in our testing), which favors strict argument compliance. Claude scores higher on agentic_planning (5 vs 4) and exposes include_reasoning/reasoning parameters, which supports multi-step orchestration and explicit rationale. Safety calibration is also higher for Claude (2 vs 1), which reduces risky tool invocations. Cost and token limits matter as well: Codestral's output cost is $0.90/MTok vs Claude's $5.00/MTok, and both models expose tool_choice and tools parameters in their supported parameter lists.
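To make "structured output for tool calling" concrete, here is a minimal sketch: a tool declared with a JSON-schema input definition (the generic shape that providers' tools parameters accept), plus a check that a model's emitted call names a defined tool and satisfies the schema. The `get_weather` tool and the `is_valid_call` helper are illustrative assumptions, not part of either vendor's API.

```python
# Hypothetical tool definition in the generic JSON-schema style used by
# function-calling APIs; the tool name and fields are illustrative only.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}

def is_valid_call(call: dict, tool: dict) -> bool:
    """Check a model's tool call against one declared tool (shallow check)."""
    if call.get("name") != tool["name"]:
        return False  # wrong function selected
    args = call.get("arguments", {})
    schema = tool["input_schema"]
    if any(key not in args for key in schema.get("required", [])):
        return False  # missing required argument
    # Reject arguments the schema does not declare.
    return all(key in schema["properties"] for key in args)

print(is_valid_call({"name": "get_weather", "arguments": {"city": "Oslo"}},
                    get_weather_tool))  # True
```

A model strong on structured_output produces calls that pass a check like this on the first try; a weaker one needs retries or repair passes.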

Practical Examples

When Codestral 2508 shines:

- API orchestration that requires exact JSON arguments (webhook payloads, schema-validated RPC): structured_output 5 vs 4 means fewer formatting repairs in our tests.
- High-volume tool calling (batching API calls or automated scaffolding), where $0.90 vs $5.00/MTok output cost materially reduces spend.

When Claude Haiku 4.5 shines:

- Complex, multi-step tool workflows that need decomposition and recovery (agentic_planning 5 vs 4 in our testing), or explicit internal reasoning via the include_reasoning/reasoning parameters.
- Tooling that must be more conservative about unsafe actions: Claude's safety_calibration is 2 vs Codestral's 1 in our tests, so Claude rejects more risky calls.

Grounded in the scores: both models reliably select functions (tool_calling 5/5), but Codestral produces stricter schema-adherent arguments (structured_output 5 vs 4), while Claude is stronger at planning and safety.
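The "fewer formatting repairs" point can be made concrete. A common mitigation for models that are weaker on structured output is a parse-and-repair loop around their emitted arguments. This is a sketch of that pattern; `parse_tool_arguments` is a hypothetical helper, not a provider API, and the single repair rule shown is deliberately naive.

```python
import json

def parse_tool_arguments(raw: str, max_repairs: int = 1) -> dict:
    """Parse model-emitted tool arguments, tolerating trivial repairs."""
    for _ in range(max_repairs + 1):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Naive repair: models weak on structured output sometimes emit
            # single-quoted pseudo-JSON; swap quotes and retry.
            raw = raw.replace("'", '"')
    raise ValueError("unrecoverable tool arguments: " + raw)

parse_tool_arguments('{"city": "Oslo"}')   # clean JSON parses directly
parse_tool_arguments("{'city': 'Oslo'}")   # repaired on the second pass
```

Every repair pass is extra latency and code; a model that scores 5 on structured_output mostly skips this path.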

Bottom Line

For Tool Calling, choose Claude Haiku 4.5 if you need stronger multi-step orchestration, explicit reasoning parameters, or slightly better safety calibration (agentic_planning 5; safety_calibration 2). Choose Codestral 2508 if you prioritize strict structured outputs and cost efficiency: it scores structured_output 5 vs Claude's 4 and costs $0.90 vs $5.00/MTok for output.
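To make the cost gap tangible, a back-of-envelope calculation using the listed output prices. The model keys below are illustrative labels, not official API identifiers.

```python
# Listed output prices from the scorecards above, in USD per million tokens.
PRICES_PER_MTOK = {"claude-haiku-4.5": 5.00, "codestral-2508": 0.90}

def output_cost(model: str, output_tokens: int) -> float:
    """USD cost of generating `output_tokens` tokens with the given model."""
    return PRICES_PER_MTOK[model] * output_tokens / 1_000_000

# Example: 10M output tokens per month of tool-call traffic.
claude_cost = output_cost("claude-haiku-4.5", 10_000_000)   # $50.00
codestral_cost = output_cost("codestral-2508", 10_000_000)  # about $9.00
```

At that volume the difference is roughly $41/month per 10M output tokens, which compounds quickly for high-throughput tool-calling pipelines.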

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions