Claude Haiku 4.5 vs Devstral Medium for Tool Calling

Winner: Claude Haiku 4.5. In our testing on the Tool Calling task, Claude Haiku 4.5 scored 5 to Devstral Medium's 3 (a two-point advantage) and ranks 1st of 52 models on this task versus Devstral's 46th. Haiku's higher agentic_planning (5 vs 4), faithfulness (5 vs 4), long_context (5 vs 4), and tool_calling (5 vs 3) scores explain its superior function selection, argument accuracy, and sequencing in multi-step workflows. No external benchmark is available for this task, so the verdict rests on our internal 12-test suite results.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K


Task Analysis

What Tool Calling demands: accurate function selection, precise argument formatting, correct sequencing of multi-step calls, and robustness across long interactions. The primary capabilities that matter are structured_output (JSON/schema adherence), agentic_planning (decomposing goals and ordering calls), faithfulness (avoiding hallucinated arguments), and long_context (keeping state across many calls). Both models support tool-use parameters and structured outputs in our data. Claude Haiku 4.5 outperforms Devstral Medium on the task score (5 vs 3) and on the related proxies: agentic_planning 5 vs 4, faithfulness 5 vs 4, and long_context 5 vs 4. These gaps explain why Haiku is better at chaining calls, filling exact argument schemas, and recovering from intermediate failures. Structured_output is tied at 4 for both models, so pure JSON formatting is not the differentiator; the advantage comes from Haiku's stronger planning, faithfulness, and larger context window (200,000 vs 131,072 tokens).
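The failure modes this task probes can be sketched in code. The following is a minimal, hypothetical harness, not any vendor's API: tool names and schemas are invented for illustration. It checks the two things our scoring emphasizes, argument accuracy (no missing or hallucinated fields) and sequencing (invalid calls are flagged rather than silently executed).

```python
# Hypothetical tool-calling harness illustrating what the task measures:
# function selection, argument validation against a schema, and sequencing.
# Tool names and schemas are invented for this sketch.

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of problems: missing required fields or unknown fields."""
    problems = []
    for field in schema.get("required", []):
        if field not in args:
            problems.append(f"missing required field: {field}")
    for field in args:
        if field not in schema.get("properties", {}):
            problems.append(f"hallucinated field: {field}")
    return problems

# Toy tool registry (assumption: JSON-Schema-like required/properties keys).
TOOLS = {
    "get_user": {"required": ["user_id"], "properties": {"user_id": {}}},
    "send_email": {
        "required": ["to", "body"],
        "properties": {"to": {}, "body": {}, "subject": {}},
    },
}

def run_plan(plan: list[tuple[str, dict]]) -> list[str]:
    """Check a sequence of (tool, args) calls, reporting invalid ones."""
    results = []
    for tool, args in plan:
        if tool not in TOOLS:
            results.append(f"{tool}: unknown tool")
            continue
        problems = validate_args(TOOLS[tool], args)
        results.append(f"{tool}: " + ("ok" if not problems else "; ".join(problems)))
    return results
```

For example, `run_plan([("get_user", {"user_id": "42"}), ("send_email", {"to": "a@b.c"})])` passes the first call but flags the second for the missing `body` field, exactly the kind of argument error that separates a 5 from a 3 on this task.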

Practical Examples

Where Claude Haiku 4.5 shines (based on score gaps):

  • Multi-step API orchestration: Haiku's agentic_planning 5 and long_context 5 help reliably sequence dozens of dependent calls and preserve argument state across a 200k-token context. (Tool Calling 5 vs 3.)
  • Precise argument generation for complex schemas: Haiku's faithfulness 5 reduces malformed or invented fields when filling API inputs even when schemas are nested.
  • Recovery and branching logic: Haiku better decomposes failures and issues corrective tool calls thanks to its higher agentic_planning and tool_calling scores.

Where Devstral Medium is appropriate (grounded in scores and cost):

  • Simple, high-throughput tool invocations: Devstral scores a 3 on Tool Calling and matches Haiku on structured_output (4), making it acceptable for straightforward single-step calls and JSON output.
  • Cost-sensitive batch jobs: Devstral Medium is cheaper per MTok ($0.40 input, $2.00 output) than Haiku ($1.00 input, $5.00 output), so it reduces spend for large-volume, low-complexity tool calls.
  • Smaller-context agentic workflows: Devstral’s 131,072 token window and agentic_planning 4 support moderate chaining where extreme context depth or complex recovery is not required.
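The cost gap in the bullets above is easy to make concrete. The per-MTok prices below come from the pricing tables on this page; the per-call token counts (800 input, 150 output for a simple single-step tool call) are illustrative assumptions.

```python
# Back-of-envelope cost comparison using the listed per-MTok prices.
# Per-call token counts are assumptions, not measured values.

PRICES = {  # (input $/MTok, output $/MTok) from the pricing tables above
    "claude-haiku-4.5": (1.00, 5.00),
    "devstral-medium": (0.40, 2.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Cost of one million simple tool calls (800 in / 150 out tokens each):
haiku = cost_usd("claude-haiku-4.5", 800, 150) * 1_000_000     # $1,550
devstral = cost_usd("devstral-medium", 800, 150) * 1_000_000   # $620
```

Under these assumptions Devstral Medium runs the same workload for roughly 40% of Haiku's cost, which is why it remains the sensible pick when the flow is single-step and a 3/5 tool-calling score is acceptable.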

Bottom Line

For Tool Calling, choose Claude Haiku 4.5 if you need reliable multi-step orchestration, high faithfulness in argument generation, and long-context state (Haiku scored 5 vs Devstral's 3 in our tests). Choose Devstral Medium if you prioritize lower per-token cost ($0.40 input / $2.00 output vs Haiku's $1.00 / $5.00 per MTok) and your tool flows are single-step or modestly complex with smaller context needs.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions