Claude Haiku 4.5 vs Devstral Small 1.1 for Tool Calling

Winner: Claude Haiku 4.5. In our 12-test suite, Claude Haiku 4.5 scores 5/5 on the tool_calling test versus Devstral Small 1.1's 4/5. External benchmarks are not available for this task, so this verdict rests on our internal task score and supporting proxies. Claude Haiku 4.5 is tied for 1st on tool_calling in our rankings (with 16 other models) and shows stronger agentic_planning (5 vs 2), faithfulness (5 vs 4), and long_context (5 vs 4), which together indicate more reliable function selection, argument accuracy, and call sequencing. Devstral Small 1.1 is substantially cheaper ($0.10/$0.30 per MTok input/output vs Claude Haiku 4.5's $1.00/$5.00) and remains a solid 4/5 performer if cost is the primary constraint.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K


Task Analysis

What Tool Calling demands: accurate function selection, precise argument construction, correct sequencing (ordering of calls), and schema-compliant outputs. Our benchmark defines tool_calling as "function selection, argument accuracy, sequencing." No external benchmark data is available for this comparison, so the primary signal is our internal task score: Claude Haiku 4.5 = 5, Devstral Small 1.1 = 4. Supporting internal metrics explain the gap: structured_output is equal (4 vs 4), so both handle schema compliance similarly, while agentic_planning and long_context are substantially stronger for Claude Haiku 4.5 (5 vs 2 and 5 vs 4, respectively), which supports multi-step orchestration and context-aware argument assembly. Both models support the tool_choice, tools, and structured_outputs parameters, so either can be integrated into tool-calling pipelines. Safety calibration is equal (2 vs 2), so refusal/permit behavior is similar in our tests. Context window and modality differ as well: Claude Haiku 4.5 offers a 200K-token window and text+image→text, while Devstral Small 1.1 offers a 131K window and text→text only, advantages that favor Claude Haiku 4.5 for longer or multimodal tool workflows.
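The three failure axes the benchmark names (function selection, argument accuracy, sequencing) can be made concrete with a minimal, provider-agnostic dispatch sketch. The tool name and schema below are hypothetical illustrations; a real pipeline would pass equivalent definitions through the tools / tool_choice parameters both models support.

```python
import json

# Hypothetical tool registry: name -> (JSON-Schema-style spec, implementation).
TOOLS = {
    "get_weather": (
        {"required": ["city"], "properties": {"city": {"type": "string"}}},
        lambda args: f"Sunny in {args['city']}",
    ),
}

def dispatch(tool_call: str) -> str:
    """Validate a model-emitted tool call and execute it.

    `tool_call` is a JSON string like {"name": ..., "arguments": {...}},
    the shape most tool-calling APIs return in some form.
    """
    call = json.loads(tool_call)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:                       # function selection
        raise ValueError(f"unknown tool: {name}")
    schema, impl = TOOLS[name]
    missing = [k for k in schema["required"] if k not in args]
    if missing:                                 # argument accuracy
        raise ValueError(f"missing arguments: {missing}")
    return impl(args)

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
```

A harness like this is where the score differences surface: a 5/5 model rarely trips either ValueError branch, while a 4/5 model occasionally picks the wrong tool or drops a required argument.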

Practical Examples

High-complexity orchestration: Use Claude Haiku 4.5 when you need multi-step API orchestration with argument dependencies and recovery logic. Evidence: tool_calling 5 vs 4, agentic_planning 5 vs 2, long_context 5 vs 4. Example: chaining database queries, external API calls, and conditional retries where sequencing and state matter.

Cost tradeoff for high throughput: Choose Devstral Small 1.1 when you need acceptable tool-calling accuracy at low cost. Evidence: tool_calling 4 vs 5 and much lower token costs ($0.10 input / $0.30 output per MTok vs Claude Haiku 4.5's $1.00 / $5.00). Example: high-volume webhooks or simple single-call tool selection where budget dominates.

Schema-heavy integrations: Both models score structured_output = 4, so either can produce JSON/schema-compliant tool arguments reliably in our tests.

Multimodal / long-context pipelines: Prefer Claude Haiku 4.5 for workflows that involve large contexts or image-derived inputs (200K context and text+image→text vs Devstral's 131K and text→text).

Safety-sensitive gating: Both have safety_calibration = 2 in our tests, so expect similar refusal behavior and add external guardrails if required.
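The orchestration pattern in the first scenario (chained calls with argument dependencies and conditional retries) can be sketched as a small sequential loop. The step names and retry policy here are hypothetical illustrations, not part of either model's API.

```python
import time

def run_pipeline(steps, max_retries=2):
    """Run dependent tool calls in order, retrying transient failures.

    `steps` is a list of (name, fn) pairs; each fn receives the dict of
    results produced so far, so later calls can depend on earlier ones.
    """
    results = {}
    for name, fn in steps:
        for attempt in range(max_retries + 1):
            try:
                results[name] = fn(results)
                break
            except RuntimeError:                # transient tool failure
                if attempt == max_retries:
                    raise
                time.sleep(0)                   # backoff placeholder
    return results

# Hypothetical three-step workflow: query -> enrich -> notify.
steps = [
    ("query",  lambda r: {"user_id": 42}),
    ("enrich", lambda r: {"user_id": r["query"]["user_id"], "tier": "pro"}),
    ("notify", lambda r: f"notified user {r['enrich']['user_id']}"),
]
print(run_pipeline(steps)["notify"])  # -> notified user 42
```

The agentic_planning gap (5 vs 2) matters precisely in workflows shaped like this: the model must emit the right call order and thread earlier results into later arguments.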

Bottom Line

For Tool Calling, choose Claude Haiku 4.5 if you need the highest reliability in function selection, sequencing, and multi-step orchestration (our test: 5 vs 4). Choose Devstral Small 1.1 if you need good tool-calling accuracy at a much lower cost ($0.10/$0.30 per MTok input/output vs Claude Haiku 4.5's $1.00/$5.00).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions