Claude Haiku 4.5 vs Devstral Small 1.1 for Tool Calling
Winner: Claude Haiku 4.5. In our 12-test suite, Claude Haiku 4.5 scores 5 on the tool_calling test versus Devstral Small 1.1's 4. External benchmarks are not available for this task, so this verdict rests on our internal task score and supporting proxies. Claude Haiku 4.5 is tied for 1st on tool_calling in our rankings (with 16 other models) and shows stronger agentic_planning (5 vs 2), faithfulness (5 vs 4), and long_context (5 vs 4), which together indicate more reliable function selection, argument accuracy, and call sequencing. Devstral Small 1.1 is substantially cheaper (input/output cost per MTok: $0.10/$0.30 vs Claude Haiku 4.5's $1.00/$5.00) and remains a solid 4/5 performer if cost is the primary constraint.
Claude Haiku 4.5 (Anthropic)
Pricing: $1.00/MTok input, $5.00/MTok output

Devstral Small 1.1 (Mistral)
Pricing: $0.100/MTok input, $0.300/MTok output
Task Analysis
What Tool Calling demands: accurate function selection, precise argument construction, correct sequencing (ordering of calls), and schema-compliant outputs. Our benchmark defines tool_calling as "function selection, argument accuracy, sequencing." No external benchmark data is available for this comparison, so the primary signal is our internal task score: Claude Haiku 4.5 = 5, Devstral Small 1.1 = 4. Supporting internal metrics explain the gap: structured_output is equal (4 vs 4), so both handle schema compliance similarly, while agentic_planning (5 vs 2) and long_context (5 vs 4) are substantially stronger for Claude Haiku 4.5, which supports multi-step orchestration and context-aware argument assembly. Both models support the tool_choice, tools, and structured_outputs parameters, so either can be integrated into tool-calling pipelines. Safety calibration is equal (2 vs 2), so refusal/permit behavior is similar in our tests. Context window and modality also differ: Claude Haiku 4.5 has a 200k-token window and supports text+image->text, while Devstral Small 1.1 has a 131k window and text->text only, both advantages that favor Claude Haiku 4.5 for longer or multimodal tool workflows.
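To make "argument accuracy" and "schema-compliant outputs" concrete, here is a minimal sketch of how a tool-calling pipeline might check a model's proposed arguments against a JSON-schema-style tool definition before executing the call. The tool name, fields, and validator below are illustrative assumptions, not the API of either model.

```python
# Hypothetical tool definition in the JSON-schema style commonly used
# for tool calling; the name and fields are illustrative.
GET_WEATHER = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Minimal schema check: required keys present, types match, enums respected."""
    errors = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    type_map = {"string": str, "number": (int, float), "integer": int,
                "boolean": bool, "object": dict, "array": list}
    for key, value in args.items():
        spec = props.get(key)
        if spec is None:
            errors.append(f"unexpected argument: {key}")
            continue
        expected = type_map.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{key}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: must be one of {spec['enum']}")
    return errors
```

A model that scores well on tool_calling should rarely produce arguments that fail a check like this; a validation layer of this kind is still a sensible guardrail with either model.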
Practical Examples
- High-complexity orchestration: Use Claude Haiku 4.5 when you need multi-step API orchestration with argument dependencies and recovery logic. Evidence: tool_calling 5 vs 4, agentic_planning 5 vs 2, long_context 5 vs 4. Example: chaining database queries, external API calls, and conditional retries where sequencing and state matter.
- Cost tradeoff for high throughput: Choose Devstral Small 1.1 when you need acceptable tool-calling accuracy at low cost. Evidence: tool_calling 4 vs 5 and much lower token costs (input $0.10 and output $0.30 per MTok vs Claude Haiku 4.5's $1.00 and $5.00). Example: high-volume webhooks or simple single-call tool selection where budget dominates.
- Schema-heavy integrations: Both models score structured_output = 4, so either can produce JSON/schema-compliant tool arguments reliably in our tests.
- Multimodal / long-context pipelines: Prefer Claude Haiku 4.5 for workflows that include large context windows or image-derived inputs (200k context and text+image->text vs Devstral's 131k and text->text).
- Safety-sensitive gating: Both have safety_calibration = 2 in our tests, so expect similar refusal behavior and add external guardrails if required.
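The "chaining with conditional retries" pattern in the first example can be sketched in a few lines. This is a generic orchestration skeleton under our own naming, not a real framework API: each step receives the results of earlier steps (so sequencing and argument dependencies are explicit) and failed steps are retried a bounded number of times.

```python
import time

def run_tool_chain(steps, max_retries=2):
    """Execute dependent tool calls in order, retrying failed steps.

    `steps` is a list of (name, fn) pairs; each fn receives the dict of
    results from earlier steps, so later calls can depend on earlier
    output. Names here are illustrative, not a real orchestration API.
    """
    results = {}
    for name, fn in steps:
        for attempt in range(max_retries + 1):
            try:
                results[name] = fn(results)
                break
            except Exception:
                if attempt == max_retries:
                    raise
                time.sleep(0)  # placeholder backoff; tune for real workloads
    return results

# Example: a second call consuming the first call's output.
steps = [
    ("query", lambda r: {"user_id": 42}),
    ("enrich", lambda r: {"id": r["query"]["user_id"], "tier": "pro"}),
]
```

In a real pipeline, each `fn` would wrap one model-proposed tool call; the stronger a model's agentic_planning, the less recovery logic a loop like this has to absorb.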
Bottom Line
For Tool Calling, choose Claude Haiku 4.5 if you need the highest reliability in function selection, sequencing, and multi-step orchestration (our test: 5 vs 4). Choose Devstral Small 1.1 if you need good tool-calling accuracy at a much lower cost (input/output per MTok: $0.10/$0.30 vs Claude Haiku 4.5's $1.00/$5.00).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.