Claude Haiku 4.5 vs Devstral Medium for Tool Calling

Winner: Claude Haiku 4.5. In our testing on the Tool Calling task, Claude Haiku 4.5 scored 5 to Devstral Medium's 3 (a two-point advantage) and ranks 1st of 52 models on this task versus Devstral's 46th. Haiku's higher agentic_planning (5 vs 4), faithfulness (5 vs 4), long_context (5 vs 4), and tool_calling (5 vs 3) scores explain its superior function selection, argument accuracy, and sequencing in multi-step workflows. No external benchmark is available for this task, so the verdict rests on our internal 12-test suite results.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K


Task Analysis

What Tool Calling demands: accurate function selection, precise argument formatting, correct sequencing of multi-step calls, and robustness across long interactions. The primary capabilities that matter are structured_output (JSON/schema adherence), agentic_planning (decomposing goals and ordering calls), faithfulness (avoiding hallucinated arguments), and long_context (keeping state across many calls). Both models support tool-use parameters and structured outputs in our data. Claude Haiku 4.5 outperforms Devstral Medium on the task score (5 vs 3) and on the related proxies: agentic_planning 5 vs 4, faithfulness 5 vs 4, and long_context 5 vs 4. These gaps explain why Haiku is better at chaining calls, filling exact argument schemas, and recovering from intermediate failures. Structured_output is tied at 4 for both models, so pure JSON formatting is not the differentiator; the advantage comes from Haiku's stronger planning, faithfulness, and larger context window (200,000 vs 131,072 tokens).
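The failure modes this task probes can be sketched in code. The following is a minimal, hypothetical harness, not any vendor's API: tool names and schemas are invented for illustration. It checks the two things our scoring emphasizes, argument accuracy (no missing or hallucinated fields) and sequencing (invalid calls are flagged rather than silently executed).

```python
# Hypothetical tool-calling harness illustrating what the task measures:
# function selection, argument validation against a schema, and sequencing.
# Tool names and schemas are invented for this sketch.

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of problems: missing required fields or unknown fields."""
    problems = []
    for field in schema.get("required", []):
        if field not in args:
            problems.append(f"missing required field: {field}")
    for field in args:
        if field not in schema.get("properties", {}):
            problems.append(f"hallucinated field: {field}")
    return problems

# Toy tool registry (assumption: JSON-Schema-like required/properties keys).
TOOLS = {
    "get_user": {"required": ["user_id"], "properties": {"user_id": {}}},
    "send_email": {
        "required": ["to", "body"],
        "properties": {"to": {}, "body": {}, "subject": {}},
    },
}

def run_plan(plan: list[tuple[str, dict]]) -> list[str]:
    """Check a sequence of (tool, args) calls, reporting invalid ones."""
    results = []
    for tool, args in plan:
        if tool not in TOOLS:
            results.append(f"{tool}: unknown tool")
            continue
        problems = validate_args(TOOLS[tool], args)
        results.append(f"{tool}: " + ("ok" if not problems else "; ".join(problems)))
    return results
```

For example, `run_plan([("get_user", {"user_id": "42"}), ("send_email", {"to": "a@b.c"})])` passes the first call but flags the second for the missing `body` field, exactly the kind of argument error that separates a 5 from a 3 on this task.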

Practical Examples

Where Claude Haiku 4.5 shines (based on score gaps):

  • Multi-step API orchestration: Haiku's agentic_planning 5 and long_context 5 help reliably sequence dozens of dependent calls and preserve argument state across a 200k-token context. (Tool Calling 5 vs 3.)
  • Precise argument generation for complex schemas: Haiku's faithfulness 5 reduces malformed or invented fields when filling API inputs even when schemas are nested.
  • Recovery and branching logic: Haiku better decomposes failures and issues corrective tool calls thanks to its higher agentic_planning and tool_calling scores.

Where Devstral Medium is appropriate (grounded in scores and cost):

  • Simple, high-throughput tool invocations: Devstral scores a 3 on Tool Calling and matches Haiku on structured_output (4), making it acceptable for straightforward single-step calls and JSON output.
  • Cost-sensitive batch jobs: Devstral Medium is cheaper per MTok ($0.40 input, $2.00 output) than Haiku ($1.00 input, $5.00 output), so it reduces spend for large-volume, low-complexity tool calls.
  • Smaller-context agentic workflows: Devstral’s 131,072 token window and agentic_planning 4 support moderate chaining where extreme context depth or complex recovery is not required.
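The cost gap in the bullets above is easy to make concrete. The per-MTok prices below come from the pricing tables on this page; the per-call token counts (800 input, 150 output for a simple single-step tool call) are illustrative assumptions.

```python
# Back-of-envelope cost comparison using the listed per-MTok prices.
# Per-call token counts are assumptions, not measured values.

PRICES = {  # (input $/MTok, output $/MTok) from the pricing tables above
    "claude-haiku-4.5": (1.00, 5.00),
    "devstral-medium": (0.40, 2.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Cost of one million simple tool calls (800 in / 150 out tokens each):
haiku = cost_usd("claude-haiku-4.5", 800, 150) * 1_000_000     # $1,550
devstral = cost_usd("devstral-medium", 800, 150) * 1_000_000   # $620
```

Under these assumptions Devstral Medium runs the same workload for roughly 40% of Haiku's cost, which is why it remains the sensible pick when the flow is single-step and a 3/5 tool-calling score is acceptable.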

Bottom Line

For Tool Calling, choose Claude Haiku 4.5 if you need reliable multi-step orchestration, high faithfulness in argument generation, and long-context state (Haiku scored 5 vs Devstral's 3 in our tests). Choose Devstral Medium if you prioritize lower per-token cost ($0.40 input / $2.00 output vs Haiku's $1.00 / $5.00 per MTok) and your tool flows are single-step or modestly complex with smaller context needs.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions