Claude Haiku 4.5 vs Claude Sonnet 4.6 for Tool Calling
Winner: Claude Sonnet 4.6. Both models score 5/5 on our Tool Calling test (tied in our 12-test suite), but Sonnet 4.6 shows materially stronger safety calibration (5 vs 2) and higher creative problem-solving (5 vs 4) in our testing. Sonnet also carries an external coding datapoint, 75.2% on SWE-bench Verified (Epoch AI), which supports its reliability for correct function selection and sequencing. Claude Haiku 4.5 matches Sonnet on raw tool-calling correctness in our tests but is substantially cheaper ($1/$5 vs Sonnet's $3/$15 per MTok input/output), making Haiku the cost-efficient choice when safety and complex sequencing are less critical.
Anthropic
Claude Haiku 4.5
Pricing
Input
$1.00/MTok
Output
$5.00/MTok
modelpicker.net
Anthropic
Claude Sonnet 4.6
Pricing
Input
$3.00/MTok
Output
$15.00/MTok
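The pricing gap compounds at volume. A minimal sketch of estimating monthly spend for a high-volume tool-calling pipeline (prices taken from the tables above; the workload figures are hypothetical):

```python
# Per-MTok prices (USD) from the pricing tables above.
PRICES = {
    "claude-haiku-4.5": {"input": 1.00, "output": 5.00},
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly USD cost for a given token volume."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 500M input / 100M output tokens per month.
haiku = monthly_cost("claude-haiku-4.5", 500_000_000, 100_000_000)
sonnet = monthly_cost("claude-sonnet-4.6", 500_000_000, 100_000_000)
print(f"Haiku:  ${haiku:,.0f}/mo")   # Haiku:  $1,000/mo
print(f"Sonnet: ${sonnet:,.0f}/mo")  # Sonnet: $3,000/mo
```

At these list prices, Sonnet costs exactly 3x Haiku on both input and output, so the ratio holds for any input/output mix.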
Task Analysis
What Tool Calling demands: accurate function selection, precise argument construction, correct sequencing of multi-step calls, and safe refusal/guardrails when a tool request is unsafe. The primary capabilities that matter are structured output adherence (to populate arguments), agentic planning and long context for multi-step orchestration, faithfulness to source data, and safety calibration to avoid unsafe tool execution.

In our testing, both Claude Haiku 4.5 and Claude Sonnet 4.6 score 5/5 on the tool_calling benchmark (tied). Our supporting proxies distinguish them: Sonnet has safety_calibration 5 (Haiku 2) and creative_problem_solving 5 (Haiku 4), while both score structured_output 4, agentic_planning 5, and faithfulness 5. Sonnet also reports 75.2% on SWE-bench Verified (Epoch AI), a third-party signal relevant to reliable code and tool usage; Haiku has no SWE-bench entry in our data.

Operational differences also affect tool calling: Sonnet's context window is 1,000,000 tokens vs Haiku's 200,000, and Sonnet's supported parameters include an extra 'verbosity' control, while Haiku emphasizes efficiency and lower pricing ($1 vs $3 input, $5 vs $15 output per MTok).
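The "precise argument construction" demand above can be made concrete with a guardrail check. A sketch, assuming a hypothetical `get_account_balance` tool written in the JSON-Schema style that the Anthropic Messages API accepts via its `tools` parameter (the tool name, fields, and validator are illustrative, not an official API):

```python
# Hypothetical tool definition in the JSON-Schema style used by the
# Anthropic Messages API `tools` parameter.
GET_BALANCE_TOOL = {
    "name": "get_account_balance",
    "description": "Look up the current balance for a customer account.",
    "input_schema": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string", "description": "Internal account ID."},
            "currency": {"type": "string", "description": "ISO 4217 code, e.g. USD."},
        },
        "required": ["account_id"],
    },
}

def validate_tool_call(tool: dict, arguments: dict) -> list[str]:
    """Return validation errors for a model-proposed tool call.

    Rejects calls with missing required fields, unknown fields, or
    mistyped string arguments before any live service is invoked.
    """
    errors = []
    schema = tool["input_schema"]
    for field in schema.get("required", []):
        if field not in arguments:
            errors.append(f"missing required argument: {field}")
    for field, value in arguments.items():
        spec = schema["properties"].get(field)
        if spec is None:
            errors.append(f"unexpected argument: {field}")
        elif spec["type"] == "string" and not isinstance(value, str):
            errors.append(f"argument {field} must be a string")
    return errors

print(validate_tool_call(GET_BALANCE_TOOL, {"account_id": "acct_123"}))  # []
print(validate_tool_call(GET_BALANCE_TOOL, {"currency": 1}))  # two errors
```

Validating arguments in the harness, rather than trusting the model, matters most in the low-safety-calibration case; it is cheap insurance regardless of which model you pick.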
Practical Examples
1. Safe production agent invoking live services (banking, admin APIs): Sonnet 4.6. Higher safety_calibration (5 vs 2) in our tests reduces risky or permissive tool calls; prefer Sonnet when refusals, policy checks, or conservative gating matter.
2. Complex multi-step orchestration across a large codebase or long session: Sonnet 4.6. A larger context window (1,000,000 vs 200,000 tokens) and equal agentic_planning (5) help sequence many tool calls and keep state.
3. Cost-sensitive bulk tool invocation (high-volume calls with simple arguments): Claude Haiku 4.5. It matches Sonnet on tool-calling accuracy in our testing (both 5/5) at much lower pricing ($1/$5 vs $3/$15 per MTok input/output), so Haiku reduces operating cost for straightforward pipelines.
4. Edge cases requiring creative argument synthesis: Sonnet 4.6. Creative_problem_solving 5 vs Haiku's 4 suggests Sonnet will better infer non-obvious argument structures or plan alternate sequences when inputs are ambiguous.
5. Third-party coding/tool reliability signal: Sonnet 4.6 reports 75.2% on SWE-bench Verified (Epoch AI), a supplementary datapoint indicating stronger performance on real-world code issue resolution relevant to tool-calling agents.
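These decision rules can be sketched as a simple router. The model identifier strings and the routing thresholds below are illustrative assumptions (check Anthropic's documentation for the exact production model IDs):

```python
# Illustrative model identifiers -- verify the exact strings against
# Anthropic's docs before using them in production.
HAIKU = "claude-haiku-4-5"
SONNET = "claude-sonnet-4-6"

def pick_model(touches_live_services: bool, ambiguous_inputs: bool,
               context_tokens: int) -> str:
    """Route a tool-calling task per the practical examples above."""
    # Example 1: safety-sensitive live services favor Sonnet's calibration.
    if touches_live_services:
        return SONNET
    # Example 2: sessions beyond Haiku's 200K-token window need Sonnet.
    if context_tokens > 200_000:
        return SONNET
    # Example 4: ambiguous inputs benefit from stronger problem solving.
    if ambiguous_inputs:
        return SONNET
    # Example 3: straightforward high-volume pipelines go to cheaper Haiku.
    return HAIKU

print(pick_model(False, False, 50_000))  # claude-haiku-4-5
print(pick_model(True, False, 10_000))   # claude-sonnet-4-6
```

The ordering encodes a conservative bias: any single risk or complexity signal escalates to Sonnet, and Haiku is the default only when all checks pass.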
Bottom Line
For Tool Calling, choose Claude Haiku 4.5 if you need the same tool_calling accuracy at much lower cost for straightforward, high-volume tool invocations. Choose Claude Sonnet 4.6 if you need stronger safety calibration, better creative problem solving, larger context for long multi-step orchestration, or the additional SWE-bench (Epoch AI) signal.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.