Claude Sonnet 4.6 vs GPT-5.4 for Tool Calling
Winner: Claude Sonnet 4.6. In our testing Claude Sonnet 4.6 scores 5/5 on Tool Calling versus GPT-5.4’s 4/5, and ranks 1 of 52 versus GPT-5.4’s 18 of 52. Sonnet 4.6 is decisively better at function selection, argument accuracy, and sequencing, the core behaviors our tool_calling test measures. GPT-5.4 is not far behind and wins on structured_output (5 vs 4), which benefits strict JSON/schema compliance, but that advantage does not outweigh Sonnet’s lead on end-to-end tool-calling tasks.
Pricing
- Claude Sonnet 4.6 (Anthropic): input $3.00/MTok, output $15.00/MTok
- GPT-5.4 (OpenAI): input $2.50/MTok, output $15.00/MTok
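The per-MTok prices translate into per-request costs as a simple weighted sum. Below is a minimal sketch; the token counts are hypothetical and the dictionary keys are labels chosen just for this example.

```python
# Back-of-envelope per-request cost from the per-MTok prices above.
# Token counts below are hypothetical; substitute your own usage numbers.
PRICES_USD_PER_MTOK = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "gpt-5.4": {"input": 2.50, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request: tokens scaled to millions, times price."""
    p = PRICES_USD_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 20k-token prompt producing a 2k-token completion.
for model in PRICES_USD_PER_MTOK:
    print(f"{model}: ${request_cost(model, 20_000, 2_000):.4f}")
# claude-sonnet-4.6: $0.0900
# gpt-5.4: $0.0800
```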
Task Analysis
What Tool Calling demands: selecting the correct tool, producing accurate arguments and types, ordering calls correctly, and recovering from intermediate failures. The capabilities that matter are high tool_calling performance, reliable structured_output/JSON compliance, robust sequencing and agentic planning, clear reasoning about API arguments, and stable refusal/safety behavior when tools should not be invoked.

In our testing Sonnet 4.6 scored 5 on tool_calling (rank 1 of 52) while GPT-5.4 scored 4 (rank 18 of 52). On structured_output (JSON/schema compliance) the order reverses: GPT-5.4 scores 5 to Sonnet 4.6’s 4, so GPT-5.4 is stronger at strict format adherence.

Both models expose parameters relevant to tool workflows: Sonnet lists support for response_format, structured_outputs, tool_choice, and tools; GPT-5.4 supports structured_outputs, tool_choice, and tools as well. Both models have very large context windows (Sonnet: 1,000,000 tokens; GPT-5.4: 1,050,000) and large max output tokens, which helps multi-step tool orchestration.

As supplementary external data, Sonnet scores 75.2% and GPT-5.4 76.9% on SWE-bench Verified (Epoch AI), a small gap that favors GPT-5.4 on some coding-resolution tasks but does not overturn our internal tool_calling result.
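To ground the parameter list, here is a minimal sketch of a tool-calling request using the Anthropic Python SDK’s Messages API; the model ID string and the get_weather tool are assumptions made for illustration.

```python
# A minimal tool-calling request with the Anthropic Python SDK.
# The model ID and the get_weather tool are placeholders for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",  # hypothetical model ID
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    tool_choice={"type": "auto"},  # let the model decide whether to call
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)

# Tool calls come back as tool_use content blocks with parsed arguments.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```

The OpenAI side uses the same tools and tool_choice parameters on chat.completions.create, with each tool wrapped as {"type": "function", "function": {...}}.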
Practical Examples
Where Claude Sonnet 4.6 shines (based on its tool_calling score of 5 vs 4):
- Complex agent orchestration: choosing a sequence of three different APIs where argument formats vary and results feed into subsequent calls. Sonnet’s 5/5 tool_calling score indicates it more reliably selects the right functions and sequences the calls.
- Ambiguous argument resolution: the user gives partial specs and the model must infer types/units and fill missing fields. Sonnet’s higher tool_calling and agentic_planning scores reduce malformed calls.
- Failure recovery: Sonnet’s top agentic_planning (5) and tool_calling (5) scores make it more likely to detect a failed tool call and retry with corrected arguments, a loop like the one sketched below.
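The retry behavior above can be made concrete. The sketch below is a provider-agnostic orchestration loop with stub tools; geocode and forecast are hypothetical, and in a real agent the model, not a fixed loop, would propose corrected arguments between retries.

```python
# Provider-agnostic sketch of sequenced tool calls with retry on failure.
# The tool registry and both stub tools are hypothetical placeholders.
import json

def geocode(city: str) -> dict:
    return {"lat": 48.8566, "lon": 2.3522}  # stub: city -> coordinates

def forecast(lat: float, lon: float) -> dict:
    return {"temp_c": 18.0}  # stub: coordinates -> forecast

TOOLS = {"geocode": geocode, "forecast": forecast}

def run_call(name: str, args: dict, max_retries: int = 2) -> dict:
    """Execute one tool call, retrying on bad tool names or arguments."""
    last_error = "unknown error"
    for _ in range(max_retries + 1):
        try:
            return {"ok": True, "result": TOOLS[name](**args)}
        except (KeyError, TypeError) as err:
            last_error = str(err)  # surface this to the model for correction
    return {"ok": False, "error": last_error}

# Sequencing: the first call's result feeds the second call's arguments.
step1 = run_call("geocode", {"city": "Paris"})
if step1["ok"]:
    step2 = run_call("forecast", step1["result"])
    print(json.dumps(step2, indent=2))
```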
Where GPT-5.4 shines (based on structured_output 5 vs 4 and other scores):
- Strict schema output: when every tool call requires exact JSON schema compliance with no tolerance for extra keys, GPT-5.4’s structured_output 5/5 makes it the safer pick; see the sketch after this list.
- Code-heavy integrations: GPT-5.4’s slightly higher SWE-bench Verified score (76.9% vs 75.2%, Epoch AI) and much stronger AIME score (95.3% vs 85.8%) suggest strengths on precise, technical outputs, which helps when tool arguments are code-like or math-heavy.

Concrete numeric anchors: tool_calling Sonnet 5 vs GPT-5.4 4; structured_output Sonnet 4 vs GPT-5.4 5; SWE-bench Verified (Epoch AI) Sonnet 75.2% vs GPT-5.4 76.9%.
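For the strict-schema case, here is a minimal sketch using the OpenAI Python SDK’s json_schema response format; the model ID is a placeholder and the order schema is invented for illustration.

```python
# Strict JSON schema output with the OpenAI Python SDK. The model ID is a
# placeholder and the "order" schema is invented for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.4",  # hypothetical model ID
    messages=[{"role": "user", "content": "Order: two pizzas."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "order",
            "strict": True,  # enforce the schema exactly, no extra keys
            "schema": {
                "type": "object",
                "properties": {
                    "item": {"type": "string"},
                    "quantity": {"type": "integer"},
                },
                "required": ["item", "quantity"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # schema-valid JSON
```

Note that strict mode requires additionalProperties: false and every property listed under required, which is exactly the no-extra-keys guarantee described above.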
Bottom Line
For Tool Calling, choose Claude Sonnet 4.6 if you need the most reliable function selection, argument accuracy and multi-step sequencing (Sonnet 4.6: 5/5, rank 1 of 52). Choose GPT-5.4 if your integrations demand exact JSON/schema compliance or highly code-like argument formatting (GPT-5.4: structured_output 5/5; tool_calling 4/5).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.