Claude Sonnet 4.6 vs GPT-5.4 for Tool Calling

Winner: Claude Sonnet 4.6. In our testing, Claude Sonnet 4.6 scores 5/5 on Tool Calling versus GPT-5.4's 4/5, and ranks 1 of 52 versus GPT-5.4's 18 of 52. Sonnet 4.6 is measurably better at function selection, argument accuracy, and call sequencing: the core behaviors our tool_calling test measures. GPT-5.4 is not far behind and wins on structured_output (5/5 vs 4/5), which benefits strict JSON/schema compliance, but that advantage does not outweigh Sonnet's lead on end-to-end tool-calling tasks.

Anthropic

Claude Sonnet 4.6

Overall
4.67/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 5/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 5/5

External Benchmarks

SWE-bench Verified: 75.2%
MATH Level 5: N/A
AIME 2025: 85.8%

Pricing

Input: $3.00/MTok
Output: $15.00/MTok
Context Window: 1,000K

modelpicker.net

OpenAI

GPT-5.4

Overall
4.58/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 5/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: 76.9%
MATH Level 5: N/A
AIME 2025: 95.3%

Pricing

Input: $2.50/MTok
Output: $15.00/MTok
Context Window: 1,050K


Task Analysis

What Tool Calling demands: selecting the correct tool, producing accurate arguments and types, ordering calls correctly, and recovering from intermediate failures. The capabilities that matter are high tool_calling performance, reliable structured_output/JSON compliance, robust sequencing and agentic planning, clear reasoning about API arguments, and stable refusal behavior when tools should not be invoked.

In our testing, Sonnet 4.6 scored 5/5 on tool_calling (rank 1 of 52) while GPT-5.4 scored 4/5 (rank 18 of 52). On structured_output (JSON/schema compliance), GPT-5.4 scores 5/5 versus Sonnet 4.6's 4/5, so GPT-5.4 is stronger at strict format adherence.

Both models expose the parameters relevant to tool workflows: Sonnet lists support for response_format, structured_outputs, tool_choice, and tools; GPT-5.4 supports structured_outputs, tool_choice, and tools as well. Both also have very large context windows (Sonnet: 1,000,000 tokens; GPT-5.4: 1,050,000) and large maximum output lengths, which helps multi-step tool orchestration.

As supplementary external data, Sonnet scores 75.2% and GPT-5.4 scores 76.9% on SWE-bench Verified (Epoch AI): a small gap that favors GPT-5.4 on some coding-resolution tasks but does not overturn our internal tool_calling result.
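The argument-accuracy behavior described above can be sketched without any provider SDK. The snippet below shows a tool definition in the JSON-Schema style that both models' tools parameters accept, plus a tiny validator that catches malformed calls before execution. The get_weather tool and the validate_args helper are hypothetical illustrations, not either vendor's API.

```python
# Hypothetical tool definition in the JSON-Schema style used by the
# `tools` parameter, plus a minimal argument validator.

GET_WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def validate_args(tool: dict, args: dict) -> list:
    """Return a list of problems with a proposed tool call (empty = OK)."""
    schema = tool["input_schema"]
    problems = []
    for field in schema.get("required", []):
        if field not in args:
            problems.append(f"missing required field: {field}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            # Strict schemas (GPT-5.4's structured_output strength) reject extras.
            problems.append(f"unexpected field: {key}")
            continue
        if spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"{key}: expected string")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}: not an allowed value")
    return problems

print(validate_args(GET_WEATHER_TOOL, {"city": "Oslo", "units": "celsius"}))  # []
print(validate_args(GET_WEATHER_TOOL, {"units": "kelvin"}))
```

A well-formed call passes cleanly; the second call surfaces both a missing required field and a disallowed enum value, which is exactly the kind of mistake our tool_calling benchmark penalizes.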

Practical Examples

Where Claude Sonnet 4.6 shines (tool_calling 5 vs 4):

  • Complex agent orchestration: choosing a sequence of three different APIs where argument formats vary and results feed into subsequent calls. Sonnet's 5/5 tool_calling score indicates it more reliably selects the right functions and sequences the calls.
  • Ambiguous argument resolution: the user gives partial specs and the model must infer types and units and fill missing fields. Sonnet's higher tool_calling and agentic_planning scores reduce malformed calls.
  • Failure recovery: Sonnet's top agentic_planning (5/5) and tool_calling (5/5) scores make it more likely to detect a failed tool call and retry with corrected arguments.

Where GPT-5.4 shines (structured_output 5 vs 4, plus external scores):

  • Strict schema output: when every tool call requires exact JSON schema compliance with no tolerance for extra keys, GPT-5.4's structured_output 5/5 makes it the safer pick.
  • Code-heavy integrations: GPT-5.4's slightly higher SWE-bench Verified score (76.9% vs 75.2%, Epoch AI) and much stronger AIME 2025 score (95.3% vs 85.8%) suggest strengths on precise, technical outputs, which helps when tool arguments are code-like or math-heavy.

Concrete numeric anchors: tool_calling Sonnet 5 vs GPT-5.4 4; structured_output Sonnet 4 vs GPT-5.4 5; SWE-bench Verified (Epoch AI) Sonnet 75.2% vs GPT-5.4 76.9%.
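The failure-recovery pattern credited to both models above reduces to a simple orchestration loop: execute a proposed call, and on error feed the message back for a corrected retry. Everything here is a hypothetical stub; in a real agent, propose_fix would be another model turn that reads the error and the original user request.

```python
# Hypothetical sketch of a retry-on-failure tool loop. `call_tool` and
# `propose_fix` are stubs that stand in for a real executor and a real
# corrective model turn; only the control flow is the point.

def call_tool(name, args):
    """Stub executor: rejects get_weather calls missing the 'city' argument."""
    if name == "get_weather" and "city" not in args:
        raise ValueError("missing required field: city")
    return {"temp_c": 7, "city": args["city"]}

def propose_fix(args, error):
    """Stand-in for the model's corrective turn after seeing the error."""
    if "city" in error:
        return {**args, "city": "Oslo"}  # model re-reads the user request
    return args

def run_with_recovery(name, args, max_retries=2):
    for attempt in range(max_retries + 1):
        try:
            return call_tool(name, args)
        except ValueError as exc:
            if attempt == max_retries:
                raise  # give up after the retry budget is spent
            args = propose_fix(args, str(exc))  # retry with corrected args

print(run_with_recovery("get_weather", {"units": "celsius"}))
```

The first attempt fails on the missing city, the stubbed correction supplies it, and the second attempt succeeds. How gracefully a model plays the propose_fix role is what separates a 5/5 from a 4/5 on this benchmark.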

Bottom Line

For Tool Calling, choose Claude Sonnet 4.6 if you need the most reliable function selection, argument accuracy, and multi-step sequencing (Sonnet 4.6: 5/5, rank 1 of 52). Choose GPT-5.4 if your integrations demand exact JSON/schema compliance or highly code-like argument formatting (GPT-5.4: structured_output 5/5; tool_calling 4/5).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions