Claude Sonnet 4.6 vs GPT-5.4 for Creative Problem Solving
Winner: Claude Sonnet 4.6. In our testing, Sonnet 4.6 scores 5/5 vs GPT-5.4's 4/5 on Creative Problem Solving (non-obvious, specific, feasible ideas). Sonnet leads on tool calling (5 vs 4), ties on agentic planning, strategic analysis, and faithfulness (5 vs 5 each), and ranks #1 of 52 models for the task, while GPT-5.4 ranks 9th of 52. No single external benchmark is treated as primary on this page, but among external probes GPT-5.4 posts a higher AIME score and a slightly higher SWE-bench score (AIME 95.3 vs 85.8; SWE-bench Verified 76.9 vs 75.2, per Epoch AI), which can matter for math- or code-centric problem solving. Overall, for open-ended creative idea generation that benefits from iterative tool use and actionable plans, Sonnet 4.6 is the better choice in our benchmarks.
Claude Sonnet 4.6 (Anthropic)
Pricing: Input $3.00/MTok, Output $15.00/MTok
GPT-5.4 (OpenAI)
Pricing: Input $2.50/MTok, Output $15.00/MTok
Task Analysis
What Creative Problem Solving demands: generation of non-obvious, specific, and feasible ideas; credible feasibility checks; decomposition into workable steps; adaptation under new constraints; and clear, machine-usable outputs for follow-up (e.g., code, experiments, or plans).

Key capabilities that matter:
- Idea diversity and novelty (creative_problem_solving score).
- Tool calling and agentic planning for iterative exploration and prototyping.
- Strategic analysis to trade off feasibility, cost, and risk.
- Structured output when ideas must map to precise formats (JSON, checklists).
- Faithfulness to avoid hallucinated constraints or false assumptions.

In our testing Sonnet 4.6 posts a 5/5 on creative_problem_solving (rank 1/52) and a 5 on both tool_calling and agentic_planning, supporting iterative, grounded ideation. GPT-5.4 scores 4/5 on creative_problem_solving but higher on structured_output (5 vs Sonnet's 4), making it stronger when you need exact schemas or rigid deliverables. External probes from Epoch AI (SWE-bench Verified: GPT-5.4 76.9% vs Sonnet 75.2%; AIME 2025: GPT-5.4 95.3% vs Sonnet 85.8%) offer supplementary evidence that GPT-5.4 is relatively stronger on formal mathematical and coding reasoning, but those are supporting signals, not the primary Creative Problem Solving outcome in our suite.
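To make the tool-calling dimension concrete, here is a minimal sketch of the kind of iterative, tool-grounded ideation loop this task rewards, written against the Anthropic Python SDK. The model id and the check_feasibility tool (its name, schema, and scoring) are illustrative assumptions, not part of our benchmark harness.

```python
# Sketch: the model proposes ideas, calls a feasibility-check tool,
# and refines based on the result. check_feasibility is a hypothetical
# local stand-in, not a real API.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "check_feasibility",
    "description": "Score a product idea for feasibility on a 1-5 scale.",
    "input_schema": {
        "type": "object",
        "properties": {"idea": {"type": "string"}},
        "required": ["idea"],
    },
}]

def check_feasibility(idea: str) -> str:
    # Stand-in for a real check (search, cost model, expert heuristic).
    return f"Feasibility 4/5: '{idea[:40]}' needs a pricing experiment."

messages = [{"role": "user",
             "content": "Propose three non-obvious retention ideas and check each."}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # assumed id; use the alias your account exposes
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # model is done refining; final ideas are in response.content
    # Echo the assistant turn, then answer every tool call it made.
    messages.append({"role": "assistant", "content": response.content})
    results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": check_feasibility(block.input["idea"])}
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})

print(response.content[0].text)
```

The loop is what the tool_calling and agentic_planning scores are probing: picking the right function, sequencing calls, and folding the results back into the next round of ideas.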
Practical Examples
Where Claude Sonnet 4.6 shines (based on scores):
- Iterative product ideation that uses tools or APIs to prototype concepts: Sonnet 4.6 (tool_calling 5, agentic_planning 5) will better select functions, sequence calls, and refine ideas through cycles.
- Open-ended strategy sessions requiring tradeoffs and novel angles: Sonnet's creative_problem_solving 5 and strategic_analysis 5 produce non-obvious, feasible options.
- Multilingual or persona-driven brainstorming: Sonnet scores 5 on multilingual and persona_consistency, preserving nuance across languages and roles.

Where GPT-5.4 is preferable:
- Deliverables that require strict schema adherence (product specs, exact JSON tasks): GPT-5.4 scores 5 on structured_output vs Sonnet's 4.
- Analytic, math-heavy problem solving where external probes matter: GPT-5.4 posts a higher AIME score (95.3% vs 85.8%) and a slightly higher SWE-bench Verified score (76.9% vs 75.2%) per Epoch AI, so for formal proofs or rigorous numerical derivations GPT-5.4 can be stronger.

Cost and context tradeoffs: both models offer large context windows (~1M tokens) and identical output pricing ($15/MTok); Sonnet's input price is $3.00/MTok vs GPT-5.4's $2.50/MTok, so iterative workflows with heavy input may cost slightly more on Sonnet (see the worked example below).
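To put that input-price difference in perspective, here is a back-of-envelope calculation using the listed prices; the token counts are made-up workload assumptions, not measurements from our suite.

```python
# Back-of-envelope cost comparison for an iterative ideation workflow.
# Prices come from the cards above; token counts are illustrative assumptions.
MTOK = 1_000_000

def run_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars for one workflow run at $/MTok prices."""
    return input_tokens / MTOK * in_price + output_tokens / MTOK * out_price

# Assume 5 refinement rounds, each re-sending ~40k tokens of context
# and producing ~2k tokens of ideas: 200k input, 10k output per run.
sonnet = run_cost(200_000, 10_000, in_price=3.00, out_price=15.00)
gpt54 = run_cost(200_000, 10_000, in_price=2.50, out_price=15.00)

print(f"Sonnet 4.6: ${sonnet:.3f}  GPT-5.4: ${gpt54:.3f}")
# Sonnet 4.6: $0.750  GPT-5.4: $0.650
```

The gap scales linearly with input volume, so it only becomes material for workflows that re-send very large contexts many times.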
Bottom Line
For Creative Problem Solving, choose Claude Sonnet 4.6 if you need non‑obvious, actionable ideas with strong tool integration, iterative prototyping, and high task rank (Sonnet = 5/5, rank 1/52). Choose GPT-5.4 if you need strict, schema‑accurate outputs or higher performance on formal math/code probes (GPT-5.4 = 4/5 on creative problem solving but stronger structured_output and higher AIME/SWE-bench scores according to Epoch AI).
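Whichever model you pick for schema-strict deliverables, it is cheap to verify the output before accepting it. Here is a minimal validation sketch using the jsonschema package; the idea schema and sample payload are hypothetical.

```python
# Validate an idea deliverable against a strict schema before accepting it.
# The schema and the candidate payload are illustrative assumptions.
from jsonschema import ValidationError, validate

IDEA_SCHEMA = {
    "type": "object",
    "properties": {
        "idea": {"type": "string", "minLength": 10},
        "feasibility": {"type": "integer", "minimum": 1, "maximum": 5},
        "next_steps": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "required": ["idea", "feasibility", "next_steps"],
    "additionalProperties": False,
}

candidate = {
    "idea": "Usage-based onboarding checklist that unlocks features",
    "feasibility": 4,
    "next_steps": ["Prototype checklist UI", "A/B test on new signups"],
}

try:
    validate(instance=candidate, schema=IDEA_SCHEMA)
    print("Deliverable matches the schema.")
except ValidationError as err:
    print(f"Reject and re-prompt: {err.message}")
```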
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.