Claude Sonnet 4.6 vs GPT-5.4 for Persona Consistency

Tie — In our testing both Claude Sonnet 4.6 and GPT-5.4 achieve the top Persona Consistency score (5/5) and share rank 1 of 52. Neither model outscored the other on the persona_consistency test; choose based on secondary strengths (tool calling, structured output, file modality, and pricing) rather than persona score itself.

anthropic

Claude Sonnet 4.6

Overall
4.67/5Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
75.2%
MATH Level 5
N/A
AIME 2025
85.8%

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window1000K

modelpicker.net

openai

GPT-5.4

Overall
4.58/5Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
76.9%
MATH Level 5
N/A
AIME 2025
95.3%

Pricing

Input

$2.50/MTok

Output

$15.00/MTok

Context Window1050K

modelpicker.net

Task Analysis

Persona Consistency demands (per our benchmarkDescriptions) that a model maintain character and resist injection. Key capabilities that matter are: safety_calibration (refusing harmful or irrelevant persona overrides), long_context (tracking persona across 30K+ tokens), faithfulness (sticking to the defined character), tool_calling (preserving persona during tool interactions), and structured_output (ensuring persona-aligned fields in schemas). With no external benchmark provided, our internal task scores are primary evidence: both models scored 5/5 on persona_consistency and are tied at rank 1. Supporting internal signals show differences useful for deployment decisions: Sonnet 4.6 scores 5 on tool_calling (helpful when agent/tool sequences must preserve persona) while GPT-5.4 scores 5 on structured_output (helpful when strict schema compliance must reflect persona). Both models score 5 on safety_calibration, faithfulness, and long_context in our tests — all core abilities for resisting persona injection.

Practical Examples

  1. Multi-step agent with persona-bound tool calls — Sonnet 4.6 shines: persona_consistency 5 plus tool_calling 5 means it kept character across function selection and arguments in our tests. 2) API that must return strict JSON blocks with persona fields — GPT-5.4 shines: both models are 5/5 on persona_consistency, but GPT-5.4 has structured_output 5 (vs Sonnet's 4), so it performed better on schema adherence while preserving persona. 3) Long chat history and role-play — both models perform equally: persona_consistency 5 and long_context 5. 4) File-based onboarding of persona documents — prefer GPT-5.4 because its modality is text+image+file->text (Sonnet is text+image->text). 5) Cost-sensitive input-heavy workloads — GPT-5.4 has lower input cost (2.5 per mTok vs Sonnet 3 per mTok); both have equal output cost (15 per mTok). Use these concrete score differences when mapping to your product flow.

Bottom Line

For Persona Consistency, choose Claude Sonnet 4.6 if you need best-in-class tool calling while preserving persona (Sonnet: persona_consistency 5, tool_calling 5). Choose GPT-5.4 if you need strict structured output or file-based persona onboarding (GPT-5.4: persona_consistency 5, structured_output 5, modality includes file inputs). Both models tie on core persona metrics in our tests, so pick by integration, schema, tool workflow, or input-cost tradeoffs.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions