Question 1

Which model wins Persona Consistency in your tests?

Accepted Answer

Neither — Claude Sonnet 4.6 and GPT-5.4 both score 5/5 on our persona_consistency test and share rank 1 of 52. The task is a tie in our testing.

Question 2

If they tie, how should I decide which to deploy for persona-driven chat?

Accepted Answer

Decide by secondary capabilities: Sonnet 4.6 scored 5 on tool_calling (better for agentic workflows that must preserve persona during tools). GPT-5.4 scored 5 on structured_output and supports file inputs (better when strict schema or file-based persona onboarding matters). Also consider input cost (GPT-5.4 is 2.5 per mTok vs Sonnet 3 per mTok).

Question 3

Do either model have weaknesses on persona-related safety?

Accepted Answer

In our testing both models scored 5 on safety_calibration and 5 on faithfulness, indicating strong refusal of harmful requests and adherence to persona constraints on the tests we ran.

Question 4

How do context windows affect persona maintenance?

Accepted Answer

Both models scored 5 on long_context and have ~1,000,000‑token context windows in the payload, so in our tests both handle long persona histories equally well.

Question 5

Are external benchmarks influencing this verdict?

Accepted Answer

No. externalBenchmark is null in the payload, so our internal 5/5 persona_consistency scores are the basis for the verdict. We report tied top performance.

Claude Sonnet 4.6 vs GPT-5.4 for Persona Consistency

Claude Sonnet 4.6

GPT-5.4

Task Analysis

Practical Examples

Bottom Line

How We Test

Frequently Asked Questions