R1 0528 vs GPT-5.4 for Creative Writing
Winner: GPT-5.4. In our testing the two models tie on the Creative Writing composite (4.3333 each), but GPT-5.4 is the safer, more reliable choice for fiction tasks that require constrained rewriting or strict structured outputs: R1 0528 has a documented quirk of returning empty responses on constrained_rewriting and structured_output, while GPT-5.4 scores 5/5 on both structured_output and safety_calibration in our tests versus R1's 4/5 on each. If cost is the primary driver, R1 0528 is far cheaper (output $2.15/MTok vs $15.00/MTok), but reliability and format guarantees make GPT-5.4 the practical winner for Creative Writing workflows that demand predictable, non-empty outputs.
Pricing (per modelpicker.net)
R1 0528 (deepseek): Input $0.50/MTok, Output $2.15/MTok
GPT-5.4 (openai): Input $2.50/MTok, Output $15.00/MTok
Task Analysis
Creative Writing (fiction, storytelling, creative content) demands persona_consistency (holding character and voice), creative_problem_solving (novel ideas and plot developments), and constrained_rewriting (compressing or rewriting to tight length and format rules). The key LLM capabilities are strong persona_consistency, long_context handling for serial work, reliable constrained_rewriting and structured_output (for formatted scenes, subtitles, or publication-ready snippets), and safety_calibration to avoid inappropriate content.

In our testing the primary task composite is tied: R1 0528 and GPT-5.4 both score 4.3333 and both rank 5 of 52. Supporting internal signals mostly match: both models score 5/5 on persona_consistency and long_context, and 4/5 on constrained_rewriting and creative_problem_solving. Where they diverge: GPT-5.4 scores 5/5 on structured_output and safety_calibration, whereas R1 0528 scores 4/5 on both. Crucially, R1 0528's quirks note that it "Returns empty responses on structured_output, constrained_rewriting, and agentic_planning," which directly impacts Creative Writing tasks that rely on constrained rewrites or strict formats.

Cost and tooling matter too: R1 0528 is materially cheaper (input $0.50/MTok, output $2.15/MTok) and scores 5/5 on tool_calling in our tests (helpful for tool-assisted workflows), while GPT-5.4 supports multimodal input and a much larger context window (1,050,000 tokens) at higher cost (input $2.50/MTok, output $15.00/MTok).
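The pricing gap above can be made concrete with a quick back-of-the-envelope calculation. This is an illustrative sketch: the prices come from the comparison, but the draft count and draft length are assumed values, not part of our benchmark.

```python
# Rough output-token cost comparison for a high-volume drafting workflow.
# Prices ($ per million output tokens) are from the comparison above;
# the draft count and tokens-per-draft below are illustrative assumptions.

PRICES_PER_MTOK = {"R1 0528": 2.15, "GPT-5.4": 15.00}

def output_cost(model: str, drafts: int, tokens_per_draft: int) -> float:
    """Dollar cost of the output tokens for generating `drafts` drafts."""
    total_tokens = drafts * tokens_per_draft
    return total_tokens / 1_000_000 * PRICES_PER_MTOK[model]

# Example: 200 iterative drafts of ~1,500 tokens each.
for model in PRICES_PER_MTOK:
    print(f"{model}: ${output_cost(model, 200, 1500):.2f}")
```

At these assumed volumes the same drafting budget costs roughly seven times more on GPT-5.4, which is why R1 0528 wins on pure iteration volume despite the tied composite.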
Practical Examples
Where GPT-5.4 shines: 1) Tight microfiction rewrites: GPT-5.4 scored 5/5 on structured_output and 5/5 on safety_calibration in our testing, so it reliably produces non-empty, correctly formatted 280-character flash fiction suitable for publication. 2) Serialized novel drafting with strict scene formatting: GPT-5.4's 1,050,000-token context window and 5/5 long_context score make it better for multi-chapter continuity and long-form edits.

Where R1 0528 shines: 1) High-volume idea generation or iterative drafts where cost matters: at $2.15/MTok output versus GPT-5.4's $15.00/MTok, and with an identical Creative Writing composite (4.3333), you can produce many drafts cheaply. 2) Tool-driven creative workflows: R1 scored 5/5 on tool_calling in our testing, useful when you call external style-checking or publishing tools.

Caveat: for tasks that require constrained rewriting or strict structured JSON outputs, R1 0528's documented quirk can return empty responses, making GPT-5.4 the safer pick for final-format outputs.
Bottom Line
For Creative Writing, choose R1 0528 if cost and tool-driven iteration are primary: output costs $2.15/MTok, with strong persona_consistency (5/5), long_context (5/5), and tool_calling (5/5) in our tests. Choose GPT-5.4 if you need guaranteed, publication-ready constrained rewrites or strict structured outputs and stronger safety and format reliability: it scored 5/5 on structured_output and safety_calibration in our tests and avoids R1's empty-output quirk, but expect higher costs ($15.00/MTok output). Both models tie on the overall Creative Writing composite (4.3333), so pick based on reliability needs versus budget.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.