Best AI for Creative Writing

Creative writing is one of the most demanding tests for an AI model — and one where model choice makes the largest measurable difference. Generic capabilities like factual recall or code execution matter far less here than three specific skills: creative problem-solving (generating non-obvious, specific, and feasible ideas), persona consistency (maintaining character voice and resisting prompt injection), and constrained rewriting (hitting hard word or character limits while preserving meaning and style). A model that scores 5/5 on all three will produce fiction, narrative, and creative content that feels intentional and controlled. A model that scores 3/5 on constrained rewriting will truncate your ending or ignore your format requirements.

Our rankings are based on our own 12-test benchmark suite, scored 1–5. The three tests directly relevant to creative writing — creative_problem_solving, persona_consistency, and constrained_rewriting — determine the task score. The overall average across all 12 tests provides secondary context on general model quality. No external benchmark (such as SWE-bench Verified or AIME 2025) was designated as the primary signal for this task; our internal creative writing proxy tests are the primary evidence. Where models in the payload also carry external benchmark scores, we reference them as supplementary data points.
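The task score is just the mean of the three relevant sub-scores, rounded to two decimals. A minimal sketch of that arithmetic (the function name is ours, not part of the benchmark tooling):

```python
# Task score = mean of the three creative-writing sub-scores (1-5 scale).
# Function and parameter names are illustrative, not from the benchmark suite.

def task_score(creative_problem_solving: int,
               persona_consistency: int,
               constrained_rewriting: int) -> float:
    """Average the three sub-scores and round to two decimals."""
    return round((creative_problem_solving
                  + persona_consistency
                  + constrained_rewriting) / 3, 2)

# Top tier (e.g. GPT-5.2): 5, 5, and 4 on the three tests.
print(task_score(5, 5, 4))  # -> 4.67
# Second tier (e.g. Claude Opus 4.6): 5, 5, 3.
print(task_score(5, 5, 3))  # -> 4.33
```

This is why a 4.67 can hide a miss: two perfect sub-scores plus a 4/5 on the hardest test average out to the top score in the table.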

Our Pick

GPT-5.2 (OpenAI)

Overall: 4.67/5 (Strong)
Input: $1.75/MTok
Output: $14.00/MTok
Context window: 400K

Results

The creative writing rankings reveal a three-way tie at the top and a surprisingly crowded field overall. Three models share the highest task score of 4.67/5 in our testing: GPT-5.2, Gemini 3.1 Pro Preview, and Gemini 3 Flash Preview. All three scored 5/5 on both creative_problem_solving and persona_consistency. The differentiator is constrained_rewriting, where all three scored 4/5 — meaning none of the top-tier models achieved a perfect sweep on that harder compression test.

Within the tied group, output cost becomes the deciding factor per our ranking methodology. GPT-5.2 costs $14/MTok output, Gemini 3.1 Pro Preview costs $12/MTok, and Gemini 3 Flash Preview costs just $3/MTok. GPT-5.2 and Gemini 3.1 Pro Preview deliver the same creative writing task score at a price premium over Flash — making Gemini 3 Flash Preview the standout value pick among the top scorers.

The next tier (score 4.33/5) is enormous: 22 models share it, including Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, GPT-5.1, and DeepSeek R1. Within this group, the creative writing sub-scores tell a more nuanced story. Claude Opus 4.6 scored 5/5 on creative_problem_solving and persona_consistency but only 3/5 on constrained_rewriting, the same pattern as Claude Sonnet 4.6. DeepSeek R1 matches that pattern (5/5, 5/5, and 3/5 on constrained_rewriting), but at $2.50/MTok output it is considerably cheaper than the Anthropic options. GPT-4.1 is an interesting outlier in this tier: it scored 5/5 on constrained_rewriting (tied with Devstral 2 2512 for the highest constrained_rewriting score in the dataset) but only 3/5 on creative_problem_solving, making it the best choice specifically for format-constrained work like fixed-form poetry or tight copy, but a weaker pick for open-ended fiction.

Devstral 2 2512, our designated budget pick, also sits at 4.33/5 with a notable 5/5 on constrained_rewriting and $2.00/MTok output — the cheapest model at this score tier, and a strong choice for developers running high-volume creative generation pipelines via LLM API.

On supplementary external benchmarks: several models in the payload carry AIME 2025 scores (Epoch AI), which measure mathematical reasoning rather than creative fluency — these are not predictive of creative writing performance and are included for completeness only. GPT-5.2 scores 96.1% on AIME 2025 and 73.8% on SWE-bench Verified (Epoch AI); Gemini 3 Flash Preview scores 92.8% on AIME 2025 and 75.4% on SWE-bench Verified (Epoch AI). These figures confirm both are strong general-purpose models, but they do not change the creative writing ranking.

Budget Guide

For the highest creative writing task score in our testing, use GPT-5.2, Gemini 3.1 Pro Preview, or Gemini 3 Flash Preview — all tied at 4.67/5. Among those three, Gemini 3 Flash Preview delivers identical benchmark performance at $3/MTok output versus $14/MTok for GPT-5.2 and $12/MTok for Gemini 3.1 Pro Preview. If you need the top score and want to minimize cost, Gemini 3 Flash Preview is the clear choice.

For the 4.33/5 tier, which covers 22 models one step below the top, the best cost-quality balance is DeepSeek R1 at $2.50/MTok output (5/5 creative_problem_solving, 5/5 persona_consistency, 3/5 constrained_rewriting) or Devstral 2 2512 at $2.00/MTok output (4/5 creative_problem_solving, 4/5 persona_consistency, 5/5 constrained_rewriting). Devstral 2 2512 is the cheapest model at this score level and is particularly well-suited for constrained formats. For developers prioritizing ultra-low cost with acceptable quality, DeepSeek V3.2 at $0.38/MTok output and DeepSeek V3.1 at $0.75/MTok output both reach 4.33/5 with 5/5 scores on persona_consistency.
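To make the per-MTok differences concrete for a high-volume pipeline, here is a rough output-side cost sketch. The volume figures (100,000 pieces per month at roughly 1,000 output tokens each) are illustrative assumptions; the prices are the per-MTok output figures quoted above:

```python
# Rough monthly output-token cost for a creative-generation pipeline.
# Volume assumptions are illustrative; prices are $/MTok from the rankings.

def monthly_output_cost(pieces_per_month: int,
                        tokens_per_piece: int,
                        price_per_mtok: float) -> float:
    """Output-side cost in dollars (input-token cost not included)."""
    total_tokens = pieces_per_month * tokens_per_piece
    return total_tokens / 1_000_000 * price_per_mtok

# 100,000 pieces/month at ~1,000 output tokens each (100 MTok total):
for name, price in [("GPT-5.2", 14.00),
                    ("DeepSeek R1", 2.50),
                    ("Devstral 2 2512", 2.00),
                    ("DeepSeek V3.2", 0.38)]:
    print(f"{name}: ${monthly_output_cost(100_000, 1_000, price):,.2f}")
```

At that volume the tier-for-tier spread is stark: the same 4.33/5 score costs $1,400/month on GPT-5.2-class pricing but well under $100/month on DeepSeek V3.2.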

Spend the premium on Claude Opus 4.6 ($25/MTok output) only if you specifically need its extended 1M-token context window for long-form narrative work — its creative writing task score of 4.33/5 is the same as models costing a fraction of the price.

Pricing vs Performance

[Chart: output cost per million tokens (log scale) vs average score across our 12 internal benchmarks, with top picks highlighted against other models]

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions