Grok 4 vs Grok 4.20

For most production and agentic use cases, choose Grok 4.20: it wins more head-to-head tests (4 wins to Grok 4's 1), is stronger at tool calling and structured output, and is substantially cheaper. Choose Grok 4 only if you prioritize its slightly higher safety calibration score and are willing to pay a premium.

xai

Grok 4

Overall: 4.08/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 3/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $3.00/MTok
Output: $15.00/MTok
Context Window: 256K

modelpicker.net

xai

Grok 4.20

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $2.00/MTok
Output: $6.00/MTok
Context Window: 2000K


Benchmark Analysis

Head-to-head summary from our 12-test suite: Grok 4.20 wins 4 tests (structured output 5 vs 4, creative problem solving 4 vs 3, tool calling 5 vs 4, agentic planning 4 vs 3). Grok 4 wins safety calibration (2 vs 1). Seven tests tie. Detailed walk-through:

  • Tool calling: Grok 4.20 scores 5 vs Grok 4's 4. In our rankings Grok 4.20 is tied for 1st (tied with 16 others out of 54) while Grok 4 ranks 18 of 54. This matters for function selection, argument accuracy and sequencing — Grok 4.20 is the safer pick for multi-step agent workflows.
  • Structured output: Grok 4.20 scores 5 vs Grok 4's 4; Grok 4.20 is tied for 1st (with 24 others) while Grok 4 ranks 26. For strict JSON/schema compliance, Grok 4.20 produces more reliably formatted outputs.
  • Creative problem solving: Grok 4.20 scores 4 vs Grok 4's 3; Grok 4.20 ranks 9 of 54 while Grok 4 ranks 30. If you need non-obvious, feasible ideas, Grok 4.20 performs better in our tests.
  • Agentic planning: Grok 4.20 scores 4 vs Grok 4's 3; Grok 4.20 ranks 16 of 54 while Grok 4 ranks 42. For goal decomposition and failure recovery, Grok 4.20 shows stronger planning behavior.
  • Safety calibration: Grok 4 leads 2 vs 1; Grok 4 ranks 12 of 55 vs Grok 4.20 at 32. If your highest priority is refuse/permit accuracy in risky prompts, Grok 4 scored higher in our safety calibration test.
  • Ties: strategic analysis (5), constrained rewriting (4), faithfulness (5), classification (4), long context (5), persona consistency (5), multilingual (5). Both models scored 5 on long context and faithfulness and are tied for 1st in those categories.

Context window: Grok 4 has a 256,000-token window; Grok 4.20 has a 2,000,000-token window. Both scored 5 on long context in our tests, but Grok 4.20's larger window makes it better suited to extremely large documents or multi-document retrieval pipelines.

Benchmark | Grok 4 | Grok 4.20
Faithfulness | 5/5 | 5/5
Long Context | 5/5 | 5/5
Multilingual | 5/5 | 5/5
Tool Calling | 4/5 | 5/5
Classification | 4/5 | 4/5
Agentic Planning | 3/5 | 4/5
Structured Output | 4/5 | 5/5
Safety Calibration | 2/5 | 1/5
Strategic Analysis | 5/5 | 5/5
Persona Consistency | 5/5 | 5/5
Constrained Rewriting | 4/5 | 4/5
Creative Problem Solving | 3/5 | 4/5
Summary | 1 win | 4 wins
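
The head-to-head tally above can be reproduced directly from the per-test scores. The following is a minimal sketch, not our scoring pipeline: the dictionary names and the `tally` and `mean_score` helpers are hypothetical, and treating the Overall figure as an unweighted mean of the 12 scores is an assumption (it happens to match the 4.08 and 4.33 shown on the cards).

```python
# Scores copied from the benchmark table above (each out of 5).
GROK_4 = {
    "Faithfulness": 5, "Long Context": 5, "Multilingual": 5,
    "Tool Calling": 4, "Classification": 4, "Agentic Planning": 3,
    "Structured Output": 4, "Safety Calibration": 2,
    "Strategic Analysis": 5, "Persona Consistency": 5,
    "Constrained Rewriting": 4, "Creative Problem Solving": 3,
}
GROK_4_20 = {
    "Faithfulness": 5, "Long Context": 5, "Multilingual": 5,
    "Tool Calling": 5, "Classification": 4, "Agentic Planning": 4,
    "Structured Output": 5, "Safety Calibration": 1,
    "Strategic Analysis": 5, "Persona Consistency": 5,
    "Constrained Rewriting": 4, "Creative Problem Solving": 4,
}


def tally(a, b):
    """Return (wins for a, wins for b, ties) across the tests in a."""
    wins_a = sum(1 for test in a if a[test] > b[test])
    wins_b = sum(1 for test in a if a[test] < b[test])
    ties = sum(1 for test in a if a[test] == b[test])
    return wins_a, wins_b, ties


def mean_score(scores):
    # Assumption: "Overall" is an unweighted mean rounded to two decimals.
    return round(sum(scores.values()) / len(scores), 2)


if __name__ == "__main__":
    print(tally(GROK_4, GROK_4_20))                    # (1, 4, 7)
    print(mean_score(GROK_4), mean_score(GROK_4_20))   # 4.08 4.33
```

Because seven of the twelve tests tie, the 4-to-1 win count rests on five one-point differences, which is worth keeping in mind when weighing the overall scores.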

Pricing Analysis

Prices are quoted per MTok (one million tokens). Grok 4 charges $3 input / $15 output per MTok, while Grok 4.20 charges $2 input / $6 output per MTok. If you send 1M input tokens and receive 1M output tokens per month (a 1:1 split), Grok 4 costs $3 (input) + $15 (output) = $18; Grok 4.20 costs $2 + $6 = $8, a $10 monthly saving. At 10M tokens each way, multiply those by 10: $180 vs $80 (save $100). At 100M tokens each way, it's $1,800 vs $800 (save $1,000). Grok 4's output rate is 2.5x Grok 4.20's ($15 vs $6), and its input rate is 1.5x ($3 vs $2). The gap grows linearly with volume, so high-volume API users will feel it most; small-scale testers or hobbyists will see small absolute differences but should still note the 2.5x output cost gap.
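
The arithmetic above can be sketched in a few lines. This is an illustration only: it reads the listed rates as dollars per million tokens, which is what the $/MTok label on the pricing cards and the per-task figures in the cost table suggest. The `Pricing` class and its field names are hypothetical, not part of any xAI SDK.

```python
from dataclasses import dataclass

TOKENS_PER_MTOK = 1_000_000  # assumption: MTok means one million tokens


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the USD cost of a given input/output token volume."""
        total = (input_tokens * self.input_per_mtok
                 + output_tokens * self.output_per_mtok)
        return total / TOKENS_PER_MTOK


# Rates taken from the pricing cards above.
GROK_4 = Pricing(input_per_mtok=3.00, output_per_mtok=15.00)
GROK_4_20 = Pricing(input_per_mtok=2.00, output_per_mtok=6.00)

if __name__ == "__main__":
    # 1:1 input/output split at three monthly volumes.
    for volume in (1_000_000, 10_000_000, 100_000_000):
        a = GROK_4.cost(volume, volume)
        b = GROK_4_20.cost(volume, volume)
        print(f"{volume:,} tokens each way: "
              f"${a:,.2f} vs ${b:,.2f} (save ${a - b:,.2f})")
    # 1M each way -> $18.00 vs $8.00 (save $10.00)
```

Real workloads rarely split 1:1, so it is worth plugging in your own input and output volumes; output-heavy traffic widens the gap toward 2.5x, while input-heavy traffic narrows it toward 1.5x.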

Real-World Cost Comparison

Task | Grok 4 | Grok 4.20
Chat response | $0.0081 | $0.0034
Blog post | $0.032 | $0.013
Document batch | $0.810 | $0.340
Pipeline run | $8.10 | $3.40

Bottom Line

Choose Grok 4.20 if you need cheaper inference at scale, best-in-class tool calling and structured outputs, stronger creative problem solving, or agentic planning (it wins 4 tests vs Grok 4's 1). Use cases: production agents, function-calling orchestration, heavy-document assistants, and high-volume APIs. Choose Grok 4 if your top priority is slightly better safety calibration and you can accept a much higher per-token bill (Grok 4 output $15/MTok vs Grok 4.20 $6/MTok). Use cases: niche safety-sensitive tasks where that single-point safety improvement (score 2 vs 1) matters more than cost or tooling.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions