Grok 3 vs Grok 4.20
Grok 3 (xAI)
Pricing: input $3.00/MTok, output $15.00/MTok

Grok 4.20 (xAI)
Pricing: input $2.00/MTok, output $6.00/MTok

modelpicker.net
Benchmark Analysis
Across our 12-test suite, Grok 4.20 wins three benchmarks, Grok 3 wins two, and the remaining seven tie.

Grok 4.20's wins: Tool calling (5 vs 4): Grok 4.20 is tied for 1st with 16 other models, while Grok 3 sits at rank 18 (many models share that score). Constrained rewriting (4 vs 3; rank 6 vs rank 31): Grok 4.20 is measurably better at hard compression and strict character limits. Creative problem solving (4 vs 3; rank 9 vs rank 30): Grok 4.20 shows stronger ideation and more non-obvious solutions.

Grok 3's wins: Safety calibration (2 vs 1; rank 12 of 55 vs rank 32): in our testing, Grok 3 more reliably rejects harmful requests while permitting legitimate ones. Agentic planning (5 vs 4; tied for 1st vs rank 16): Grok 3 shows better goal decomposition and failure recovery under our tests.

The seven ties: structured output (5/5), strategic analysis (5/5), faithfulness (5/5), classification (4/4), long context (5/5), persona consistency (5/5), and multilingual (5/5); both models are tied for 1st in each category in our testing. Practically, this means both models are equally reliable for long-context retrieval, format-adherent output, faithfulness to sources, multilingual output, and classification tasks, while Grok 4.20 clearly pulls ahead for tool integration, content compression, and creative ideation, and Grok 3 retains advantages for safety-sensitive and complex planning tasks.
Pricing Analysis
Direct per-MTok prices from the payload: Grok 3 input $3.00 / output $15.00; Grok 4.20 input $2.00 / output $6.00 (1 MTok = 1 million tokens). For a workload of 10M input + 10M output tokens per month, Grok 3 costs $30 + $150 = $180 vs Grok 4.20's $20 + $60 = $80; at 100M each, it's $1,800 vs $800. The output price dominates the gap (Grok 3 $15 vs Grok 4.20 $6 per MTok), so high-volume applications, startups, and embedded products should care: Grok 4.20 cuts raw token spend by roughly 56% at a 1:1 input/output mix (33% for input-only workloads, up to 60% for output-heavy ones). If absolute per-response fidelity for high-risk content (safety, agentic planning) is critical, Grok 3's higher cost may be justified; otherwise, Grok 4.20 offers far better price-to-performance at scale.
Real-World Cost Comparison
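The arithmetic above can be sketched in a few lines; the prices come from the comparison cards, while the model keys and monthly volumes are illustrative assumptions:

```python
# Per-MTok prices from the comparison above (1 MTok = 1,000,000 tokens).
PRICES = {
    "grok-3":    {"input": 3.00, "output": 15.00},
    "grok-4.20": {"input": 2.00, "output": 6.00},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Raw token spend in dollars for a monthly volume given in millions of tokens."""
    p = PRICES[model]
    return p["input"] * input_mtok + p["output"] * output_mtok

# Example: 10M input + 10M output tokens per month.
g3 = monthly_cost("grok-3", 10, 10)       # $180.00
g420 = monthly_cost("grok-4.20", 10, 10)  # $80.00
print(f"Grok 3: ${g3:,.2f}  Grok 4.20: ${g420:,.2f}  saving: {1 - g420 / g3:.0%}")
```

Varying the input/output split shows why the mix matters: the saving ranges from 33% (all input) to 60% (all output), since the output price gap is larger than the input gap.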
Bottom Line
Choose Grok 3 if: you need stricter safety calibration and the strongest agentic planning in our tests (safety calibration 2 vs 1; agentic planning 5 vs 4), and you can absorb roughly 2.25× higher token spend at a balanced input/output mix (up to 2.5× for output-heavy workloads). Typical cases: high-risk moderation workloads, mission-critical planning agents, or compliance-focused enterprise pipelines. Choose Grok 4.20 if: you need best-in-class tool calling (5 vs 4), better constrained rewriting and creative problem solving (4 vs 3), multimodal inputs (text + image + file to text), and a much larger context window (2,000,000 vs 131,072 tokens) at a lower cost. Typical cases: developer toolchains, large-codebase assistants, high-volume production apps, and multimodal pipelines.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.