xai
Grok 4.20
Grok 4.20 is xAI's flagship model, described in our data as offering industry-leading speed and agentic tool calling capabilities with a low hallucination rate and strict prompt adherence. It ranked 10th overall out of 52 tested models, making it one of the highest-performing models in our dataset. At $2/M input and $6/M output, it is priced more aggressively than most models in its performance bracket: Claude Sonnet 4.6 charges $15/M output, GPT-5.2 charges $14/M, and Claude Opus 4.6 charges $25/M — all with comparable or slightly higher average scores. Grok 4.20's 2M token context window is the largest in our dataset, enabling processing of extremely long documents. The model has no documented quirks in our payload.
Performance
In our 12-test benchmark suite, Grok 4.20's strongest dimensions are tool calling (5/5, tied for 1st with 16 others out of 54 tested), faithfulness (5/5, tied for 1st with 32 others out of 55), and structured output (5/5, tied for 1st with 24 others out of 54). It also scores 5/5 on multilingual, strategic analysis, persona consistency, and long context, an exceptionally broad spread of top scores. Grok 4.20 does not have external benchmark data (SWE-bench, MATH, or AIME) in our payload. The primary weakness is safety calibration, which scored 1/5 (rank 32 of 55), the lowest possible score and a weakness it shares with several other tested models. Agentic planning and constrained rewriting both scored 4/5, solid but not among the very highest.
Pricing
Grok 4.20 costs $2 per million input tokens and $6 per million output tokens. At typical usage — 1M input + 500K output — total cost is about $5. At 10M input / 5M output per month, expect $50/month. At 100M input / 50M output, roughly $500/month. Compared to bracket peers: Claude Sonnet 4.6 ($15/M output) is 2.5x more expensive per output token. GPT-5.4 ($15/M output) and Claude Opus 4.6 ($25/M output) are even pricier. Among high-performers near rank 10, Grok 4.20's $6/M output is one of the lowest — comparable to R1 0528 ($2.15/M output, avg 4.5) on price, but Grok 4.20 ranks higher overall. No other model in the top-15 overall positions offers output at $6/M or below except R1 0528 and Gemini 3 Flash Preview ($3/M output, rank 5).
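The cost arithmetic above can be sketched as a small helper. The rates are this page's published pricing; the function name and tier list are illustrative:

```python
# Sketch: estimating Grok 4.20 API spend from monthly token volumes.
# Rates come from the pricing above; tiers are the usage examples quoted.
INPUT_RATE = 2.00   # USD per million input tokens
OUTPUT_RATE = 6.00  # USD per million output tokens

def monthly_cost(input_mtok: float, output_mtok: float) -> float:
    """Cost in USD for a month's usage, given token counts in millions."""
    return input_mtok * INPUT_RATE + output_mtok * OUTPUT_RATE

for in_mtok, out_mtok in [(1, 0.5), (10, 5), (100, 50)]:
    print(f"{in_mtok}M in / {out_mtok}M out -> ${monthly_cost(in_mtok, out_mtok):,.2f}")
```

Running this reproduces the three figures quoted above: $5, $50, and $500 per month.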
[Sidebar charts: Benchmark Scores · External Benchmarks · Real-World Costs · Pricing vs Performance (output cost per million tokens, log scale, vs average score across our 12 internal benchmarks) (modelpicker.net)]

Pricing: $2.00/MTok input, $6.00/MTok output
Try It
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API, so the standard client works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="x-ai/grok-4.20",
    messages=[
        {"role": "user", "content": "Hello, Grok 4.20!"}
    ],
)

print(response.choices[0].message.content)

Recommendation
Grok 4.20 is a compelling choice for teams that need top-tier performance without top-tier pricing. Its 5/5 scores on tool calling, faithfulness, multilingual, structured output, and long context in our testing cover the most common enterprise use cases: agentic pipelines, document analysis, multilingual content, and structured data extraction. The $6/M output price is well below what comparable-performing models charge, making it especially attractive for high-volume production workloads. The 2M token context window is the largest in our dataset — useful for processing complete codebases, lengthy legal documents, or long research corpora. Who should look elsewhere: if safety calibration is a first-order requirement (Grok 4.20 scored 1/5), seek alternatives with higher scores. If math/coding benchmark performance is a deciding factor, note that Grok 4.20 has no external benchmark data in our payload — models like GPT-5 (MATH Level 5: 98.1) or Gemini 3 Flash Preview (AIME 2025: 92.8, per Epoch AI) offer verified external scores.
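Since the tool-calling score is the headline strength for agentic pipelines, here is a minimal sketch of the tool-definition payload such a pipeline would send through the OpenAI-compatible endpoint shown in Try It. The get_weather tool and the prompt are hypothetical illustrations; the tools/tool_choice fields follow the standard chat-completions request shape:

```python
import json

# Hypothetical tool definition in the standard chat-completions format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not a real API
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# The request body the client from Try It would serialize and send.
request_body = {
    "model": "x-ai/grok-4.20",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(request_body, indent=2))
```

In a real pipeline you would pass `tools=tools` and `tool_choice="auto"` to `client.chat.completions.create(...)` and inspect `response.choices[0].message.tool_calls` for any requested invocations.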
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.