xai

Grok 4

Grok 4 is xAI's reasoning model featuring a 256,000-token context window, multimodal input (text, image, and file), parallel tool calling, and structured output support. It sits at the top of xAI's lineup, above Grok 3, Grok 4.20, and Grok 4.1 Fast, and is designed for tasks that require sustained reasoning over long documents, rigorous analysis, and high-fidelity output.

In the broader market, Grok 4's $15/M output-token price puts it directly alongside Claude Sonnet 4.6 ($15/M output) and GPT-5.4 ($15/M output), both of which score higher on our overall benchmarks (4.67 and 4.58 average respectively, versus Grok 4's 4.08, rank 27 of 52 overall). It is not the top scorer at this price point, which makes the pricing case harder to make on raw averages alone. Where Grok 4 does distinguish itself is in specific high-value capabilities: strategic analysis, faithfulness, and multilingual output, each scoring 5/5 in our testing.

Performance

Grok 4 earns an overall grade of 4.08/5 ("Strong") in our suite, ranking 27 of 52, and its individual test scores paint a clear picture of where it excels and where it falls short.

Top strengths:

  1. Strategic analysis (5/5): Tied for 1st (a 26-way tie) among the 54 models tested in our nuanced tradeoff reasoning tasks. This is Grok 4's most distinctive result: it handles scenarios requiring real-number reasoning and multi-factor tradeoffs at the highest level we measure.

  2. Faithfulness (5/5): Tied for 1st (a 33-way tie) among the 55 models tested. In our testing, Grok 4 sticks closely to source material without hallucinating, a critical property for document summarization, legal review, and any task where accuracy to source content matters.

  3. Multilingual (5/5): Tied for 1st (a 35-way tie) among the 55 models tested. Grok 4 produces equivalent-quality output across non-English languages in our benchmark, making it suitable for global deployment scenarios.

Long context (5/5, tied 1st of 55) and persona consistency (5/5, tied 1st of 53) round out strong scores, with structured output (4/5), tool calling (4/5), classification (4/5), and constrained rewriting (4/5) all at or above the field median.

Notable weaknesses:

  • Agentic planning (3/5): Ranked 42 of 54 in goal decomposition and failure recovery — a significant gap relative to its other scores. Models intended for autonomous multi-step workflows should be evaluated carefully here.
  • Creative problem solving (3/5): Ranked 30 of 54. Non-obvious ideation is not a strength.
  • Safety calibration (2/5): Grok 4's lowest score. A 2/5 places it in the bottom quartile of the 55 tested models for correctly refusing harmful requests while permitting legitimate ones.

Overall rank is 27 of 52, which reflects how the weaker scores on agentic planning and safety pull down an otherwise strong top-end performance profile.

Pricing

Grok 4 costs $3.00 per million input tokens and $15.00 per million output tokens via the API.

At typical developer usage volumes, that translates to roughly:

  • Light use (500K tokens/month, 80% input): ~$2.70/month
  • Moderate use (5M tokens/month, 70% input): ~$33.00/month
  • Heavy use (50M tokens/month, 60% input): ~$390/month
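These estimates fall straight out of the per-token rates. A minimal sketch of the arithmetic (the `monthly_cost` helper is illustrative, not part of any SDK; the tier volumes and input/output splits are the ones above):

```python
# Blended monthly cost from total token volume and the input/output split.
# Rates from this review: $3.00 per million input, $15.00 per million output.
INPUT_RATE = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000

def monthly_cost(total_tokens: int, input_share: float) -> float:
    """Dollar cost for a month of usage, given the fraction of tokens that are input."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

print(round(monthly_cost(500_000, 0.80), 2))     # light    → 2.7
print(round(monthly_cost(5_000_000, 0.70), 2))   # moderate → 33.0
print(round(monthly_cost(50_000_000, 0.60), 2))  # heavy    → 390.0
```

Note how input-heavy workloads stay cheap: at an 80% input split, most tokens are billed at the $3/M rate rather than the $15/M rate.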

The output cost of $15/M is shared with Claude Sonnet 4.6 and GPT-5.4, both of which score higher on our average benchmark. Claude Opus 4.6 is more expensive at $25/M output. On the cheaper end within this performance bracket, Grok 4.20 (a sibling model from xAI) costs only $6/M output and scores 4.33 average — substantially less per token if you don't need Grok 4's specific strengths.

Grok 4 uses reasoning tokens, which means complex tasks will consume additional tokens beyond the visible prompt and completion. Developers should budget for this overhead, particularly on multi-step reasoning or agentic workflows, as actual costs can exceed naive estimates based on prompt length alone.
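To see the overhead concretely: on OpenAI-compatible APIs, billed completion tokens generally include any hidden reasoning tokens, so the `usage` object on a response, not the visible answer length, is what to meter. A sketch of the accounting (the `call_cost` helper and the specific token counts are illustrative assumptions, not measured figures):

```python
# Cost of a single call from the usage numbers the API reports.
# On reasoning models, billed completion tokens typically include hidden
# reasoning tokens, so billed output can far exceed the visible answer.
def call_cost(prompt_tokens: int, completion_tokens: int,
              input_rate: float = 3.00, output_rate: float = 15.00) -> float:
    """Dollar cost of one call; rates are per million tokens."""
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000

# Illustrative: a 1,200-token prompt and a 300-token visible answer, plus an
# assumed 2,000 reasoning tokens billed alongside the completion.
naive = call_cost(1_200, 300)           # estimate from visible text only
actual = call_cost(1_200, 300 + 2_000)  # what the usage object would report
print(f"${naive:.4f} naive vs ${actual:.4f} actual")  # $0.0081 vs $0.0381
```

Under these assumptions the real cost is roughly 4.7x the naive estimate, which is why budgeting from prompt length alone is risky on reasoning-heavy workloads.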

Overall: 4.08/5 (Strong)

Benchmark Scores

  • Faithfulness: 5/5
  • Long Context: 5/5
  • Multilingual: 5/5
  • Tool Calling: 4/5
  • Classification: 4/5
  • Agentic Planning: 3/5
  • Structured Output: 4/5
  • Safety Calibration: 2/5
  • Strategic Analysis: 5/5
  • Persona Consistency: 5/5
  • Constrained Rewriting: 4/5
  • Creative Problem Solving: 3/5

External Benchmarks

  • SWE-bench Verified: N/A
  • MATH Level 5: N/A
  • AIME 2025: N/A

Pricing

  • Input: $3.00/MTok
  • Output: $15.00/MTok
  • Context Window: 256K


Real-World Costs

  • Chat response: $0.0081
  • Blog post: $0.032
  • Document batch: $0.810
  • Pipeline run: $8.10

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# Grok 4 is served through OpenAI-compatible endpoints; this example
# routes through OpenRouter.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # replace with your own key
)

response = client.chat.completions.create(
    model="x-ai/grok-4",  # OpenRouter model slug for Grok 4
    messages=[
        {"role": "user", "content": "Hello, Grok 4!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

Use Grok 4 if:

  • Your primary tasks involve deep strategic or analytical reasoning — Grok 4 scores 5/5 on strategic analysis in our testing, making it a strong fit for business intelligence, research synthesis, and scenario modeling.
  • You need faithful document processing at scale. A 5/5 faithfulness score and a 256K context window together make Grok 4 well-suited for legal document review, contract analysis, and long-form summarization where hallucination is unacceptable.
  • You're building multilingual applications. The 5/5 multilingual score puts it among the top performers for non-English output quality.
  • You're already working within the xAI ecosystem and want the highest-capability model in the lineup.

Look elsewhere if:

  • You need strong agentic behavior. Grok 4's 3/5 agentic planning score (rank 42 of 54) is a real liability for autonomous task execution. Claude Sonnet 4.6 (avg 4.67, same $15/M output cost) or GPT-5.4 (avg 4.58, same $15/M output) are better choices for agent-heavy workloads.
  • Safety calibration matters for your deployment. A 2/5 safety calibration score is among the lower results in our test suite — this is a meaningful concern for consumer-facing or regulated applications.
  • You want maximum value per dollar at this capability tier. Grok 4.20 at $6/M output scores 4.33 average across our benchmarks; if Grok 4's specific 5/5 strengths don't map directly to your use case, the sibling model offers substantially lower cost.
  • Creative brainstorming or idea generation is central to your workflow. A 3/5 creative problem solving score (rank 30 of 54) means better options are available at lower prices.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions