xAI

Grok 3

Grok 3 is xAI’s general-purpose flagship, positioned for enterprise workloads: data extraction, long-document analysis, structured pipelines, and multilingual applications. At $3/$15 per million tokens (input/output), it sits at the premium end of the market alongside Claude Sonnet 4.6, GPT-5.2, and GPT-5.4. Within xAI’s own lineup, Grok 3 is the mid-tier offering — above the $0.50 output Grok 3 Mini and Grok 4.1 Fast, and matching Grok 4 on price. Notably, in our benchmarks Grok 3 outscores its own newer sibling Grok 4 (4.25 average vs 4.08), making it the stronger choice among xAI models at this price point. Context window is 131,072 tokens, suitable for processing long documents or large codebases.
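
As a rough sketch of working within that window, the check below estimates whether a document fits before sending it. The 4-characters-per-token ratio is a common heuristic, not Grok 3's actual tokenizer, and the reserved-output figure is an illustrative assumption:

```python
# Rough pre-flight check against Grok 3's 131,072-token context window.
# The 4-chars-per-token ratio is a heuristic; real counts vary by tokenizer
# and language, so treat this as a coarse estimate only.
CONTEXT_WINDOW = 131_072

def fits_in_context(text: str, reserved_for_output: int = 4_096) -> bool:
    estimated_tokens = len(text) // 4  # crude estimate, not a real token count
    return estimated_tokens + reserved_for_output <= CONTEXT_WINDOW

print(fits_in_context("word " * 10_000))  # True: ~12,500 estimated tokens
```

For real workloads you would count tokens with the provider's tokenizer rather than a character heuristic, but this is enough to catch obviously oversized inputs early.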

Performance

In our 12-test benchmark suite (scored 1–5), Grok 3 averages 4.25, ranking 15th of 52 tested models. Its top-performing categories are strategic analysis, faithfulness, long context, structured output, multilingual, persona consistency, and agentic planning — all scoring 5/5. These scores place it tied for first with multiple other top models on each dimension (e.g., tied for 1st with 14 others on agentic planning, tied with 32 others on faithfulness, tied with 36 others on long context). On tool calling and classification it scores 4/5, in the upper half of tested models. Its two relative weaknesses: constrained rewriting (3/5, rank 31 of 53) and creative problem solving (3/5, rank 30 of 54), both in the lower half of the distribution. Safety calibration is 2/5 (rank 12 of 55, shared by 20 models) — above the median for this notably weak benchmark category across the field. The model has a 131K context window, which supports the long-context score. No external benchmark data (SWE-bench Verified, MATH Level 5, AIME 2025) is available for Grok 3 in our dataset.
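
The 4.25 overall figure is just the unweighted mean of the twelve category scores; as a quick sketch, reproducing it from the scores reported below:

```python
# Grok 3's twelve per-category benchmark scores (1-5 scale), as listed
# in this review. The overall average is the unweighted mean.
scores = {
    "strategic analysis": 5, "faithfulness": 5, "long context": 5,
    "structured output": 5, "multilingual": 5, "persona consistency": 5,
    "agentic planning": 5, "tool calling": 4, "classification": 4,
    "constrained rewriting": 3, "creative problem solving": 3,
    "safety calibration": 2,
}

average = sum(scores.values()) / len(scores)
print(average)  # 4.25
```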

Pricing

Grok 3 costs $3.00 per million input tokens and $15.00 per million output tokens. For reference, a 1,000-word response is roughly 1,330 output tokens (at the common estimate of about 0.75 words per token), meaning 1,000 such responses cost about $20 in output alone. At moderate API volume (10M output tokens/month), you’re spending $150. At 100M output tokens/month, $1,500. That places Grok 3 at the same output-cost tier as Claude Sonnet 4.6 ($15 output) and GPT-5.4 ($15 output), both of which score higher on average (4.67 and 4.58 respectively in our testing). Within xAI’s lineup, Grok 4 is the same price but scores lower (4.08 avg). If cost efficiency matters, Grok 4.20 delivers 4.33 avg at $6/M output, a meaningful step down in price for a modest score reduction. Grok 4.1 Fast cuts further to $0.50/M output at 4.25 avg, matching Grok 3’s benchmark score at a fraction of the price.
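
The monthly figures above follow directly from the per-token rates; a minimal cost estimator with Grok 3's list prices hard-coded:

```python
# Estimate monthly spend at Grok 3's list prices:
# $3.00 per million input tokens, $15.00 per million output tokens.
def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD spend for a month's token volume at list prices."""
    return (input_tokens / 1e6) * 3.00 + (output_tokens / 1e6) * 15.00

print(monthly_cost(0, 10_000_000))   # 150.0  (output-only, moderate volume)
print(monthly_cost(0, 100_000_000))  # 1500.0 (output-only, high volume)
```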

Overall
4.25/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $3.00/MTok
Output: $15.00/MTok
Context Window: 131K

modelpicker.net

Real-World Costs

Chat response: $0.0081
Blog post: $0.032
Document batch: $0.810
Pipeline run: $8.10
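
For a sense of where these numbers come from: the $0.0081 chat-response figure above is consistent with, for example, 200 input and 500 output tokens at Grok 3's rates. The token counts here are illustrative assumptions, not the scenario definitions used on the site:

```python
# Per-request cost at Grok 3's list prices. The token counts below are
# hypothetical; they are one combination that reproduces the chat figure.
def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * 3.00 + (output_tokens / 1e6) * 15.00

# Assumed chat response: 200 input + 500 output tokens
print(round(request_cost(200, 500), 4))  # 0.0081
```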

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks

Try It

from openai import OpenAI

# Grok 3 is available through OpenRouter's OpenAI-compatible API;
# replace the placeholder with your own key before running.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="x-ai/grok-3",  # OpenRouter model ID for Grok 3
    messages=[
        {"role": "user", "content": "Hello, Grok 3!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

Grok 3 is a strong choice for developers building agentic pipelines, structured-output extraction systems, or multilingual applications where reliability and faithfulness to source material are critical. Its 5/5 scores on agentic planning, faithfulness, structured output, and multilingual make it a credible option for production workflows in these areas. However, at $15/M output, the value case is harder to make: Claude Sonnet 4.6 (4.67 avg) and GPT-5.4 (4.58 avg) score higher at the same price, and within xAI’s own lineup, Grok 4.1 Fast matches Grok 3’s benchmark average (4.25) at $0.50/M output, 30x cheaper. For users specifically committed to the xAI ecosystem or the Grok API, Grok 3 outperforms Grok 4 at this price tier. For creative tasks requiring flexible writing or constrained rewriting, look elsewhere: scores of 3/5 in those categories mean there are cheaper and better options.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions