xAI
Grok 3
Grok 3 is xAI’s general-purpose flagship, positioned for enterprise workloads: data extraction, long-document analysis, structured pipelines, and multilingual applications. At $3/$15 per million tokens (input/output), it sits at the premium end of the market alongside Claude Sonnet 4.6, GPT-5.2, and GPT-5.4. Within xAI’s own lineup, Grok 3 is the mid-tier offering: above the $0.50-output Grok 3 Mini and Grok 4.1 Fast, and matching Grok 4 on price. Notably, in our benchmarks Grok 3 outscores its newer sibling Grok 4 (4.25 average vs. 4.08), making it the stronger choice among xAI models at this price point. Its context window is 131,072 tokens, enough for long documents or large codebases.
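For long-document work, it helps to sanity-check that a document fits the 131,072-token window before sending it. The sketch below uses the common ~4-characters-per-token heuristic, which is an approximation; exact counts require the model's own tokenizer.

```python
# Rough token-budget check for Grok 3's 131,072-token context window.
# The 4-chars-per-token ratio is a heuristic, not Grok's actual tokenizer.
CONTEXT_WINDOW = 131_072

def fits_in_context(text: str, reserved_for_output: int = 4_096) -> bool:
    """Estimate whether `text` plus an output budget fits in the window."""
    estimated_tokens = len(text) // 4  # heuristic: ~4 characters per token
    return estimated_tokens + reserved_for_output <= CONTEXT_WINDOW

# A 200,000-character document (~50K estimated tokens) fits comfortably:
print(fits_in_context("x" * 200_000))  # True
```

For production pipelines, swap the heuristic for a real tokenizer count before relying on this check.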
Performance
In our 12-test benchmark suite (scored 1–5), Grok 3 averages 4.25, ranking 15th of 52 tested models. Its top categories, all scoring 5/5, are strategic analysis, faithfulness, long context, structured output, multilingual, persona consistency, and agentic planning. These scores place it tied for first with multiple other top models on each dimension (e.g., tied for 1st with 14 others on agentic planning, with 32 others on faithfulness, and with 36 others on long context). On tool calling and classification it scores 4/5, in the upper half of tested models. Its two relative weaknesses are constrained rewriting (3/5, rank 31 of 53) and creative problem solving (3/5, rank 30 of 54), both in the lower half of the distribution. Safety calibration is 2/5 (rank 12 of 55, a score shared by 20 models), which is still above the median in this notably weak category across the field. The 131K context window supports the long-context score. No external benchmark data (SWE-bench Verified, MATH Level 5, AIME 2025) is available for Grok 3 in our dataset.
Pricing
Grok 3 costs $3.00 per million input tokens and $15.00 per million output tokens. For reference, a 1,000-word response is roughly 750 output tokens, so 1,000 such responses cost about $11.25 in output alone. At moderate API volume (10M output tokens/month) you’re spending $150; at 100M output tokens/month, $1,500. That places Grok 3 in the same output-cost tier as Claude Sonnet 4.6 ($15 output) and GPT-5.4 ($15 output), both of which score higher on average (4.67 and 4.58 respectively in our testing). Within xAI’s lineup, Grok 4 costs the same but scores lower (4.08 avg). If cost efficiency matters, Grok 4.20 delivers a 4.33 avg at $6/M output, a meaningful price cut for a modest score reduction. Grok 4.1 Fast cuts further to $0.50/M output at a 4.25 avg, matching Grok 3’s benchmark score at a fraction of the price.
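The arithmetic above is easy to reproduce for your own traffic mix. A minimal cost estimator using Grok 3's published rates ($3/M input, $15/M output):

```python
# Back-of-the-envelope API cost calculator for Grok 3.
# Rates are USD per million tokens, as published at the time of writing.
INPUT_RATE = 3.00
OUTPUT_RATE = 15.00

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for a given number of input/output tokens."""
    return (input_tokens / 1e6) * INPUT_RATE + (output_tokens / 1e6) * OUTPUT_RATE

# 1,000 responses of ~750 output tokens each (input cost ignored here):
print(round(cost_usd(0, 1_000 * 750), 2))  # 11.25

# 10M output tokens per month:
print(cost_usd(0, 10_000_000))  # 150.0
```

Plug in your actual input:output ratio; input tokens often dominate in long-document workloads even at the lower $3/M rate.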
[Chart: Pricing vs Performance, plotting output cost per million tokens (log scale) against average score across our 12 internal benchmarks]
Try It
from openai import OpenAI

# Grok 3 via OpenRouter's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="x-ai/grok-3",
    messages=[
        {"role": "user", "content": "Hello, Grok 3!"}
    ],
)
print(response.choices[0].message.content)

Recommendation
Grok 3 is a strong choice for developers building agentic pipelines, structured-output extraction systems, or multilingual applications where reliability and faithfulness to source material are critical. Its 5/5 scores on agentic planning, faithfulness, structured output, and multilingual make it a credible option for production workflows in these areas. At $15/M output, however, the value case is harder to make: Claude Sonnet 4.6 and GPT-5.2 score higher (4.67 avg) at the same price, and within xAI’s own lineup, Grok 4.1 Fast matches Grok 3’s benchmark average (4.25) at $0.50/M output, 30x cheaper. For users committed to the xAI ecosystem or the Grok API, Grok 3 is the better performer than Grok 4 at this price tier. For creative tasks requiring flexible writing or constrained rewriting, look elsewhere: 3/5 scores in those categories mean there are cheaper and better options.
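For the structured-output extraction use case, the usual pattern is to pin the schema in the system prompt and validate the reply before trusting it. The sketch below is illustrative: the field names, schema hint, and sample reply are our own assumptions, not real Grok 3 output.

```python
import json

# Hypothetical invoice-extraction schema (illustrative field names).
SCHEMA_HINT = (
    "Return ONLY a JSON object with keys: name (string), "
    "date (YYYY-MM-DD string), amount (number)."
)

def build_messages(document: str) -> list[dict]:
    """Assemble a chat request that asks the model for schema-bound JSON."""
    return [
        {"role": "system", "content": SCHEMA_HINT},
        {"role": "user", "content": f"Extract the invoice fields from:\n{document}"},
    ]

def parse_reply(reply: str) -> dict:
    """Validate the model's reply: must be JSON with all required keys."""
    data = json.loads(reply)  # raises on non-JSON output
    missing = {"name", "date", "amount"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

# Validate a hand-written sample reply (not real model output):
print(parse_reply('{"name": "Acme", "date": "2025-01-15", "amount": 99.5}'))
```

Passing `build_messages(...)` to the `client.chat.completions.create` call shown above and running the reply through `parse_reply` gives a cheap guardrail; retry the request when validation fails.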
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.