xAI
Grok 3 Mini
Grok 3 Mini is xAI's lightweight, reasoning-first text model designed for fast, logic-oriented tasks that need long-context recall. It sits below xAI's larger Grok siblings (Grok 3 and Grok 4.20) as a lower-cost, lower-capacity option that exposes raw thinking traces. Compared with its bracket peers, it trades top-tier average benchmark scores for a much lower output cost ($0.50 per MTok) while excelling at long-context retrieval, tool calling, and faithfulness.
Performance
Summary of our 12 benchmark scores: long context 5, tool calling 5, faithfulness 5, persona consistency 5, structured output 4, classification 4, constrained rewriting 4, multilingual 4, creative problem solving 3, agentic planning 3, strategic analysis 3, safety calibration 2. Top strengths:
- Long-context retrieval (score 5, tied for 1st with 36 other models of 55 tested): excellent for tasks requiring accurate retrieval at 30K+ tokens.
- Tool calling (score 5, tied for 1st with 16 other models of 54 tested): reliable function selection, argument formation, and sequencing in our tests.
- Faithfulness (score 5, tied for 1st): resists hallucination and sticks to source material in our benchmarks.

Where it ranks overall: Grok 3 Mini places 31st of 52 tested models. It shares top-tier scores on several dimensions (persona consistency, faithfulness, tool calling, long context, classification) but has no reported average bench grade.

Notable weaknesses: safety calibration is low (score 2; rank 12 of 55, a score shared by 20 models), and agentic planning and strategic analysis are middling (score 3). In practice, Grok 3 Mini is strong at faithful, long-context, tool-backed workloads but less reliable at nuanced refusal behavior and advanced multi-step planning.
Pricing
Costs per MTok (million tokens): input $0.30, output $0.50. Practical examples (combined input+output cost, assuming equal input and output volumes):
- 1 MTok in + 1 MTok out = $0.80 total
- 10 MTok in + 10 MTok out = $8.00 total
- 100 MTok in + 100 MTok out = $80.00 total

If your workload is output-heavy (e.g., 1 MTok input + 5 MTok output), expect $0.30 + $2.50 = $2.80 for that set. Among its bracket peers, Grok 3 Mini's $0.50 output cost is far below high-cost peers such as Claude Opus 4.6 ($25 output) or GPT-5.4 ($15), and slightly above the cheapest peers (Gemma 4 31B and DeepSeek V3.2, both at $0.38). Use Grok 3 Mini when long-context and tool workflows dominate and you need predictable, low per-MTok pricing.
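The arithmetic above can be sketched as a small helper, with the listed rates hard-coded; the function name is ours, not part of any API:

```python
# Rates from the listed pricing (USD per million tokens).
INPUT_RATE = 0.30
OUTPUT_RATE = 0.50

def cost_usd(input_mtok: float, output_mtok: float) -> float:
    """Total USD cost for the given volumes, in millions of tokens."""
    return input_mtok * INPUT_RATE + output_mtok * OUTPUT_RATE

print(cost_usd(1, 1))    # 1 MTok in + 1 MTok out: $0.80
print(cost_usd(10, 10))  # $8.00
print(cost_usd(1, 5))    # output-heavy example: $2.80
```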
[Chart: Pricing vs Performance, plotting output cost per million tokens (log scale) against average score across our 12 internal benchmarks]
Try It
from openai import OpenAI

# Grok 3 Mini is available through OpenRouter's OpenAI-compatible API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="x-ai/grok-3-mini",
    messages=[
        {"role": "user", "content": "Hello, Grok 3 Mini!"},
    ],
)

print(response.choices[0].message.content)

Recommendation
Use Grok 3 Mini if you need:
- Production-grade retrieval over very long documents (long context 5, tied for 1st).
- Cost-efficient tool integration where accurate function selection matters (tool calling 5). Example: orchestration layer that chooses APIs and formats arguments.
- Workflows that require strict adherence to source material or persona (faithfulness 5; persona consistency 5). Example: automated summarization of legal excerpts where faithfulness and consistent tone matter.

Avoid Grok 3 Mini if you need:
- Strong safety calibration and fine-grained refusal behavior for user-facing moderation-sensitive flows (safety calibration 2).
- Heavy agentic planning or deep strategic analysis (agentic planning 3; strategic analysis 3); choose a higher-ranked peer for complex multi-step planning.

Because Grok 3 Mini has no reported average bench grade, prefer it when its measured strengths (long context, tool calling, faithfulness) match your task and you want lower per-MTok costs.
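To illustrate the tool-integration use case above, here is a minimal offline sketch of the orchestration pattern: an OpenAI-style tool schema plus a local dispatcher that routes a model-issued tool call to the matching function. The `get_weather` tool and the registry are hypothetical examples of ours; only the schema format follows the OpenAI tools convention that OpenRouter forwards to the model.

```python
import json

# Hypothetical local tool; name and signature are our own invention.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21}

# OpenAI-style tool schema, passed as `tools=` in the chat completion call.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Maps tool names the model may emit to local implementations.
REGISTRY = {"get_weather": get_weather}

def dispatch(tool_name: str, arguments_json: str) -> str:
    """Route a model-issued tool call to the matching local function
    and return the JSON result to send back as a role="tool" message."""
    fn = REGISTRY[tool_name]
    result = fn(**json.loads(arguments_json))
    return json.dumps(result)

# Simulated model output: a tool call with JSON-encoded arguments.
print(dispatch("get_weather", '{"city": "Austin"}'))
```

In a real loop you would read `response.choices[0].message.tool_calls`, dispatch each call as above, and append the results before re-invoking the model.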
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.