xAI
Grok 4.1 Fast
Grok 4.1 Fast is xAI's high-performance agentic model, positioned for real-world production use cases such as customer support and deep research workflows. With a 2 million token context window and multimodal input support (text, image, and file), it operates a tier above Grok 3 Mini while remaining far cheaper than flagship models. At $0.20/MTok input and $0.50/MTok output, it competes directly with DeepSeek V3.2 ($0.38/MTok output, avg score 4.25) at the budget end of the high-capability tier. Reasoning can be enabled or disabled, and raw thinking traces are accessible via the include_reasoning parameter, giving developers control over the latency-versus-transparency tradeoff. It ranks 15th out of 52 models overall in our benchmark suite.
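The reasoning toggle described above can be sketched as a small request builder. Note the `extra_body` fields (`reasoning`, `include_reasoning`) are assumptions based on OpenRouter's provider pass-through conventions; verify the exact field names against current OpenRouter documentation before relying on them.

```python
def build_request(prompt: str, reasoning: bool, show_traces: bool) -> dict:
    """Assemble kwargs for client.chat.completions.create(**build_request(...))."""
    return {
        "model": "x-ai/grok-4.1-fast",
        "messages": [{"role": "user", "content": prompt}],
        # extra_body passes provider-specific fields through the OpenAI SDK;
        # field names below are assumed, not confirmed by this review.
        "extra_body": {
            "reasoning": {"enabled": reasoning},  # disable for lower latency
            "include_reasoning": show_traces,     # surface raw thinking traces
        },
    }

# Low-latency support-bot call: reasoning off, no traces returned.
fast = build_request("Summarize this support ticket.", reasoning=False, show_traces=False)

# Deep-research call: reasoning on, traces included for auditability.
deep = build_request("Compare these two reports.", reasoning=True, show_traces=True)
```

Keeping the toggle in one helper makes it easy to A/B the latency cost of reasoning per workload.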
Performance
In our 12-test benchmark suite, Grok 4.1 Fast ranks 15th out of 52 models with an average score of 4.25. Top strengths: strategic analysis (5/5), structured output (5/5, tied for 1st with 24 other models out of 54), faithfulness (5/5, tied for 1st with 32 others out of 55), multilingual (5/5), long context (5/5, tied for 1st with 36 others out of 55), and persona consistency (5/5, tied for 1st with 36 others out of 53). This sweep of top scores across diverse task types is notable at this price point. Tool calling (4/5) and agentic planning (4/5) both score above the median. The clear weakness is safety calibration at 1/5 (rank 32 of 55), the lowest score in this model's profile: it is considerably more permissive than most tested models. Constrained rewriting (4/5, rank 6 of 53) and classification (4/5, tied for 1st with 29 others) round out a strong profile.
Pricing
Grok 4.1 Fast costs $0.20 per million input tokens and $0.50 per million output tokens. At 10 million output tokens per month, output cost is $5; at 100 million, $50. The $0.20/MTok input rate is particularly competitive for context-heavy workloads like RAG pipelines or long-document processing, and the 2M token context window enables very large inputs without chunking overhead. Compared to bracket peers: DeepSeek V3.2 is slightly cheaper at $0.38/MTok output with the same 4.25 average score, while Grok 3 Mini matches the $0.50/MTok output rate but with lower benchmark coverage. Models with comparable average scores, such as GPT-5.4 Nano ($1.25/MTok output) and Mistral Medium 3.1 ($2/MTok output), cost 2.5x to 4x more per output token.
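The arithmetic above generalizes to a one-line estimator, with the list prices quoted in this review as defaults:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_rate: float = 0.20, out_rate: float = 0.50) -> float:
    """Monthly spend in dollars; token volumes in millions, rates in $/MTok."""
    return input_mtok * in_rate + output_mtok * out_rate

print(monthly_cost(0, 10))    # 10M output tokens -> 5.0
print(monthly_cost(0, 100))   # 100M output tokens -> 50.0
print(monthly_cost(500, 50))  # heavy RAG month: 500M in + 50M out -> 125.0
```

The third call illustrates why the low input rate matters: in input-dominated workloads, input tokens drive most of the bill.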
[Interactive charts: Benchmark Scores, External Benchmarks, Real-World Costs, and Pricing vs Performance (output cost per million tokens, log scale, vs average score across our 12 internal benchmarks). Pricing panel: $0.200/MTok input, $0.500/MTok output.]
Try It
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="x-ai/grok-4.1-fast",
    messages=[
        {"role": "user", "content": "Hello, Grok 4.1 Fast!"}
    ],
)
print(response.choices[0].message.content)

Recommendation
Grok 4.1 Fast is an excellent choice for developers building document analysis pipelines, multilingual applications, and structured data extraction workflows. Its 5/5 faithfulness score makes it reliable for RAG, while its 5/5 structured output and 4/5 tool calling scores support function-calling pipelines. The 2M context window is a genuine differentiator for applications processing very long documents or conversation histories. Developers building customer support bots with persona requirements will benefit from the 5/5 persona consistency score. However, the 1/5 safety calibration score is a significant concern for consumer-facing or regulated applications: this model is among the most permissive in our test set and should not be deployed without external safety layers in contexts where inappropriate content generation would be a risk. For a cost-equivalent alternative, consider DeepSeek V3.2 at $0.38/MTok output (same average score, though check its safety-calibration result before assuming a better profile).
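A minimal sketch of such an external safety layer, assuming a naive keyword blocklist purely for illustration; a real deployment would use a dedicated moderation model or API rather than string matching:

```python
# Hypothetical pre-filter in front of Grok 4.1 Fast, compensating for its
# 1/5 safety-calibration score. Blocklist terms are illustrative only.
BLOCKED_TERMS = ("build a weapon", "self-harm", "credit card dump")

def guarded_prompt(prompt: str):
    """Return the prompt unchanged if it passes the filter, else None."""
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return None  # route to a canned refusal or human review, not the model
    return prompt

# Only prompts that survive the filter are forwarded to the model.
safe = guarded_prompt("Summarize my invoice")       # passes through
blocked = guarded_prompt("How do I build a weapon?")  # filtered out
```

The key design point is that filtering happens before the model call, so the permissive model never sees disallowed input.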
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.