Codestral 2508 vs Grok 4
On the most common use cases of general reasoning, classification, and multilingual workloads, Grok 4 is the winner, taking 7 of our 12 benchmarks. Codestral 2508 wins where latency, structured outputs, and tool-focused coding workflows matter (structured_output, tool_calling, agentic_planning), and it costs far less: $0.30/$0.90 per MTok input/output vs Grok 4's $3.00/$15.00.
Codestral 2508 (Mistral)
Pricing: $0.30/MTok input, $0.90/MTok output

Grok 4 (xAI)
Pricing: $3.00/MTok input, $15.00/MTok output
Benchmark Analysis
Our 12-test comparison (scores 1–5) shows Grok 4 winning the majority: strategic_analysis 5 vs 2 (tied for 1st), constrained_rewriting 4 vs 3 (rank 6 of 53), creative_problem_solving 3 vs 2 (rank 30 of 54), classification 4 vs 3 (tied for 1st), safety_calibration 2 vs 1 (rank 12 of 55), persona_consistency 5 vs 3 (tied for 1st), and multilingual 5 vs 4 (tied for 1st).

Codestral 2508 wins structured_output 5 vs 4 (tied for 1st with 24 other models, indicating strong JSON/schema compliance), tool_calling 5 vs 4 (tied for 1st, meaning better function selection and argument accuracy in our tests), and agentic_planning 4 vs 3 (rank 16 of 54, useful for goal decomposition and recovery). Faithfulness and long_context are ties at 5/5, with both models tied for 1st on each.

What this means for real tasks: choose Grok 4 when you need top-tier strategic reasoning, robust classification and routing, better safety calibration, and multilingual parity; choose Codestral 2508 for schema-constrained outputs, reliable tool-calling sequences (code generation plus function calls), and lower per-token cost. The ranks (e.g., Codestral tied for 1st in tool_calling; Grok tied for 1st in strategic_analysis and classification) help translate score differences into expected production behavior rather than marginal aesthetic differences.
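The scores above reduce to a simple win tally; a minimal sketch (the model keys and dictionary layout are ours, the scores are from the comparison above):

```python
# Scores (1-5) from the 12-benchmark comparison above.
scores = {
    "strategic_analysis":       {"grok4": 5, "codestral": 2},
    "constrained_rewriting":    {"grok4": 4, "codestral": 3},
    "creative_problem_solving": {"grok4": 3, "codestral": 2},
    "classification":           {"grok4": 4, "codestral": 3},
    "safety_calibration":       {"grok4": 2, "codestral": 1},
    "persona_consistency":      {"grok4": 5, "codestral": 3},
    "multilingual":             {"grok4": 5, "codestral": 4},
    "structured_output":        {"grok4": 4, "codestral": 5},
    "tool_calling":             {"grok4": 4, "codestral": 5},
    "agentic_planning":         {"grok4": 3, "codestral": 4},
    "faithfulness":             {"grok4": 5, "codestral": 5},
    "long_context":             {"grok4": 5, "codestral": 5},
}

def tally(scores):
    """Count benchmark wins per model, and ties."""
    wins = {"grok4": 0, "codestral": 0, "tie": 0}
    for s in scores.values():
        if s["grok4"] > s["codestral"]:
            wins["grok4"] += 1
        elif s["codestral"] > s["grok4"]:
            wins["codestral"] += 1
        else:
            wins["tie"] += 1
    return wins

print(tally(scores))  # {'grok4': 7, 'codestral': 3, 'tie': 2}
```

Seven wins for Grok 4, three for Codestral 2508, and two ties, matching the headline numbers.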
Pricing Analysis
Codestral 2508 costs $0.30/MTok input and $0.90/MTok output ($1.20/MTok combined). Grok 4 costs $3.00/MTok input and $15.00/MTok output ($18.00/MTok combined). At 1B tokens/month (1,000 MTok) that is $1,200 for Codestral vs $18,000 for Grok, a $16,800 difference. At 10B tokens (10,000 MTok) it's $12,000 vs $180,000; at 100B tokens, $120,000 vs $1,800,000. Teams with sustained high-volume usage (platforms, indexers, large-scale agents) should care deeply about this gap; the cost delta can reshape ROI and product pricing. Buy Grok 4 when its benchmark advantages (strategic analysis, classification, multilingual, persona consistency) justify the 10–17x per-MTok price premium; choose Codestral 2508 when budget and high-frequency coding/tool flows dominate.
Real-World Cost Comparison
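The monthly figures in the pricing analysis are easy to reproduce; a minimal sketch (rates are from the pricing section; the equal input/output volume split is our simplifying assumption):

```python
def monthly_cost(mtok_in, mtok_out, in_rate, out_rate):
    """Monthly spend in dollars for volumes given in millions of tokens (MTok)."""
    return mtok_in * in_rate + mtok_out * out_rate

CODESTRAL_2508 = (0.30, 0.90)   # $/MTok input, $/MTok output
GROK_4 = (3.00, 15.00)

# 1,000 MTok of input and 1,000 MTok of output per month:
print(monthly_cost(1000, 1000, *CODESTRAL_2508))  # 1200.0
print(monthly_cost(1000, 1000, *GROK_4))          # 18000.0
```

Scaling both volumes by 10x or 100x reproduces the $12,000 vs $180,000 and $120,000 vs $1,800,000 figures; with a skewed input/output mix the gap shifts, since the output-rate ratio (about 16.7x) is steeper than the input-rate ratio (10x).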
Bottom Line
Choose Codestral 2508 if you run high-frequency, latency-sensitive coding workflows, need strict structured outputs or heavy tool calling, and want much lower cost ($1.20 combined/MTok). Choose Grok 4 if your priority is strategic analysis, classification, multilingual quality, persona consistency, or slightly better safety calibration, and you can absorb the higher price ($18.00 combined/MTok). If budget is the limiting factor at scale, pick Codestral 2508; if output quality on Grok's seven winning benchmarks materially improves your product metrics, pick Grok 4.
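The bottom-line guidance can be expressed as a routing heuristic; a minimal illustrative sketch (the model id strings and the function itself are ours, not part of either vendor's API):

```python
def pick_model(priorities, budget_limited=False):
    """Illustrative router: priorities is a set of benchmark names you care about."""
    # Benchmarks where Grok 4 outscored Codestral 2508 in our tests.
    grok_wins = {
        "strategic_analysis", "constrained_rewriting", "creative_problem_solving",
        "classification", "safety_calibration", "persona_consistency", "multilingual",
    }
    if budget_limited:
        return "codestral-2508"   # $1.20 vs $18.00 combined per MTok
    if set(priorities) & grok_wins:
        return "grok-4"
    return "codestral-2508"       # structured_output, tool_calling, agentic_planning

print(pick_model({"tool_calling"}))                        # codestral-2508
print(pick_model({"multilingual"}))                        # grok-4
print(pick_model({"strategic_analysis"}, budget_limited=True))  # codestral-2508
```

This is a decision sketch, not production routing logic; real deployments would weigh priorities rather than treat any Grok-won benchmark as decisive.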
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.