Codestral 2508 vs Grok 4

On the most common use cases—general reasoning, classification, and multilingual workloads—Grok 4 is the winner, taking 7 of 12 benchmarks in our tests. Codestral 2508 wins where latency, structured outputs, and tool-focused coding workflows matter (structured_output, tool_calling, agentic_planning), and it costs far less: $0.30/$0.90 per MTok (input/output) vs Grok's $3/$15.

Mistral

Codestral 2508

Overall
3.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
4/5
Tool Calling
5/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.300/MTok

Output

$0.900/MTok

Context Window: 256K

modelpicker.net

xAI

Grok 4

Overall
4.08/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
3/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window: 256K


Benchmark Analysis

Our 12-test comparison (each test scored 1–5) shows Grok 4 winning the majority. Grok 4 leads on strategic_analysis (5 vs 2, tied for 1st overall), persona_consistency (5 vs 3, tied for 1st), multilingual (5 vs 4, tied for 1st), classification (4 vs 3, tied for 1st), constrained_rewriting (4 vs 3, rank 6 of 53), creative_problem_solving (3 vs 2, rank 30 of 54), and safety_calibration (2 vs 1, rank 12 of 55). Codestral 2508 leads on structured_output (5 vs 4, tied for 1st with 24 other models, indicating strong JSON/schema compliance), tool_calling (5 vs 4, tied for 1st, meaning better function selection and argument accuracy in our tests), and agentic_planning (4 vs 3, rank 16 of 54, useful for goal decomposition and recovery). Faithfulness and long_context are 5/5 ties, with both models tied for 1st on each. What this means for real tasks: choose Grok 4 when you need top-tier strategic reasoning, robust classification/routing, better safety calibration, and multilingual parity; choose Codestral 2508 for schema-constrained outputs, reliable tool-calling sequences (code generation plus function calls), and lower per-token cost. The ranks help translate score differences into expected production behavior rather than marginal aesthetic differences.
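Structured-output compliance of the kind these scores reward can be checked mechanically. A minimal sketch of such a check (the `conforms` helper and the schema below are illustrative, not part of our actual harness, and this is a type-presence check rather than full JSON Schema validation):

```python
import json

def conforms(raw: str, required: dict) -> bool:
    """Return True if a model reply is valid JSON and every required
    key is present with the expected Python type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(isinstance(obj.get(k), t) for k, t in required.items())

# Hypothetical classification schema: a string label plus a float confidence.
schema = {"label": str, "confidence": float}

print(conforms('{"label": "bug", "confidence": 0.92}', schema))  # True
print(conforms('{"label": "bug"}', schema))                      # False
```

A harness like this turns "structured output quality" into a pass rate over many prompts, which is closer to how the scores above should be read.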

Benchmark | Codestral 2508 | Grok 4
Faithfulness | 5/5 | 5/5
Long Context | 5/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 5/5 | 4/5
Classification | 3/5 | 4/5
Agentic Planning | 4/5 | 3/5
Structured Output | 5/5 | 4/5
Safety Calibration | 1/5 | 2/5
Strategic Analysis | 2/5 | 5/5
Persona Consistency | 3/5 | 5/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 3/5
Summary | 3 wins | 7 wins

Pricing Analysis

Codestral 2508 costs $0.30 input and $0.90 output per MTok (combined $1.20/MTok). Grok 4 costs $3.00 input and $15.00 output per MTok (combined $18.00/MTok). At 1M tokens of input plus 1M of output per month, that is $1.20 for Codestral vs $18.00 for Grok. At 100M tokens each, it's $120 vs $1,800. At 1B tokens each, it's $1,200 vs $18,000, a $16,800 difference. Teams with sustained high-volume usage (platforms, indexers, large-scale agents) should care deeply about this gap; the cost delta can reshape ROI and product pricing. Buy Grok 4 when its benchmark advantages (strategic analysis, classification, multilingual, persona consistency) justify a 10x–17x per-MTok price premium; choose Codestral 2508 when budget and high-frequency coding/tool flows dominate.
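The arithmetic above can be sketched as a small calculator. The model names and price table below simply restate the per-MTok figures quoted in this comparison:

```python
# USD per million tokens (MTok), as quoted in the pricing cards above.
PRICES = {
    "codestral-2508": {"input": 0.30, "output": 0.90},
    "grok-4": {"input": 3.00, "output": 15.00},
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a month's usage at the quoted rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 1M input + 1M output tokens/month: ~$1.20 for Codestral, ~$18.00 for Grok.
print(monthly_cost("codestral-2508", 1_000_000, 1_000_000))
print(monthly_cost("grok-4", 1_000_000, 1_000_000))
```

Scaling both token counts by 1,000 (to 1B each) scales the bill linearly, giving the $1,200 vs $18,000 figures above.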

Real-World Cost Comparison

Task | Codestral 2508 | Grok 4
Chat response | <$0.001 | $0.0081
Blog post | $0.0020 | $0.032
Document batch | $0.051 | $0.810
Pipeline run | $0.510 | $8.10
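Per-task figures like these fall out of assumed token counts. The split below (200 input / 500 output tokens for a chat response) is an assumption for illustration, chosen as one plausible mix consistent with the table, not a published measurement:

```python
def task_cost_usd(in_tok: int, out_tok: int,
                  in_price: float, out_price: float) -> float:
    """Cost of one task given token counts and per-MTok prices in USD."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# Assumed chat response: ~200 input tokens, ~500 output tokens.
print(round(task_cost_usd(200, 500, 3.00, 15.00), 4))  # Grok 4: 0.0081
print(task_cost_usd(200, 500, 0.30, 0.90) < 0.001)     # Codestral: True
```

The same formula with larger token counts reproduces the batch and pipeline rows; your real ratios depend on your own prompt and completion lengths.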

Bottom Line

Choose Codestral 2508 if you run high-frequency, latency-sensitive coding workflows, need reliable structured outputs or heavy tool calling, and want much lower cost ($1.20 combined/MTok). Choose Grok 4 if your priority is strategic analysis, classification, multilingual quality, persona consistency, or slightly better safety calibration, and you can absorb the higher price ($18.00 combined/MTok). If budget is the limiting factor at scale, pick Codestral 2508; if output quality on those seven winning benchmarks materially improves your product metrics, pick Grok 4.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions