Claude Haiku 4.5 vs Devstral Small 1.1
Claude Haiku 4.5 is the practical winner for most developers and product teams, winning 8 of our 12 benchmarks (tool calling, long context, faithfulness, agentic planning and more). Devstral Small 1.1 doesn't beat Haiku on any benchmark in our tests but is drastically cheaper ($0.10 input / $0.30 output per MTok vs Haiku's $1 / $5), making it the right pick for high-volume, cost-sensitive inference.
Anthropic
Claude Haiku 4.5
Pricing
Input
$1.00/MTok
Output
$5.00/MTok
Mistral
Devstral Small 1.1
Pricing
Input
$0.10/MTok
Output
$0.30/MTok
Benchmark Analysis
Overview: In our 12-test suite, Haiku wins 8 tests, Devstral wins 0, and 4 are ties (tallied in the sketch below). Scores (Haiku vs Devstral): strategic_analysis 5 vs 2; creative_problem_solving 4 vs 2; tool_calling 5 vs 4; faithfulness 5 vs 4; long_context 5 vs 4; persona_consistency 5 vs 2; agentic_planning 5 vs 2; multilingual 5 vs 4; structured_output 4 vs 4 (tie); constrained_rewriting 3 vs 3 (tie); classification 4 vs 4 (tie); safety_calibration 2 vs 2 (tie).

What this means in practice:

- Strategic analysis (5 vs 2): Haiku's top score (tied for 1st among 54) shows it handles nuanced tradeoffs and multi-step numeric reasoning far better, which matters for pricing models, financial tradeoffs, or product prioritization.
- Tool calling (5 vs 4): Haiku is tied for 1st (out of 54) on function selection and argument accuracy; expect fewer incorrect tool calls and better sequencing in agentic workflows.
- Long context (5 vs 4): Haiku is tied for 1st on 30K+ retrieval tasks; Devstral ranks lower (38 of 55). Choose Haiku for long transcripts or documents.
- Faithfulness (5 vs 4): Haiku ties for 1st (lower hallucination risk in our tests); Devstral is mid-pack (rank 34/55).
- Agentic planning and persona consistency (both 5 vs 2): Haiku ties for 1st on both; Devstral scores poorly on planning and persona, making Haiku much better for robust multi-step agents and consistent system personas.
- Creative problem solving (4 vs 2): Haiku ranks in the top decile (9/54); Devstral is low (47/54), so Haiku produces more feasible, non-obvious ideas.
- Ties (structured_output, constrained_rewriting, classification, safety_calibration): Both models perform equivalently on schema adherence, compression within tight limits, categorization, and safety refusal behavior (both score 2 on safety). Rankings provide context: classification, for example, is tied for 1st for both models, so neither has an advantage for routing or categorization tasks in our tests.

In short: Haiku's wins concentrate on complex reasoning, agentic workflows, long-context retrieval and multilingual fidelity; Devstral matches Haiku on basic structured outputs and classification but lags on higher-level reasoning and planning.
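The win/tie tally above is easy to reproduce. Here is a minimal Python sketch that restates the score table as a dict (the keys mirror the benchmark names above; the code is illustrative, not part of our test harness):

```python
# Head-to-head tally of the per-benchmark scores listed above.
# Each pair is (Claude Haiku 4.5 score, Devstral Small 1.1 score).
scores = {
    "strategic_analysis": (5, 2), "creative_problem_solving": (4, 2),
    "tool_calling": (5, 4), "faithfulness": (5, 4),
    "long_context": (5, 4), "persona_consistency": (5, 2),
    "agentic_planning": (5, 2), "multilingual": (5, 4),
    "structured_output": (4, 4), "constrained_rewriting": (3, 3),
    "classification": (4, 4), "safety_calibration": (2, 2),
}

haiku_wins = sum(h > d for h, d in scores.values())
devstral_wins = sum(d > h for h, d in scores.values())
ties = sum(h == d for h, d in scores.values())
print(f"Haiku {haiku_wins}, Devstral {devstral_wins}, ties {ties}")
# -> Haiku 8, Devstral 0, ties 4
```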
Pricing Analysis
All prices are per million tokens: Claude Haiku 4.5 costs $1 input / $5 output; Devstral Small 1.1 costs $0.10 input / $0.30 output. Under a 50/50 input-output token mix, Haiku costs $3.00/M (0.5 × $1 + 0.5 × $5) vs Devstral's $0.20/M (0.5 × $0.10 + 0.5 × $0.30). At 1M tokens/month that's $3.00 vs $0.20; at 10M it's $30 vs $2; at 100M it's $300 vs $20. If your workload is output-heavy (e.g., 20% input / 80% output), Haiku rises to $4.20/M while Devstral is $0.26/M. That roughly 15x blended cost gap (over 16x on output tokens alone) means teams doing millions of tokens/month or running large deployed agents should prioritize Devstral for cost efficiency; teams that need top-tier tool calling, long context and faithfulness should budget for Haiku despite the higher cost.
Real-World Cost Comparison
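The blended figures above are straightforward to compute for any traffic mix. Here is a minimal Python sketch using the list prices quoted above (the model keys and function name are illustrative, not API identifiers):

```python
# Blended cost per 1M tokens for a given input/output mix,
# using the per-million-token list prices quoted above (USD).
PRICES = {
    "claude-haiku-4.5": {"input": 1.00, "output": 5.00},
    "devstral-small-1.1": {"input": 0.10, "output": 0.30},
}

def blended_cost(model: str, input_share: float) -> float:
    """USD per 1M tokens when `input_share` of tokens are input."""
    p = PRICES[model]
    return input_share * p["input"] + (1 - input_share) * p["output"]

for mix in (0.5, 0.2):  # 50/50 and 20/80 input/output mixes
    h = blended_cost("claude-haiku-4.5", mix)
    d = blended_cost("devstral-small-1.1", mix)
    print(f"{mix:.0%} input: Haiku ${h:.2f}/M vs Devstral ${d:.2f}/M ({h / d:.1f}x)")
# 50% input: Haiku $3.00/M vs Devstral $0.20/M (15.0x)
# 20% input: Haiku $4.20/M vs Devstral $0.26/M (16.2x)
```

Multiply the blended rate by your monthly token volume (in millions) to get a monthly bill: for example, 100M tokens at a 50/50 mix is $300 on Haiku vs $20 on Devstral.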
Bottom Line
Choose Claude Haiku 4.5 if you need:

- Best-in-suite tool calling, long-context retrieval, faithfulness and agentic planning (scores of 5 and top ranks in our tests), e.g., multi-step agents, long-document analysis, multilingual assistants, or apps where correctness matters more than per-token cost.

Choose Devstral Small 1.1 if you need:

- Extremely low per-token cost ($0.10/$0.30 per MTok) with parity on structured output and classification, e.g., large-scale inference, low-cost customer routing, or batch classification where advanced planning and deep reasoning are not required.

If budget and scale dominate (10M+ tokens/month), Devstral's cost advantage becomes decisive; if accuracy, planning, and long-context fidelity matter, budget for Haiku.
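To make the rule of thumb concrete, here is an illustrative decision sketch; the thresholds and flag names are assumptions distilled from the tradeoffs above, not a product API:

```python
# Hypothetical rule-of-thumb model picker based on the guidance above.
def pick_model(monthly_tokens: int, needs_planning_or_long_context: bool) -> str:
    if needs_planning_or_long_context:
        # Haiku swept tool calling, agentic planning, and 30K+ retrieval in our tests.
        return "Claude Haiku 4.5"
    if monthly_tokens >= 10_000_000:
        # ~15x cheaper per blended token; decisive at scale.
        return "Devstral Small 1.1"
    # At small volume the absolute cost gap is negligible; take the stronger model.
    return "Claude Haiku 4.5"

print(pick_model(monthly_tokens=50_000_000, needs_planning_or_long_context=False))
# -> Devstral Small 1.1
```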
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.