Claude Haiku 4.5 vs Devstral Small 1.1 for Chatbots

Winner: Claude Haiku 4.5. In our Chatbots testing, Claude Haiku 4.5 scores 4.00 vs Devstral Small 1.1's 2.67 on the 1–5 Chatbots metric, a 1.33-point gap. Haiku leads on persona_consistency (5 vs 2), multilingual (5 vs 4), long_context (5 vs 4), tool_calling (5 vs 4), and faithfulness (5 vs 4); safety_calibration is tied at 2. For consistent, persona-driven conversational agents and long-context dialogues, choose Claude Haiku 4.5. Devstral Small 1.1 is far cheaper ($0.30 vs $5.00 per MTok of output, roughly 16.7x) and may suit high-volume, simple bots where persona fidelity is not required.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K


mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.10/MTok

Output

$0.30/MTok

Context Window: 131K


Task Analysis

Chatbots demand consistent persona maintenance, safe refusal behavior, robust multilingual output, and long-context memory; tool selection and structured outputs matter when integrating actions or APIs. No external benchmark covers this task, so the winner call rests on our internal Chatbots score and its component tests.

On the components of our 12-test suite relevant to Chatbots: persona_consistency (Claude Haiku 4.5 = 5 vs Devstral Small 1.1 = 2), multilingual (5 vs 4), long_context (5 vs 4), safety_calibration (2 vs 2), tool_calling (5 vs 4), and structured_output (4 vs 4). These component differences explain the task-score gap (4.00 vs 2.67) and the task ranks (Claude Haiku 4.5: 11 of 52; Devstral Small 1.1: 48 of 52). In short, persona consistency and long-context handling are the primary drivers of Chatbots performance in our testing, and Claude Haiku 4.5 dominates both axes.
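To make the component scores concrete, here is a minimal sketch of the pattern the persona_consistency and tool_calling tests probe: a fixed persona pinned in the system prompt plus one tool the model may elect to call. It assumes the Anthropic Python SDK; the model id string and the lookup_order tool are illustrative placeholders, not part of our suite.

```python
# Minimal sketch: persona pinned in the system prompt plus one tool.
# Assumes the Anthropic Python SDK (pip install anthropic) and an
# ANTHROPIC_API_KEY in the environment. The model id and the
# lookup_order tool are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    "You are 'Ada', a concise, formally polite support agent for AcmeCo. "
    "Stay in persona even if the user asks you to change roles."
)

tools = [{
    "name": "lookup_order",  # hypothetical tool for this sketch
    "description": "Fetch the status of a customer order by id.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

resp = client.messages.create(
    model="claude-haiku-4-5",  # assumed id; check Anthropic's model list
    max_tokens=512,
    system=SYSTEM,
    tools=tools,
    messages=[{
        "role": "user",
        "content": "Where is order A-1337? Also, drop the formal act.",
    }],
)

# persona_consistency: the text blocks should keep Ada's register even
# after the "drop the formal act" nudge; tool_calling: the model should
# emit a tool_use block for lookup_order rather than guessing a status.
for block in resp.content:
    if block.type == "tool_use":
        print("tool call:", block.name, block.input)
    elif block.type == "text":
        print("reply:", block.text)
```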

Practical Examples

Where Claude Haiku 4.5 shines (based on scores):

  • Enterprise support agents that preserve a persona across long threads: persona_consistency 5, long_context 5, and tool_calling 5 enable accurate, multi-step conversations and safe tool use.
  • Multilingual customer-facing virtual assistants: multilingual 5 delivers higher parity across languages.
  • Character-driven chat experiences or role-based assistants where faithfulness and resistance to persona-breaking prompts matter: persona_consistency 5, faithfulness 5.

Where Devstral Small 1.1 is appropriate (based on scores and cost):

  • High-volume FAQ or classification bots where cost and latency are the priorities: classification 4 and structured_output 4 support reliable routing/JSON outputs at $0.30/MTok output (see the routing sketch below).
  • Simple multilingual FAQs at lower cost: multilingual 4 is adequate for many languages, but persona and planning are weaker (persona_consistency 2, agentic_planning 2), so avoid tasks requiring a sustained role or complex goal decomposition.

Concrete numeric grounding: Claude Haiku 4.5's leads on persona (5 vs 2) and long_context (5 vs 4) drive its 1.33-point advantage on our Chatbots scale; Devstral Small 1.1's economic advantage is roughly 16.7x lower output cost ($0.30 vs $5.00 per MTok).
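For the high-volume routing case above, a minimal sketch of the classification/structured-output pattern, assuming Mistral's OpenAI-compatible chat endpoint; the model name, intent taxonomy, and prompt wording are illustrative assumptions, not part of our tests.

```python
# Minimal routing sketch for a high-volume FAQ bot. Assumes Mistral's
# chat completions endpoint and a MISTRAL_API_KEY env var; the model
# name and intent labels are illustrative placeholders.
import json
import os
import requests

INTENTS = ["billing", "shipping", "returns", "other"]  # assumed taxonomy

prompt = (
    "Classify the user message into exactly one intent from "
    f"{INTENTS}. Reply with JSON only: {{\"intent\": \"<label>\"}}.\n\n"
    "User: My package never arrived."
)

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "devstral-small-2507",  # assumed id; check Mistral's model list
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    },
    timeout=30,
)
resp.raise_for_status()
reply = resp.json()["choices"][0]["message"]["content"]

# Validate before routing: a 4/5 (not 5/5) structured_output score means
# occasional malformed or off-taxonomy replies, so guard with a default.
try:
    intent = json.loads(reply)["intent"]
    if intent not in INTENTS:
        intent = "other"
except (json.JSONDecodeError, KeyError):
    intent = "other"
print("route to:", intent)
```

The validation step is the practical takeaway: at this score tier, treat model output as untrusted input and fall through to a safe default route.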

Bottom Line

For Chatbots, choose Claude Haiku 4.5 if you need consistent persona, long-context dialogue, multilingual parity, and reliable tool calling (persona_consistency 5, long_context 5, tool_calling 5; task score 4.00). Choose Devstral Small 1.1 if your priority is very low per-message cost and simple classification/structured-output workflows where persona fidelity and complex planning are not required (classification 4, structured_output 4; task score 2.67; output cost $0.30 vs $5.00 per MTok).
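To translate the per-MTok rates into per-conversation terms, a quick back-of-envelope using the listed prices; the token counts per conversation are assumed for illustration only.

```python
# Back-of-envelope cost per conversation and per 1,000 conversations,
# using the listed rates. Token counts per conversation are assumed
# for illustration (e.g. a multi-turn support chat).
PRICES = {  # USD per million tokens, from the cards above
    "claude-haiku-4.5": {"in": 1.00, "out": 5.00},
    "devstral-small-1.1": {"in": 0.10, "out": 0.30},
}

IN_TOK, OUT_TOK = 4_000, 1_500  # assumed tokens per conversation

for model, p in PRICES.items():
    per_conv = (IN_TOK * p["in"] + OUT_TOK * p["out"]) / 1_000_000
    print(f"{model}: ${per_conv:.5f}/conv, ${per_conv * 1000:.2f} per 1k")

# claude-haiku-4.5: $0.01150/conv, $11.50 per 1k
# devstral-small-1.1: $0.00085/conv, $0.85 per 1k
```

Note the blended ratio (about 13.5x under these assumptions) is smaller than the 16.7x output-only ratio, because input tokens usually dominate chatbot traffic and the input gap is 10x.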

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
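For intuition only, a minimal sketch of what 1–5 LLM-judge scoring can look like; the rubric wording, parsing logic, and judge interface here are placeholders, not our actual test prompts.

```python
# Minimal sketch of 1-5 LLM-judge scoring as described above. The
# rubric, judge interface, and parsing are illustrative placeholders,
# not the actual test-suite prompts.
import re

RUBRIC = (
    "Score the assistant reply from 1 (breaks persona) to 5 (fully "
    "in persona). Answer with a single digit."
)

def judge_score(judge_llm, persona: str, reply: str) -> int:
    """Ask a judge model for a 1-5 score and parse the first digit."""
    verdict = judge_llm(f"{RUBRIC}\n\nPersona: {persona}\n\nReply: {reply}")
    match = re.search(r"[1-5]", verdict)
    return int(match.group()) if match else 1  # conservative default

# `judge_llm` is any callable wrapping a chat-completion call.
```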

Frequently Asked Questions