Claude Haiku 4.5 vs Devstral Medium for Chatbots

Winner: Claude Haiku 4.5. In our testing, Claude Haiku 4.5 scores 4.00 on the Chatbots task versus Devstral Medium's 2.67, a clear 1.33-point lead. Haiku 4.5 outperforms on persona_consistency (5 vs 3), multilingual (5 vs 4), safety_calibration (2 vs 1), and on long-context and tool-calling capabilities, all critical for conversational agents. Devstral Medium is cheaper ($0.40/$2.00 input/output vs Haiku's $1.00/$5.00 per MTok) and matches Haiku on structured_output and classification, but its lower persona-consistency, safety, and long-context scores make it a weaker choice for production chatbots in our benchmarks.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok
Context Window: 200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.40/MTok
Output: $2.00/MTok
Context Window: 131K


Task Analysis

What Chatbots demand: consistent persona, safe refusals and calibrated permissions, reliable multilingual output, and the ability to maintain long conversation context. Our Chatbots task uses three tests (persona_consistency, safety_calibration, multilingual), and the task composite is the primary measure for this verdict.

In our testing, Claude Haiku 4.5 posts persona_consistency 5, safety_calibration 2, and multilingual 5, producing the 4.00 task score and rank 11/52. Devstral Medium posts persona_consistency 3, safety_calibration 1, and multilingual 4, producing the 2.67 task score and rank 48/52.

Supporting proxies also matter: Haiku's long_context (5 vs 4) and tool_calling (5 vs 3) advantages reduce context loss and enable better function selection in multi-step dialogs. Modality and context window matter too: Haiku supports text+image->text with a 200,000-token context window versus Devstral Medium's text->text and 131,072 tokens, which benefits complex, multimodal conversational flows. Cost and latency trade-offs are secondary but important: Haiku's output cost is higher ($5.00 vs $2.00 per MTok).
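The 4.00 and 2.67 composites are consistent with an unweighted mean of the three test scores. A minimal sketch, assuming that averaging (the function name is illustrative, not part of our published methodology):

```python
def chatbot_task_score(persona_consistency, safety_calibration, multilingual):
    """Unweighted mean of the three Chatbots test scores, rounded to 2 decimals."""
    return round((persona_consistency + safety_calibration + multilingual) / 3, 2)

# Scores taken from the cards above.
chatbot_task_score(5, 2, 5)  # Claude Haiku 4.5 -> 4.0
chatbot_task_score(3, 1, 4)  # Devstral Medium -> 2.67
```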

Practical Examples

Where Claude Haiku 4.5 shines (grounded in scores):

  • Brand voice / persona bots: persona_consistency 5 vs 3 — less injection-driven persona drift and stronger character maintenance across long conversations. Ideal for customer support with a strict brand tone.
  • Multilingual support: multilingual 5 vs 4 — better parity across languages for global chat deployments.
  • Long-session, multimodal assistants: long_context 5 and modality text+image->text — handles 30K+ token histories and image-based follow-ups more reliably.
  • Safer escalation: safety_calibration 2 vs 1 — more likely to refuse harmful prompts appropriately (still modest, but better than Devstral in our tests).

Where Devstral Medium is useful (grounded in scores and costs):

  • Cost-sensitive routing / classification: classification 4 (tie) and structured_output 4 (tie) at lower input/output costs ($0.40/$2.00 vs $1.00/$5.00 per MTok) — good for high-volume, narrow-scope bots that primarily classify and emit structured payloads.
  • Simple multilingual support: multilingual 4 — acceptable for non-critical multi-language flows where strict persona or safety is less important.
  • Lightweight assistants and experimental UIs: lower per-token cost reduces deployment expense for prototypes or internal tooling where persona fidelity is not critical.
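To make the cost trade-off concrete, here is a rough spend estimate using the card prices above. The workload figures (50M input and 10M output tokens per month) and the model keys are hypothetical, chosen only for illustration:

```python
# USD per million tokens, from the pricing cards above.
PRICES = {
    "claude-haiku-4.5": {"input": 1.00, "output": 5.00},
    "devstral-medium":  {"input": 0.40, "output": 2.00},
}

def monthly_cost(model, input_tok, output_tok):
    """Estimate monthly spend in USD for a given token workload."""
    p = PRICES[model]
    return round((input_tok * p["input"] + output_tok * p["output"]) / 1_000_000, 2)

# Hypothetical high-volume bot: 50M input + 10M output tokens per month.
monthly_cost("claude-haiku-4.5", 50e6, 10e6)  # $100.00
monthly_cost("devstral-medium", 50e6, 10e6)   # $40.00
```

At this volume the 2.5x price gap translates to real money, which is why Devstral can still win for narrow, high-throughput routing bots.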

Bottom Line

For Chatbots, choose Claude Haiku 4.5 if you need strong persona consistency, robust multilingual output, long-context conversations, or multimodal (image+text) capabilities and are willing to pay higher per-token output costs. Choose Devstral Medium if you prioritize lower per-token cost and need solid structured output and classification for high-volume, simpler conversational routing where strict persona fidelity and advanced safety calibration are not required.
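The decision rule above can be sketched as a simple routing function. The function and the model identifiers are illustrative, not actual API model names:

```python
def pick_chatbot_model(needs_persona_fidelity, needs_safety, needs_images, cost_sensitive):
    """Illustrative routing rule distilled from this comparison."""
    if needs_persona_fidelity or needs_safety or needs_images:
        return "claude-haiku-4.5"   # strong persona, better safety, text+image input
    if cost_sensitive:
        return "devstral-medium"    # cheaper per token, solid structured output
    return "claude-haiku-4.5"       # default to the higher task score

# e.g. a strict brand-voice support bot stays on Haiku even under cost pressure:
pick_chatbot_model(True, True, False, cost_sensitive=True)  # "claude-haiku-4.5"
```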

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
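As a concrete example of how the per-benchmark judge scores roll up, the overall ratings shown above (4.33 and 3.17) are consistent with an unweighted mean of the twelve 1–5 scores from each card:

```python
# Twelve benchmark scores, in card order, from the scorecards above.
haiku_scores = [5, 5, 5, 5, 4, 5, 4, 2, 5, 5, 3, 4]
devstral_scores = [4, 4, 4, 3, 4, 4, 4, 1, 2, 3, 3, 2]

def overall(scores):
    """Unweighted mean of the benchmark scores, rounded to 2 decimals."""
    return round(sum(scores) / len(scores), 2)

overall(haiku_scores)     # 4.33
overall(devstral_scores)  # 3.17
```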

Frequently Asked Questions