Claude Haiku 4.5 vs Devstral Medium for Translation

Winner: Claude Haiku 4.5. In our testing, Claude Haiku 4.5 scores 5 on the Translation task versus Devstral Medium's 4 (the task is evaluated on multilingual and faithfulness tests). Haiku 4.5 leads on both multilingual (5 vs 4) and faithfulness (5 vs 4), and adds a much larger context window (200,000 vs 131,072 tokens) plus image-to-text input, both of which matter for long documents and image-driven localization. Devstral Medium is cheaper ($2.00 vs $5.00 per MTok of output) and still competent, but Haiku 4.5 is the clear choice for higher-quality, high-context translation in our benchmarks.

Anthropic

Claude Haiku 4.5

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok

Context Window: 200K tokens


Mistral

Devstral Medium

Overall: 3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.40/MTok
Output: $2.00/MTok

Context Window: 131K tokens


Task Analysis

What Translation demands: accurate cross-language equivalence, preservation of meaning (faithfulness), natural target-language phrasing, and consistent handling of long documents and localization constraints. Our Translation task is evaluated using two tests: multilingual and faithfulness. In our testing, Claude Haiku 4.5 scores 5 on both, indicating top-tier multilingual quality and source fidelity; Devstral Medium scores 4 on both. Supporting internal signals: Haiku's Long Context (5/5) and Persona Consistency (5/5) point to stronger performance on long documents and a consistent localized voice, and its Tool Calling (5/5) suggests easier integration with localization pipelines and CAT tools (a minimal API sketch follows this paragraph). Devstral Medium's strengths are cost-efficiency and solid Structured Output (4/5), but it ranks far lower on this task overall: Haiku ranks 1st of 52 models, Devstral 40th of 52.
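
For integration context, here is a minimal sketch of a translation call through the Anthropic Python SDK. The model ID (`claude-haiku-4-5`), the `translate` helper, and the prompt wording are assumptions for illustration, not part of our benchmark harness.

```python
# Minimal translation call via the Anthropic Python SDK (pip install anthropic).
# The model ID and prompt wording are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate(text: str, target_language: str = "German") -> str:
    """Translate `text` into `target_language`, returning only the translation."""
    message = client.messages.create(
        model="claude-haiku-4-5",  # assumed model ID
        max_tokens=2048,
        system=(
            "You are a professional translator. Translate the user's text into "
            f"{target_language}. Output only the translation, nothing else."
        ),
        messages=[{"role": "user", "content": text}],
    )
    return message.content[0].text

print(translate("The warranty does not cover water damage."))
```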

Practical Examples

Where Claude Haiku 4.5 shines (based on score differences):

  • Enterprise localization of long product manuals: Haiku's Long Context (5/5) and Faithfulness (5/5) reduce dropped references and preserve technical accuracy across chapters.
  • Image-rich localization (UI screenshots, menus): Haiku accepts text+image input, so it can take screenshots directly into translation workflows (see the image sketch after this list).
  • High-stakes marketing or legal translation: Multilingual (5/5) and Persona Consistency (5/5) help maintain tone and contractual precision.

Where Devstral Medium is appropriate (based on costs and scores):

  • Bulk, lower-cost translation of short-form content: $2.00/MTok output versus Haiku's $5.00/MTok lowers spend while delivering acceptable quality (task score 4); a worked cost comparison follows this list.
  • Rapid iteration on internal content or classification-led routing: Devstral's Structured Output (4/5) and Classification (4/5) make it effective for formatted translation pipelines where absolute top-tier fidelity is not required.

Quantified comparison: multilingual 5 vs 4 and faithfulness 5 vs 4 in our tests; context window 200,000 vs 131,072 tokens; output cost $5.00 vs $2.00 per MTok (Haiku vs Devstral).
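
The image sketch referenced above: passing a screenshot to Claude Haiku 4.5 via the Anthropic Messages API. The model ID, file name, and prompt are illustrative assumptions.

```python
# Hedged sketch: translating UI strings inside a screenshot with Claude Haiku 4.5.
# The model ID, file name, and prompt wording are illustrative assumptions.
import base64
import anthropic

client = anthropic.Anthropic()

with open("checkout_screen.png", "rb") as f:  # hypothetical screenshot
    screenshot_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-haiku-4-5",  # assumed model ID
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64}},
            {"type": "text",
             "text": "Extract every user-facing string in this screenshot "
                     "and translate it to French."},
        ],
    }],
)
print(message.content[0].text)
```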
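
And the worked cost comparison referenced above, using the list prices from the pricing cards. The monthly token volumes are hypothetical.

```python
# Worked cost comparison using the list prices above (USD per million tokens).
# The monthly token volumes are hypothetical.
PRICES = {
    "Claude Haiku 4.5": {"input": 1.00, "output": 5.00},
    "Devstral Medium": {"input": 0.40, "output": 2.00},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    p = PRICES[model]
    return input_mtok * p["input"] + output_mtok * p["output"]

# Example: 10M input tokens and 12M output tokens of translation per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 10, 12):.2f}/month")
# Claude Haiku 4.5: $70.00/month
# Devstral Medium: $28.00/month
```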

Bottom Line

For Translation, choose Claude Haiku 4.5 if you need top-tier multilingual accuracy and faithfulness, long-document or image-based localization, or tight persona consistency. Choose Devstral Medium if you prioritize lower per-token cost and good-enough translations for short-form or bulk internal content.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
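
As a rough illustration of that scoring loop (a hedged sketch, not our actual harness), here is what 1–5 rubric scoring with an LLM judge can look like; the judge model, rubric wording, and integer parsing are all assumptions.

```python
# Minimal illustration of 1-5 rubric scoring with an LLM judge; this is not
# modelpicker.net's actual harness. Judge model, rubric, and parsing are assumptions.
import anthropic

client = anthropic.Anthropic()

RUBRIC = (
    "You are grading a translation for faithfulness to its source. "
    "Reply with a single integer from 1 (unfaithful) to 5 (fully faithful)."
)

def judge_faithfulness(source: str, translation: str) -> int:
    message = client.messages.create(
        model="claude-haiku-4-5",  # assumed judge model
        max_tokens=4,
        system=RUBRIC,
        messages=[{
            "role": "user",
            "content": f"Source:\n{source}\n\nTranslation:\n{translation}",
        }],
    )
    return int(message.content[0].text.strip())
```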

For translation tasks, we supplement our benchmark suite with WMT/FLORES scores from Epoch AI, an independent research organization.

Frequently Asked Questions