Claude Haiku 4.5 vs Devstral Medium for Structured Output

Tie: Claude Haiku 4.5 and Devstral Medium both score 4/5 on Structured Output in our testing and share rank 26 of 52. Neither model outscored the other on the structured_output test itself. Choose between them based on tradeoffs: Claude Haiku 4.5 provides stronger supporting capabilities (tool_calling 5 vs 3, long_context 5 vs 4, faithfulness 5 vs 4) and multimodal input, while Devstral Medium is materially cheaper (output cost $2/MTok vs Haiku's $5/MTok) with the same structured_output score.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.40/MTok

Output

$2.00/MTok

Context Window: 131K


Task Analysis

Structured Output demands precise JSON schema compliance and strict format adherence (our structured_output benchmark definition). In our testing both models scored 4/5 on that task, so core structured-output capability is equivalent. What differentiates a reliable structured-output pipeline are the supporting capabilities: tool calling and function-argument accuracy (assembling arguments and invoking validators), long-context handling (keeping large schemas and examples in context), faithfulness (avoiding hallucinated fields), modality support (image→text when extracting structured data from images), and cost/throughput for production use. In our scores, Claude Haiku 4.5 outperforms Devstral Medium on tool_calling (5 vs 3), long_context (5 vs 4), and faithfulness (5 vs 4), which suggests it will be more robust on complex, stateful, or multimodal schema tasks. Devstral Medium matches Haiku on the structured_output metric itself (4/5) and on classification, while offering lower per-token cost (input $0.40 vs $1.00, output $2 vs $5 per MTok) and a 131K-token window, which is sufficient for many schema tasks.
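The schema-compliance checking described above can be sketched as a small validation loop. This is an illustrative stand-in, not our benchmark harness: the simplified schema format (a field-to-type mapping) and the sample invoice fields are assumptions for the example, not the full JSON Schema spec.

```python
import json

# Simplified stand-in schema: required field name -> expected Python type.
# (A real pipeline would use a full JSON Schema validator.)
SCHEMA = {"invoice_id": str, "total": float, "line_items": list}

def validate(raw: str, schema: dict) -> tuple[bool, list[str]]:
    """Parse a model response and report missing or mistyped fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"invalid JSON: {exc}"]
    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}: got {type(data[field]).__name__}")
    return not errors, errors

good = '{"invoice_id": "INV-7", "total": 42.5, "line_items": []}'
bad = '{"invoice_id": "INV-7", "total": "42.5"}'
print(validate(good, SCHEMA))  # (True, [])
print(validate(bad, SCHEMA))   # flags the string-typed total and missing line_items
```

A failed validation result like this is what a tool-calling-capable model can act on: feed the error list back and ask the model to repair its own output, which is where the tool_calling and faithfulness scores start to matter.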

Practical Examples

  1. Complex nested JSON with validator calls: Claude Haiku 4.5 is preferable because tool_calling=5 and long_context=5 help it select functions and keep large schema examples in context. Expect fewer manual fixes when orchestrating validation steps.
  2. High-volume templated JSON generation (API responses, logs): Devstral Medium is attractive because it ties Haiku on structured_output (4/5) but has lower output cost ($2/MTok vs $5/MTok), yielding roughly 2.5x cheaper outputs per token in bulk.
  3. Image-to-structured-data extraction (receipts, forms): Claude Haiku 4.5 supports text+image→text input, making it a better fit for multimodal extraction where schema fidelity matters.
  4. Small schemas and single-shot responses: Devstral Medium (structured_output 4/5, long_context 4) is a cost-efficient choice when tool calling or multimodal input is not required.

All examples reflect our internal scores (structured_output 4/5 each; supporting scores cited above).

Bottom Line

For Structured Output, choose Claude Haiku 4.5 if you need stronger tool-calling, larger context capacity, higher faithfulness, or image→text extraction and can accept higher cost. Choose Devstral Medium if you need the same structured_output quality at significantly lower token cost and your workflows don’t require advanced tool orchestration or multimodal input.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions