Devstral Small 1.1 vs GPT-4.1 Nano

GPT-4.1 Nano is the stronger pick across most of our suite: it wins 5 head-to-head tests to Devstral Small 1.1's 1 (the other 6 are ties) and leads on structured output and faithfulness. Devstral Small 1.1 is the lower-cost choice and wins classification, so pick it when token cost and routing accuracy matter most.

Devstral Small 1.1 (Mistral)

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok
Context Window: 131K

GPT-4.1 Nano (OpenAI)

Overall: 3.58/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 4/5
Constrained Rewriting: 4/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: 70.0%
AIME 2025: 28.9%

Pricing

Input: $0.100/MTok
Output: $0.400/MTok
Context Window: 1,048K

Benchmark Analysis

Head-to-head, Devstral Small 1.1 wins one benchmark: classification (4 vs GPT-4.1 Nano's 3), where it is tied for 1st in our rankings with 29 other models. GPT-4.1 Nano wins five: structured output (5 vs 4), constrained rewriting (4 vs 3), faithfulness (5 vs 4), persona consistency (4 vs 2), and agentic planning (4 vs 2). It ranks particularly well on structured output (tied for 1st with 24 others) and faithfulness (tied for 1st with 32 others), meaning it reliably follows JSON/schema constraints and sticks close to source material, which matters for APIs that require exact schema outputs and low hallucination rates.

The remaining six benchmarks are ties: tool calling (4/4), long context (4/4), safety calibration (2/2), multilingual (4/4), strategic analysis (2/2), and creative problem solving (2/2). In practice, expect similar behavior on function selection, retrieval at 30k+ tokens, basic refusal calibration, and non-obvious idea generation.

For rankings context: Devstral's persona consistency score of 2 places it near the bottom (rank 51 of 53), while GPT-4.1 Nano's 4 sits mid-pack (rank 38 of 53). On external math benchmarks, GPT-4.1 Nano reports MATH Level 5 = 70.0% and AIME 2025 = 28.9% (Epoch AI); Devstral Small 1.1 has no reported external math scores.

Overall, GPT-4.1 Nano's wins concentrate on format fidelity, faithfulness, persona consistency, and multi-step planning, the attributes that matter most for production pipelines that enforce strict output formats and need low hallucination rates. Devstral's standouts are classification accuracy and lower cost.

Benchmark                  Devstral Small 1.1   GPT-4.1 Nano
Faithfulness               4/5                  5/5
Long Context               4/5                  4/5
Multilingual               4/5                  4/5
Tool Calling               4/5                  4/5
Classification             4/5                  3/5
Agentic Planning           2/5                  4/5
Structured Output          4/5                  5/5
Safety Calibration         2/5                  2/5
Strategic Analysis         2/5                  2/5
Persona Consistency        2/5                  4/5
Constrained Rewriting      3/5                  4/5
Creative Problem Solving   2/5                  2/5
Summary                    1 win                5 wins
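
The practical payoff of a high structured-output score is that you can gate replies behind a strict parser and see few rejections. Here is a minimal sketch of such a gate, assuming a hypothetical classification task whose contract is a flat JSON object; the field names and sample reply are our illustration, not drawn from the benchmark suite:

```python
import json

# Illustrative output contract for a hypothetical extraction task; the
# field names and sample reply below are ours, not from the test suite.
REQUIRED_FIELDS = {"label": str, "confidence": float}

def parse_strict(reply: str) -> dict:
    """Reject any model reply that is not exactly the expected JSON object."""
    data = json.loads(reply)  # raises json.JSONDecodeError on malformed JSON
    if set(data) != set(REQUIRED_FIELDS):
        raise ValueError(f"unexpected keys: {sorted(data)}")
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key} must be a {expected_type.__name__}")
    return data

print(parse_strict('{"label": "billing", "confidence": 0.92}'))
# -> {'label': 'billing', 'confidence': 0.92}
```

A model that scores 5/5 on structured output keeps the rejection branch rare; with a 4/5 model you should expect to retry occasionally.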

Pricing Analysis

Devstral Small 1.1 charges $0.10 per million input tokens and $0.30 per million output tokens; GPT-4.1 Nano charges $0.10 per million input and $0.40 per million output. For a concrete scenario of 1M input + 1M output tokens per month: Devstral = $0.10 + $0.30 = $0.40/month; GPT-4.1 Nano = $0.10 + $0.40 = $0.50/month. At 10M in + 10M out: Devstral ≈ $4 vs GPT-4.1 Nano ≈ $5. At 100M in + 100M out: Devstral ≈ $40 vs GPT-4.1 Nano ≈ $50. The output-price gap ($0.10/MTok) scales linearly, so it matters most for high-throughput apps: chat logs, long generation pipelines, or large batch inference. For low-volume or feature-driven use (strict structured outputs, stronger faithfulness, multimodal inputs), GPT-4.1 Nano's higher output price is easy to justify; for cost-sensitive routing or classification tasks, Devstral's savings add up quickly.
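
The arithmetic is worth encoding once so volume projections stay honest. This short sketch hard-codes the per-MTok prices from the cards above (the model names are display labels only):

```python
# Sanity-check of the per-MTok arithmetic above. Prices are USD per
# million tokens, copied from the model cards on this page.
PRICES = {
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
    "GPT-4.1 Nano": {"input": 0.10, "output": 0.40},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month's traffic, measured in millions of tokens."""
    p = PRICES[model]
    return input_mtok * p["input"] + output_mtok * p["output"]

for mtok in (1, 10, 100):  # 1M, 10M, 100M tokens in and out per month
    d = monthly_cost("Devstral Small 1.1", mtok, mtok)
    n = monthly_cost("GPT-4.1 Nano", mtok, mtok)
    print(f"{mtok:>3}M in + {mtok}M out: Devstral ${d:,.2f} vs Nano ${n:,.2f}")
```

Running it reproduces the figures above: $0.40 vs $0.50 at 1M+1M, up to $40 vs $50 at 100M+100M.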

Real-World Cost Comparison

Task             Devstral Small 1.1   GPT-4.1 Nano
Chat response    <$0.001              <$0.001
Blog post        <$0.001              <$0.001
Document batch   $0.017               $0.022
Pipeline run     $0.170               $0.220

Bottom Line

Choose Devstral Small 1.1 if: you need the lowest per-token cost and strong classification/routing (Devstral scores 4 vs GPT-4.1 Nano's 3 in classification and is tied for 1st in our ranking), or you run very high-volume workloads where the $0.10/MTok output savings materially reduce monthly bills.

Choose GPT-4.1 Nano if: you require strict schema/JSON compliance, higher faithfulness (5 vs 4), better persona consistency, constrained rewriting, or agentic planning (it wins all of these in our suite), or you need multimodal inputs (GPT-4.1 Nano supports text+image+file->text).
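
If both models end up in the same stack, the decision rule above collapses to a few lines of routing. A sketch, with hypothetical task labels and placeholder model identifiers (not official API model IDs):

```python
# Illustrative router following the guidance above: schema-critical and
# planning-heavy tasks go to GPT-4.1 Nano, everything else defaults to
# the cheaper Devstral. Task labels and model strings are placeholders.
NANO_TASKS = {"structured_output", "agentic_planning", "constrained_rewriting"}

def pick_model(task: str) -> str:
    if task in NANO_TASKS:
        return "gpt-4.1-nano"
    return "devstral-small-1.1"  # default to the lower output price

assert pick_model("classification") == "devstral-small-1.1"
assert pick_model("structured_output") == "gpt-4.1-nano"
```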

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
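
For the curious, a 1-to-5 judge loop is mechanically simple. The sketch below is our illustration only: the rubric text and the call_model helper are hypothetical stand-ins, not the actual prompts or harness behind these scores; only the 1-5 scale comes from the description above.

```python
import re

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat call to the judge model."""
    raise NotImplementedError("wire this up to your provider's SDK")

# Placeholder rubric; real judge prompts are usually far more detailed.
RUBRIC = (
    "Score the candidate answer from 1 (fails the task) to 5 (flawless).\n"
    "Reply with a single integer and nothing else.\n\n"
    "Task: {task}\n"
    "Candidate answer: {answer}"
)

def judge(task: str, answer: str) -> int:
    reply = call_model(RUBRIC.format(task=task, answer=answer))
    match = re.search(r"[1-5]", reply)  # tolerate stray words around the digit
    if match is None:
        raise ValueError(f"judge reply had no 1-5 score: {reply!r}")
    return int(match.group())
```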

Frequently Asked Questions