Devstral Small 1.1 vs GPT-4.1 Nano

GPT-4.1 Nano is the stronger pick across most of our suite: it wins 5 head-to-head tests to Devstral Small 1.1's 1 (the other 6 are ties) and leads on structured output and faithfulness. Devstral Small 1.1 is the lower-cost choice and wins classification, so pick it when token cost and routing accuracy matter most.

Devstral Small 1.1 (Mistral)

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok
Context Window: 131K

GPT-4.1 Nano (OpenAI)

Overall: 3.58/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 4/5
Constrained Rewriting: 4/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: 70.0%
AIME 2025: 28.9%

Pricing

Input: $0.100/MTok
Output: $0.400/MTok
Context Window: 1,048K

Benchmark Analysis

Head-to-head, Devstral Small 1.1 wins one benchmark: classification (4 vs GPT-4.1 Nano's 3), where it is tied for 1st in our rankings with 29 other models. GPT-4.1 Nano wins five: structured output (5 vs 4), constrained rewriting (4 vs 3), faithfulness (5 vs 4), persona consistency (4 vs 2), and agentic planning (4 vs 2). It ranks particularly well on structured output (tied for 1st with 24 others) and faithfulness (tied for 1st with 32 others), meaning it reliably follows JSON/schema constraints and sticks close to source material, which matters for APIs that require exact schema outputs and low hallucination rates.

The remaining six benchmarks are ties: tool calling (4/4), long context (4/4), safety calibration (2/2), multilingual (4/4), strategic analysis (2/2), and creative problem solving (2/2). In practice, expect similar behavior on function selection, retrieval at 30k+ tokens, basic refusal calibration, and non-obvious idea generation.

For rankings context: Devstral's persona consistency score of 2 places it near the bottom (rank 51 of 53), while GPT-4.1 Nano's 4 sits mid-pack (rank 38 of 53). On external math benchmarks, GPT-4.1 Nano reports MATH Level 5 = 70.0% and AIME 2025 = 28.9% (Epoch AI); Devstral Small 1.1 has no reported external math scores.

Overall, GPT-4.1 Nano's wins concentrate on format fidelity, faithfulness, persona consistency, and multi-step planning, the attributes that matter most for production pipelines that enforce strict output formats and need low hallucination rates. Devstral's standouts are classification accuracy and lower cost.

Benchmark                  Devstral Small 1.1   GPT-4.1 Nano
Faithfulness               4/5                  5/5
Long Context               4/5                  4/5
Multilingual               4/5                  4/5
Tool Calling               4/5                  4/5
Classification             4/5                  3/5
Agentic Planning           2/5                  4/5
Structured Output          4/5                  5/5
Safety Calibration         2/5                  2/5
Strategic Analysis         2/5                  2/5
Persona Consistency        2/5                  4/5
Constrained Rewriting      3/5                  4/5
Creative Problem Solving   2/5                  2/5
Summary                    1 win                5 wins
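
The practical payoff of a high structured-output score is that you can gate replies behind a strict parser and see few rejections. Here is a minimal sketch of such a gate, assuming a hypothetical classification task whose contract is a flat JSON object; the field names and sample reply are our illustration, not drawn from the benchmark suite:

```python
import json

# Illustrative output contract for a hypothetical extraction task; the
# field names and sample reply below are ours, not from the test suite.
REQUIRED_FIELDS = {"label": str, "confidence": float}

def parse_strict(reply: str) -> dict:
    """Reject any model reply that is not exactly the expected JSON object."""
    data = json.loads(reply)  # raises json.JSONDecodeError on malformed JSON
    if set(data) != set(REQUIRED_FIELDS):
        raise ValueError(f"unexpected keys: {sorted(data)}")
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key} must be a {expected_type.__name__}")
    return data

print(parse_strict('{"label": "billing", "confidence": 0.92}'))
# -> {'label': 'billing', 'confidence': 0.92}
```

A model that scores 5/5 on structured output keeps the rejection branch rare; with a 4/5 model you should expect to retry occasionally.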

Pricing Analysis

Devstral Small 1.1 charges $0.10 per million input tokens and $0.30 per million output tokens; GPT-4.1 Nano charges $0.10 per million input and $0.40 per million output. For a concrete scenario of 1M input + 1M output tokens per month: Devstral = $0.10 + $0.30 = $0.40/month; GPT-4.1 Nano = $0.10 + $0.40 = $0.50/month. At 10M in + 10M out: Devstral ≈ $4 vs GPT-4.1 Nano ≈ $5. At 100M in + 100M out: Devstral ≈ $40 vs GPT-4.1 Nano ≈ $50. The output-price gap ($0.10/MTok) scales linearly, so it matters most for high-throughput apps: chat logs, long generation pipelines, or large batch inference. For low-volume or feature-driven use (strict structured outputs, stronger faithfulness, multimodal inputs), GPT-4.1 Nano's higher output price is easy to justify; for cost-sensitive routing or classification tasks, Devstral's savings add up quickly.
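
The arithmetic is worth encoding once so volume projections stay honest. This short sketch hard-codes the per-MTok prices from the cards above (the model names are display labels only):

```python
# Sanity-check of the per-MTok arithmetic above. Prices are USD per
# million tokens, copied from the model cards on this page.
PRICES = {
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
    "GPT-4.1 Nano": {"input": 0.10, "output": 0.40},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month's traffic, measured in millions of tokens."""
    p = PRICES[model]
    return input_mtok * p["input"] + output_mtok * p["output"]

for mtok in (1, 10, 100):  # 1M, 10M, 100M tokens in and out per month
    d = monthly_cost("Devstral Small 1.1", mtok, mtok)
    n = monthly_cost("GPT-4.1 Nano", mtok, mtok)
    print(f"{mtok:>3}M in + {mtok}M out: Devstral ${d:,.2f} vs Nano ${n:,.2f}")
```

Running it reproduces the figures above: $0.40 vs $0.50 at 1M+1M, up to $40 vs $50 at 100M+100M.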

Real-World Cost Comparison

Task             Devstral Small 1.1   GPT-4.1 Nano
Chat response    <$0.001              <$0.001
Blog post        <$0.001              <$0.001
Document batch   $0.017               $0.022
Pipeline run     $0.170               $0.220

Bottom Line

Choose Devstral Small 1.1 if: you need the lowest per-token cost and strong classification/routing (Devstral scores 4 vs GPT-4.1 Nano's 3 in classification and is tied for 1st in our ranking), or you run very high-volume workloads where the $0.10/MTok output savings materially reduce monthly bills.

Choose GPT-4.1 Nano if: you require strict schema/JSON compliance, higher faithfulness (5 vs 4), better persona consistency, constrained rewriting, or agentic planning (it wins all of these in our suite), or you need multimodal inputs (GPT-4.1 Nano supports text+image+file->text).
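
If both models end up in the same stack, the decision rule above collapses to a few lines of routing. A sketch, with hypothetical task labels and placeholder model identifiers (not official API model IDs):

```python
# Illustrative router following the guidance above: schema-critical and
# planning-heavy tasks go to GPT-4.1 Nano, everything else defaults to
# the cheaper Devstral. Task labels and model strings are placeholders.
NANO_TASKS = {"structured_output", "agentic_planning", "constrained_rewriting"}

def pick_model(task: str) -> str:
    if task in NANO_TASKS:
        return "gpt-4.1-nano"
    return "devstral-small-1.1"  # default to the lower output price

assert pick_model("classification") == "devstral-small-1.1"
assert pick_model("structured_output") == "gpt-4.1-nano"
```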

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
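
For the curious, a 1-to-5 judge loop is mechanically simple. The sketch below is our illustration only: the rubric text and the call_model helper are hypothetical stand-ins, not the actual prompts or harness behind these scores; only the 1-5 scale comes from the description above.

```python
import re

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat call to the judge model."""
    raise NotImplementedError("wire this up to your provider's SDK")

# Placeholder rubric; real judge prompts are usually far more detailed.
RUBRIC = (
    "Score the candidate answer from 1 (fails the task) to 5 (flawless).\n"
    "Reply with a single integer and nothing else.\n\n"
    "Task: {task}\n"
    "Candidate answer: {answer}"
)

def judge(task: str, answer: str) -> int:
    reply = call_model(RUBRIC.format(task=task, answer=answer))
    match = re.search(r"[1-5]", reply)  # tolerate stray words around the digit
    if match is None:
        raise ValueError(f"judge reply had no 1-5 score: {reply!r}")
    return int(match.group())
```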

Frequently Asked Questions