Devstral Medium vs GPT-4.1 Mini

GPT-4.1 Mini is the stronger general-purpose choice, winning 8 of 12 benchmarks in our testing against Devstral Medium's 1 win and 3 ties. It is also cheaper on output tokens ($1.60/M vs. $2.00/M), so you pay less and get more across most task types. Devstral Medium's only benchmark win is classification, which gives it a narrow edge for routing and categorization workloads; everything else favors GPT-4.1 Mini.

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok

Context Window: 131K (131,072 tokens)


GPT-4.1 Mini (OpenAI)

Overall: 3.92/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 4/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: 87.3%
AIME 2025: 44.7%

Pricing

Input: $0.400/MTok
Output: $1.60/MTok

Context Window: 1,048K (1,047,576 tokens)


Benchmark Analysis

Across our 12-test suite, GPT-4.1 Mini wins 8 benchmarks, Devstral Medium wins 1, and they tie on 3. Here's what that looks like test by test:

Where GPT-4.1 Mini leads:

  • Strategic analysis: GPT-4.1 Mini scores 4 vs. Devstral Medium's 2 (rank 27 of 54 vs. rank 44 of 54). A two-point gap here is significant — this test covers nuanced tradeoff reasoning with real numbers, the kind of work that matters for business analysis, due diligence, or decision support tools.
  • Persona consistency: 5 vs. 3 (tied for 1st among 53 models vs. rank 45 of 53). GPT-4.1 Mini maintains character and resists prompt injection at a top-tier level; Devstral Medium sits near the bottom of the field.
  • Tool calling: 4 vs. 3 (rank 18 of 54 vs. rank 47 of 54). Function selection, argument accuracy, and sequencing are critical for agentic workflows (see the sketch after this list). Devstral Medium's rank 47 here is a meaningful weakness given that it's described as an agentic reasoning model.
  • Multilingual: 5 vs. 4 (tied for 1st among 55 models vs. rank 36 of 55). GPT-4.1 Mini is at the ceiling; Devstral Medium is mid-field.
  • Long context: 5 vs. 4 (tied for 1st among 55 models vs. rank 38 of 55). Paired with GPT-4.1 Mini's 1M+ context window, this makes it the clear choice for document-heavy applications.
  • Safety calibration: 2 vs. 1 (rank 12 of 55 vs. rank 32 of 55). Both models score below the field median (p50 = 2), but GPT-4.1 Mini is at least in the upper half while Devstral Medium is near the bottom.
  • Creative problem solving: 3 vs. 2 (rank 30 of 54 vs. rank 47 of 54). Non-obvious and feasible idea generation favors GPT-4.1 Mini substantially.
  • Constrained rewriting: 4 vs. 3 (rank 6 of 53 vs. rank 31 of 53). Compression within hard character limits — GPT-4.1 Mini is near the top of the field here.
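
To make the tool-calling criteria concrete, here is a minimal sketch of how emitted tool calls can be checked against a reference trace. The trace format, function names, and equal weighting below are our own illustrative assumptions, not the benchmark's actual harness:

```python
# Minimal sketch of a tool-calling check covering the three criteria named
# above: function selection, argument accuracy, and sequencing. The trace
# format and equal weighting are illustrative assumptions.

def score_tool_calls(expected: list[dict], actual: list[dict]) -> float:
    """Compare a model's emitted call trace to a reference trace, 0.0-1.0."""
    if not expected:
        return 1.0 if not actual else 0.0
    n = len(expected)
    sequencing = arguments = 0.0
    for exp, act in zip(expected, actual):
        if act["name"] == exp["name"]:
            sequencing += 1  # right function at the right step
            want = exp.get("args", {})
            hits = sum(act.get("args", {}).get(k) == v for k, v in want.items())
            arguments += hits / max(len(want), 1)
    # Selection ignores order: did the model pick the right functions at all?
    selection = len({c["name"] for c in expected} & {c["name"] for c in actual}) / n
    return round((selection + sequencing / n + arguments / n) / 3, 3)

# Example: right functions, wrong order -> selection passes, the rest fails.
expected = [{"name": "search_flights", "args": {"dest": "NRT"}},
            {"name": "book_flight", "args": {"flight_id": "JL61"}}]
actual = [{"name": "book_flight", "args": {"flight_id": "JL61"}},
          {"name": "search_flights", "args": {"dest": "NRT"}}]
print(score_tool_calls(expected, actual))  # 0.333
```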

Where Devstral Medium leads:

  • Classification: 4 vs. 3 (tied for 1st among 53 models vs. rank 31 of 53). This is Devstral Medium's strongest result — it ties for the top spot on accurate categorization and routing tasks, while GPT-4.1 Mini sits in the lower-middle of the field.

Where they tie:

  • Structured output (both 4, rank 26 of 54), faithfulness (both 4, rank 34 of 55), and agentic planning (both 4, rank 16 of 54) are exact ties: neither model has an edge on JSON compliance, source faithfulness, or goal decomposition. (A sketch of the kind of JSON-compliance check involved follows this list.)
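
Structured output, in practice, means the reply must parse as JSON and match the requested shape. A minimal validator sketch, assuming a hypothetical schema (the field names below are invented for illustration):

```python
import json

# Minimal sketch of a JSON-compliance check: the output must parse and must
# contain the requested fields with the right types. The schema below is an
# illustrative assumption, not an actual benchmark test case.
REQUIRED = {"title": str, "tags": list, "confidence": float}

def is_compliant(raw: str) -> bool:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # e.g. markdown fences or trailing prose break parsing
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(k), t) for k, t in REQUIRED.items())

print(is_compliant('{"title": "Q3 report", "tags": ["finance"], "confidence": 0.9}'))  # True
print(is_compliant('Sure! Here is the JSON: {"title": "Q3 report"}'))                  # False
```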

External benchmarks (Epoch AI): GPT-4.1 Mini has third-party math scores in our data: 87.3% on MATH Level 5 (rank 9 of 14 models with scores) and 44.7% on AIME 2025 (rank 18 of 23 models with scores). No equivalent external benchmark scores are available for Devstral Medium. The MATH Level 5 score of 87.3% is below the field median of 94.15% among models with scores on this test, placing GPT-4.1 Mini in the lower half of the math-capable models tracked: solid, but not a standout strength.

Benchmark                  Devstral Medium   GPT-4.1 Mini
Faithfulness               4/5               4/5
Long Context               4/5               5/5
Multilingual               4/5               5/5
Tool Calling               3/5               4/5
Classification             4/5               3/5
Agentic Planning           4/5               4/5
Structured Output          4/5               4/5
Safety Calibration         1/5               2/5
Strategic Analysis         2/5               4/5
Persona Consistency        3/5               5/5
Constrained Rewriting      3/5               4/5
Creative Problem Solving   2/5               3/5
Summary                    1 win             8 wins

Pricing Analysis

Both models charge identical input prices at $0.40 per million tokens, so the cost difference is entirely on the output side: GPT-4.1 Mini at $1.60/M output tokens vs. Devstral Medium at $2.00/M, a 25% premium for Devstral Medium. At 1M output tokens/month, that's $1.60 vs. $2.00 (negligible). At 10M output tokens, it's $16 vs. $20, a $4/month gap. At 100M output tokens, GPT-4.1 Mini saves you $40/month. The cost difference only becomes meaningful at high throughput, but since GPT-4.1 Mini also wins on most benchmarks, there's no performance tradeoff to justify Devstral Medium's higher output cost for general workloads. GPT-4.1 Mini also has a dramatically larger context window (1,047,576 tokens vs. Devstral Medium's 131,072), which matters if your use case involves long documents or conversations and comes with no additional pricing complexity in our data.
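
The break-even arithmetic is easy to verify in a few lines. A minimal sketch using only the per-token rates quoted above; the volume tiers mirror the examples in the paragraph:

```python
# Monthly output-token cost at the quoted rates: $2.00/MTok for Devstral
# Medium vs. $1.60/MTok for GPT-4.1 Mini. Input cost is identical
# ($0.40/MTok) and cancels out of the comparison.
PRICES_PER_MTOK = {"devstral-medium": 2.00, "gpt-4.1-mini": 1.60}

def monthly_cost(model: str, output_tokens: int) -> float:
    return PRICES_PER_MTOK[model] * output_tokens / 1_000_000

for volume in (1_000_000, 10_000_000, 100_000_000):
    dev = monthly_cost("devstral-medium", volume)
    gpt = monthly_cost("gpt-4.1-mini", volume)
    print(f"{volume:>11,} output tokens: ${dev:,.2f} vs ${gpt:,.2f} "
          f"(GPT-4.1 Mini saves ${dev - gpt:,.2f}/month)")
# 1M: $2.00 vs $1.60; 10M: $20 vs $16; 100M: $200 vs $160 ($40 saved)
```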

Real-World Cost Comparison

Task             Devstral Medium   GPT-4.1 Mini
Chat response    $0.0011           <$0.001
Blog post        $0.0042           $0.0034
Document batch   $0.108            $0.088
Pipeline run     $1.08             $0.880

Bottom Line

Choose GPT-4.1 Mini if you need a strong general-purpose AI for strategic analysis, persona-consistent chatbots, agentic tool-calling workflows, multilingual output, or long-document processing. It wins 8 of 12 benchmarks in our testing and costs less on output tokens ($1.60/M vs. $2.00/M). Its 1M+ context window is a decisive advantage for applications that need to process large documents or maintain long conversation histories.

Choose Devstral Medium if your primary workload is classification and routing — it ties for 1st on that benchmark among 53 models in our testing, while GPT-4.1 Mini ranks 31st. If you're building a content tagger, a support ticket router, or any system where accurate categorization is the core job, Devstral Medium has a measurable edge there. Be aware, however, that its tool calling (rank 47 of 54) and persona consistency (rank 45 of 53) scores are near the bottom of the field, which limits its viability for agentic pipelines despite its positioning as an agentic model.
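
If routing is the job, the integration pattern is straightforward: constrain the model to a fixed label set and accept only answers from that set. A minimal sketch follows; call_model is a hypothetical stand-in for whichever client SDK you use, and the category names are invented for illustration:

```python
# Minimal ticket-router sketch. `call_model` is a hypothetical stand-in for
# your actual client SDK call; the label set is invented for illustration.
CATEGORIES = ["billing", "bug_report", "feature_request", "account_access"]

def route_ticket(ticket_text: str, call_model) -> str:
    prompt = (
        "Classify the support ticket into exactly one category.\n"
        f"Categories: {', '.join(CATEGORIES)}\n"
        f"Ticket: {ticket_text}\n"
        "Answer with the category name only."
    )
    answer = call_model(prompt).strip().lower()
    # Fall back to a human queue rather than guessing on a malformed reply.
    return answer if answer in CATEGORIES else "needs_human_review"

# Usage with a stubbed model call:
print(route_ticket("I was charged twice this month.", lambda p: "billing"))  # billing
```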

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
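
As a rough illustration of the pattern, here is a simplified sketch of 1–5 LLM-judge scoring; the rubric wording and the judge callable are placeholders, not the exact production harness:

```python
# Simplified sketch of 1-5 LLM-judge scoring. `judge` is a hypothetical
# stand-in for a call to the judging model; the rubric text is illustrative.
RUBRIC = (
    "Score the candidate answer from 1 (fails the task) to 5 (flawless), "
    "judging only the criterion: {criterion}.\n"
    "Task: {task}\nCandidate answer: {answer}\n"
    "Reply with a single digit, 1-5."
)

def score(task: str, answer: str, criterion: str, judge) -> int:
    reply = judge(RUBRIC.format(criterion=criterion, task=task, answer=answer))
    digits = [c for c in reply if c in "12345"]
    if not digits:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(digits[0])

# Usage with a stubbed judge:
print(score("Summarize the memo in 20 words.", "...", "faithfulness", lambda p: "4"))  # 4
```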

Frequently Asked Questions