mistral

Mistral Small 4

Mistral Small 4 (mistralai/mistral-small-2603) is the next major release in Mistral's Small family, described as unifying capabilities from several flagship Mistral models. It targets developers and product teams who need strong schema compliance, consistent persona behavior, and multilingual quality in a compact, high-context model (262,144-token window, text+image→text). In our testing it sits among high-capability small-family models and competes with bracket peers such as Claude Sonnet 4.6 and GPT-5.4, offering a different tradeoff: very strong structured and multilingual outputs at a substantially lower listed output cost ($0.60 per mTok) than some high-end peers.

Performance

In our 12-test suite, Mistral Small 4 shows clear strengths and some weaknesses:

  • Structured output — 5/5, tied for 1st (with 24 other models of 54 tested): excellent JSON/schema compliance.
  • Multilingual — 5/5, tied for 1st (with 34 other models of 55): a top choice for non-English parity.
  • Persona consistency — 5/5, tied for 1st (with 36 other models of 53): useful for character-driven or agentic interfaces.
  • Solid areas: creative problem solving 4/5 (rank 9 of 54) and tool calling 4/5 (rank 18 of 54).
  • Weaknesses: classification 2/5 (rank 51 of 53, near the bottom) and safety calibration 2/5 (rank listed as 12 of 55, with many models sharing that score). Both indicate the model underperforms on categorical routing and scores low on safety calibration in our tests.
  • Long context — 4/5 (rank 38 of 55): the context window is huge (262,144 tokens), but retrieval accuracy at extreme lengths is solid rather than uniquely top-ranked.

Overall, Mistral Small 4 ranks 35 of 52 in our dataset.

Pricing

Mistral Small 4 is priced at $0.15 per mTok (million tokens) for input and $0.60 per mTok for output, per the model entry. What that means in practice, with examples expressed in mTok units:

  • Small prompt + short reply (5 mTok input, 10 mTok output): input $0.75 + output $6.00 = $6.75 total.
  • Medium conversation (20 mTok input, 50 mTok output): input $3.00 + output $30.00 = $33.00 total.
  • Very large generation (100 mTok input, 200 mTok output): input $15.00 + output $120.00 = $135.00 total.

Compared with bracket peers, Small 4's listed output cost ($0.60) is far lower than high-priced peers like Claude Opus 4.6 ($25) or Claude Sonnet 4.6 ($15), but higher than many lower-cost models (for example, Gemma 4 31B at $0.38 output). Use these sample calculations to estimate monthly spend from your own mTok volumes.
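The per-request arithmetic above generalizes to a one-line estimator. A minimal sketch — the rates are the listed prices, and the helper name is ours, not part of any API:

```python
def estimate_cost(input_mtok: float, output_mtok: float,
                  input_rate: float = 0.15, output_rate: float = 0.60) -> float:
    """Estimate USD cost from token volumes given in millions (mTok)."""
    return input_mtok * input_rate + output_mtok * output_rate

# Reproduces the three worked examples above: $6.75, $33.00, $135.00.
for in_m, out_m in [(5, 10), (20, 50), (100, 200)]:
    print(f"{in_m} mTok in, {out_m} mTok out -> ${estimate_cost(in_m, out_m):.2f}")
```

Swap in your own monthly mTok volumes to project spend before committing to a provider.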

Overall
3.83/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 2/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 4/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.150/MTok
Output: $0.600/MTok

Context Window: 262K

modelpicker.net

Real-World Costs

Chat response: <$0.001
Blog post: $0.0013
Document batch: $0.033
Pipeline run: $0.330

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the standard
# OpenAI SDK works with only a base_url change.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # replace with your OpenRouter API key
)

response = client.chat.completions.create(
    model="mistralai/mistral-small-2603",
    messages=[
        {"role": "user", "content": "Hello, Mistral Small 4!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

Use Mistral Small 4 if you need:

  • Schema-first production flows — structured output 5/5 supports reliable JSON and strict formats (e.g., data extraction, API response generation).
  • Multilingual product UIs or localized content pipelines — multilingual 5/5 means parity across languages in our testing.
  • Persona-driven assistants and role-based prompts — persona consistency 5/5 helps maintain voice and resist injection.

Avoid Small 4 when:

  • You rely on high-accuracy classification and routing — classification 2/5 (rank 51/53) suggests poor performance for intent classification or safety-critical routing.
  • Your app requires the highest overall benchmark averages — our overall rank is 35/52; for maximum aggregate performance, consider bracket peers such as Claude Sonnet 4.6 and GPT-5.2, which have higher average scores in our data.

Practical use cases we recommend: generating validated JSON payloads from unstructured text, multilingual content transformation, and persona-based content generation. Not recommended for high-stakes safety filtering or core classifier components without an additional classification layer.
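For schema-first flows, it is worth validating the model's JSON before it enters your pipeline, even with a 5/5 structured-output score. A minimal sketch: the invoice fields and the `validate_invoice` helper are hypothetical illustrations of ours, not part of any Mistral or OpenRouter API:

```python
import json

# Hypothetical schema for an invoice-extraction task; field names
# are illustrative only.
REQUIRED_FIELDS = {"invoice_id": str, "total": (int, float), "currency": str}

def validate_invoice(raw: str) -> dict:
    """Parse model output and check it against the expected schema.

    Raises ValueError on any deviation, so malformed generations
    never reach the downstream pipeline.
    """
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data

# A well-formed generation passes; anything else raises.
invoice = validate_invoice(
    '{"invoice_id": "INV-7", "total": 19.99, "currency": "EUR"}'
)
print(invoice["total"])  # 19.99
```

Wiring this between the API response and your database gives you a cheap guardrail regardless of which model produced the JSON.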

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions