meta-llama

Llama 4 Scout

Llama 4 Scout is a multimodal (text+image) model from meta-llama, available at $0.08 per million input tokens and $0.30 per million output tokens. It sits at the budget end of the model spectrum, designed for teams that need a capable, low-cost model for structured tasks and classification workflows. In our testing, Llama 4 Scout ranked 48th overall out of 52 models — it is not a top performer across the board, but it has real strengths in long-context retrieval and classification that make it viable for specific use cases. Compared to Llama 4 Maverick (avg score 3.36, $0.60/M output), Scout is cheaper but scores lower on average. Against competitors in the ultra-low-cost tier, it competes with Ministral 3 3B 2512 ($0.10/M output, avg 3.58) and Ministral 3 8B 2512 ($0.15/M output, avg 3.67) — both of which score higher on average despite lower prices.

Performance

In our 12-benchmark testing suite, Llama 4 Scout's three strongest areas are long-context retrieval, classification, and multilingual output. On long-context, it scored 5/5, tying with 36 other models for the top score. On classification, it scored 4/5, tied for 1st with 29 other models out of 53 tested. On multilingual, it scored 4/5 (rank 36 of 55, shared with 17 others). These are Scout's best use cases. The weaknesses are significant: agentic planning scored 2/5 (rank 53 of 54 — second to last among all tested models), strategic analysis scored 2/5 (rank 44 of 54), and persona consistency scored 3/5 (rank 45 of 53). Safety calibration scored 2/5 (rank 12 of 55), which is actually above-median in that category given the field's generally low scores. Overall, the model ranks 48 of 52, reflecting a profile suited to narrow retrieval and classification tasks rather than reasoning-intensive or agentic work.

Pricing

At $0.08 per million input tokens and $0.30 per million output tokens, Llama 4 Scout is one of the most affordable multimodal models in the tested set. For context: a team running 10 million output tokens per month would spend $3.00 with Llama 4 Scout — compared to $1.50 with Ministral 3 8B 2512 or $1.00 with Ministral 3 3B 2512. At 100 million output tokens per month, Scout costs $30 versus $15 or $10 for those options. The pricing is competitive for budget-conscious use cases, but teams strictly optimizing for cost per quality point may find better value in the Ministral line, which scores higher at lower prices. Scout's multimodal capability (accepting image inputs) is a differentiator at this price point — if image understanding is required, the pricing becomes more favorable relative to text-only alternatives.
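The per-token arithmetic above is easy to reproduce. A minimal sketch using the rates listed on this page; the constant and function names are illustrative, not from any SDK:

```python
# Back-of-envelope cost sketch at Llama 4 Scout's listed rates
# ($0.08 per million input tokens, $0.30 per million output tokens).
INPUT_PER_MTOK = 0.08
OUTPUT_PER_MTOK = 0.30

def monthly_cost(input_mtok: float, output_mtok: float) -> float:
    """USD cost for a month, with token volumes given in millions."""
    return input_mtok * INPUT_PER_MTOK + output_mtok * OUTPUT_PER_MTOK

print(round(monthly_cost(0, 10), 2))   # 3.0  (the 10M-output example above)
print(round(monthly_cost(0, 100), 2))  # 30.0 (the 100M-output example)
```

Swapping in the Ministral rates ($0.15 or $0.10 per million output tokens) reproduces the comparison figures quoted above.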


Overall
3.33/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
5/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
3/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.080/MTok

Output

$0.300/MTok

Context Window: 328K

modelpicker.net

Real-World Costs

Chat response: <$0.001
Blog post: <$0.001
Document batch: $0.017
Pipeline run: $0.166

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the standard
# openai client works with only a base_url change.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # replace with your OpenRouter API key
)

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout",
    messages=[
        {"role": "user", "content": "Hello, Llama 4 Scout!"}
    ],
)

print(response.choices[0].message.content)
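Because Scout accepts image inputs, the same endpoint can take a multimodal prompt. Below is a minimal sketch of the OpenAI-style content-parts payload that OpenRouter accepts for image-capable models; the helper name and image URL are placeholders, and you should verify the exact payload shape your provider supports:

```python
import json

# Sketch: one user turn combining a text question with an image reference,
# in the OpenAI-style content-parts format. Pass the result as `messages=`
# in a chat.completions.create call like the snippet above.
def build_vision_messages(question: str, image_url: str) -> list:
    """Assemble a single multimodal user message."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_vision_messages(
    "Summarize this chart in one sentence.",
    "https://example.com/chart.png",  # placeholder URL
)
print(json.dumps(messages, indent=2))
```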

Recommendation

Llama 4 Scout is a reasonable choice for teams with three specific needs: (1) long-document retrieval tasks at scale, where cost matters and retrieval accuracy at 30K+ tokens is the primary metric — its 5/5 long-context score holds up; (2) high-volume classification and routing pipelines, where 4/5 classification accuracy is sufficient and cost per call needs to be minimal; (3) multilingual processing jobs, where output quality must hold up in non-English languages and budget is constrained. Look elsewhere if your use case involves agentic workflows: its agentic planning score of 2/5, at rank 53 of 54, is a hard disqualifier. Similarly, avoid Scout for strategic analysis, persona-driven applications, or complex reasoning tasks. Teams that need multimodal (image input) capability at the lowest possible price point may find it viable, but should benchmark against Ministral 3 8B 2512 and Ministral 3 3B 2512, which score higher per dollar on text-only tasks.
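For the routing use case in point (2), a common pattern is to constrain the model's reply to a fixed label set and fall back to a default on anything else. A minimal sketch; the label names, prompt wording, and function names are illustrative, not from this review:

```python
# Sketch of a routing-pipeline guardrail for a classification model:
# ask for exactly one label, then normalize the raw completion and
# reject anything off-list. Labels here are illustrative examples.
ROUTES = {"billing", "technical", "account", "other"}

def routing_prompt(ticket: str) -> str:
    """Prompt asking for a single-word label, keeping parsing trivial."""
    return (
        "Classify this support ticket. Reply with exactly one word from: "
        + ", ".join(sorted(ROUTES)) + ".\n\nTicket: " + ticket
    )

def parse_route(model_output: str, fallback: str = "other") -> str:
    """Normalize a raw completion to one of the allowed labels."""
    label = model_output.strip().strip(".").lower()
    return label if label in ROUTES else fallback

print(parse_route("Billing."))         # billing
print(parse_route("Unclear request"))  # other (off-list reply, fallback)
```

Keeping the guardrail outside the model call means a 4/5 classifier's occasional off-format reply degrades to a safe default rather than breaking the pipeline.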

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.