meta-llama
Llama 4 Scout
Llama 4 Scout is a multimodal (text+image) model from meta-llama, priced at $0.08 per million input tokens and $0.30 per million output tokens. It sits at the budget end of the model spectrum, designed for teams that need a capable, low-cost model for structured tasks and classification workflows. In our testing, Llama 4 Scout ranked 48th overall out of 52 models — not a top performer across the board, but it has real strengths in long-context retrieval and classification that make it viable for specific use cases. Compared to Llama 4 Maverick (avg score 3.36, $0.60/M output), Scout is cheaper but scores lower on average. In the ultra-low-cost tier, its closest competitors are Ministral 3 3B 2512 ($0.10/M output, avg 3.58) and Ministral 3 8B 2512 ($0.15/M output, avg 3.67), both of which score higher on average despite lower prices.
Performance
In our 12-benchmark testing suite, Llama 4 Scout's three strongest areas are long-context retrieval, classification, and multilingual output. On long-context, it scored 5/5, tying with 36 other models for the top score. On classification, it scored 4/5, tied for 1st with 29 other models out of 53 tested. On multilingual, it scored 4/5 (rank 36 of 55, shared with 17 others). These are Scout's best use cases. The weaknesses are significant: agentic planning scored 2/5 (rank 53 of 54 — second to last among all tested models), strategic analysis scored 2/5 (rank 44 of 54), and persona consistency scored 3/5 (rank 45 of 53). Safety calibration scored 2/5 (rank 12 of 55), which is actually above-median in that category given the field's generally low scores. Overall, the model ranks 48 of 52, reflecting a profile suited to narrow retrieval and classification tasks rather than reasoning-intensive or agentic work.
Pricing
At $0.08 per million input tokens and $0.30 per million output tokens, Llama 4 Scout is one of the most affordable multimodal models in the tested set. For context: a team running 10 million output tokens per month would spend $3.00 with Llama 4 Scout — compared to $1.50 with Ministral 3 8B 2512 or $1.00 with Ministral 3 3B 2512. At 100 million output tokens per month, Scout costs $30 versus $15 or $10 for those options. The pricing is competitive for budget-conscious use cases, but teams strictly optimizing for cost per quality point may find better value in the Ministral line, which scores higher at lower prices. Scout's multimodal capability (accepting image inputs) is a differentiator at this price point — if image understanding is required, the pricing becomes more favorable relative to text-only alternatives.
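The cost comparison above can be sketched as a quick helper. The prices are the per-million output-token rates quoted on this page; the function and dictionary names are illustrative, not an official API:

```python
def monthly_output_cost(price_per_mtok: float, output_tokens: int) -> float:
    """Monthly spend in dollars for a given monthly output-token volume."""
    return price_per_mtok * output_tokens / 1_000_000

OUTPUT_PRICES = {                      # $ per million output tokens
    "meta-llama/llama-4-scout": 0.30,
    "ministral-3-8b-2512": 0.15,
    "ministral-3-3b-2512": 0.10,
}

for tokens in (10_000_000, 100_000_000):
    for model, price in OUTPUT_PRICES.items():
        print(f"{tokens // 1_000_000}M tok/mo  {model}: "
              f"${monthly_output_cost(price, tokens):.2f}")
```

Note that input-token costs are excluded here; for retrieval-heavy workloads with large prompts, Scout's $0.08/M input rate dominates the bill and should be modeled separately.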
Pricing vs Performance
[Chart: output cost per million tokens (log scale) vs average score across our 12 internal benchmarks]
Try It
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout",
    messages=[
        {"role": "user", "content": "Hello, Llama 4 Scout!"}
    ],
)

print(response.choices[0].message.content)

Recommendation
Llama 4 Scout is a reasonable choice for teams with three specific needs: (1) long-document retrieval tasks at scale where cost matters and retrieval accuracy at 30K+ tokens is the primary metric — its 5/5 long-context score holds up; (2) high-volume classification and routing pipelines where 4/5 classification accuracy is sufficient and cost per call needs to be minimal; (3) multilingual processing jobs where equivalent-quality output in non-English languages matters and budget is constrained. Look elsewhere if your use case involves agentic workflows — its agentic planning score of 2/5 at rank 53 of 54 is a hard disqualifier. Similarly, avoid Scout for strategic analysis, persona-driven applications, or complex reasoning tasks. Teams that need multimodal (image input) capabilities at the lowest possible price point may find it viable, but should benchmark against Ministral 3 8B 2512 and Ministral 3 3B 2512, which score higher per dollar on text-only tasks.
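For the classification-and-routing use case, a pipeline on Scout can be sketched against the same OpenRouter endpoint shown in Try It. The label set, system prompt, and helper names below are hypothetical examples, not part of our benchmark:

```python
LABELS = ["billing", "technical_support", "sales", "other"]

SYSTEM_PROMPT = (
    "Classify the support ticket into exactly one label: "
    + ", ".join(LABELS)
    + ". Reply with the label only."
)

def normalize_label(raw: str) -> str:
    """Map the model's raw reply onto the allowed label set."""
    label = raw.strip().strip(".").lower()
    return label if label in LABELS else "other"   # fall back on unexpected output

def route_ticket(text: str) -> str:
    from openai import OpenAI  # same client setup as the Try It snippet
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_KEY",
    )
    response = client.chat.completions.create(
        model="meta-llama/llama-4-scout",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0,   # deterministic replies for routing
        max_tokens=8,    # labels are short, which keeps per-call cost minimal
    )
    return normalize_label(response.choices[0].message.content)
```

Constraining the reply to a short label and validating it client-side is what makes a 4/5-accuracy budget model workable in a high-volume router: malformed replies degrade to a safe default rather than breaking the pipeline.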
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
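The 1–5 judging step can be sketched as follows. The prompt wording and the score extraction are assumptions for illustration, not our production harness:

```python
import re

JUDGE_PROMPT = (
    "Score the candidate answer from 1 (poor) to 5 (excellent) against "
    "the reference. Reply in the form 'Score: N'."
)

def parse_score(judge_reply: str) -> int:
    """Extract a 1-5 integer score from the judge's free-text reply."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    if match is None:
        raise ValueError(f"no 1-5 score found in: {judge_reply!r}")
    return int(match.group(1))
```

A category score like "4/5" is then the judged score aggregated over that benchmark's test cases.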