R1 0528 vs Llama 4 Scout

R1 0528 is the better choice for highest-quality assistant behavior and tool-driven workflows: it wins 9 of our 12 benchmarks and ties on the other 3. Llama 4 Scout is the pragmatic choice when cost or multimodal input matters: its output costs ~$0.30/MTok vs R1's $2.15/MTok, and it offers a larger 327,680-token context window.

deepseek

R1 0528

Overall
4.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
4/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
96.6%
AIME 2025
66.4%

Pricing

Input

$0.500/MTok

Output

$2.15/MTok

Context Window

164K


meta-llama

Llama 4 Scout

Overall
3.33/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
5/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
3/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.080/MTok

Output

$0.300/MTok

Context Window

328K


Benchmark Analysis

Overview: In our 12-test suite, R1 0528 wins 9 categories, Llama 4 Scout wins 0, and they tie on 3 (structured output, classification, long context).

Detailed walk-through:

- Tool calling: R1 5 vs Scout 4. R1 is tied for 1st of 54 models (with 16 others); Scout ranks 18th of 54. R1 is more reliable at selecting functions, filling arguments, and sequencing calls.
- Agentic planning: R1 5 vs Scout 2. R1 is tied for 1st of 54; Scout ranks 53rd of 54. R1 is far better at goal decomposition and recovery.
- Persona consistency: R1 5 vs Scout 3. R1 is tied for 1st of 53 (36 others share the top score); Scout ranks 45th of 53. R1 maintains character and resists injection much better in our tests.
- Faithfulness: R1 5 vs Scout 4. R1 is tied for 1st of 55; Scout sits at 34th of 55. R1 sticks to source material more reliably.
- Safety calibration: R1 4 vs Scout 2. R1 ranks 6th of 55; Scout ranks 12th. R1 is better at refusing harmful prompts while still permitting legitimate ones.
- Constrained rewriting: R1 4 vs Scout 3 (R1 ranks 6th of 53; Scout 31st). R1 handles tight length and compression constraints more accurately.
- Creative problem solving: R1 4 vs Scout 3 (R1 ranks 9th; Scout 30th). R1 produces more feasible, non-obvious ideas.
- Strategic analysis: R1 4 vs Scout 2 (R1 ranks 27th; Scout 44th). R1 is better at nuanced tradeoff analysis backed by numbers.
- Multilingual: R1 5 vs Scout 4 (R1 tied for 1st; Scout ranks 36th). R1 shows higher parity across languages.

Ties: structured output 4 vs 4 (both rank 26th), classification 4 vs 4 (both tied for 1st), and long context 5 vs 5 (both tied for 1st).

Important caveat: R1's quirks include returning empty responses on structured-output tasks and consuming reasoning tokens on short ones. Despite the numerical tie on structured output, R1 may need special prompt settings: the payload flags it as empty_on_structured_output, and it needs a high max-completion-token budget. A hedged workaround sketch follows the table below.

External math benchmarks: R1 scores 96.6% on MATH Level 5 (Epoch AI) and 66.4% on AIME 2025 (Epoch AI); Llama 4 Scout has no MATH/AIME scores in the payload.

Context windows and modality: R1's context is 163,840 tokens (text to text); Scout's is 327,680 and supports text+image to text, making Scout the better fit for extremely long or multimodal inputs.

Benchmark | R1 0528 | Llama 4 Scout
Faithfulness | 5/5 | 4/5
Long Context | 5/5 | 5/5
Multilingual | 5/5 | 4/5
Tool Calling | 5/5 | 4/5
Classification | 4/5 | 4/5
Agentic Planning | 5/5 | 2/5
Structured Output | 4/5 | 4/5
Safety Calibration | 4/5 | 2/5
Strategic Analysis | 4/5 | 2/5
Persona Consistency | 5/5 | 3/5
Constrained Rewriting | 4/5 | 3/5
Creative Problem Solving | 4/5 | 3/5
Summary | 9 wins | 0 wins
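
A minimal sketch of the structured-output workaround mentioned above, assuming an OpenAI-compatible endpoint for R1 0528. The base_url, model id, and response_format support are assumptions for illustration, not facts from the payload; the substance is the high token budget (reasoning tokens count against the completion limit) and the guard against empty bodies.

```python
# Hedged sketch: handling R1's empty_on_structured_output quirk.
# Endpoint, model id, and response_format support are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # hypothetical endpoint

def structured_call(prompt: str, retries: int = 2) -> dict:
    """Ask for JSON, budget generously for reasoning tokens, retry on empty output."""
    for _ in range(retries + 1):
        resp = client.chat.completions.create(
            model="deepseek-reasoner",                      # placeholder model id
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},        # only if your endpoint supports it
            max_tokens=8192,                                # high ceiling: reasoning tokens count here
        )
        content = resp.choices[0].message.content
        if content and content.strip():                     # guard against empty responses
            return json.loads(content)
    raise RuntimeError("empty structured output after retries")
```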

Pricing Analysis

Per the payload rates (prices are per MTok, i.e., per million tokens): R1 0528 is $0.50 input / $2.15 output; Llama 4 Scout is $0.08 input / $0.30 output. Output-only, that works out to $2.15 vs $0.30 per 1M tokens; for 10M output tokens, R1 costs $21.50 vs Scout's $3.00, and for 100M, $215 vs $30. Assuming a 50/50 split of input and output tokens, 1M total tokens run ≈ $1.33 on R1 (0.5M input + 0.5M output) and ≈ $0.19 on Scout. The priceRatio in the payload is ~7.17× (the output-rate ratio, $2.15 / $0.30), so at high volume (10M–100M tokens/month) Scout saves tens to hundreds of dollars per month, and the gap scales linearly from there. Teams with heavy production traffic, consumer apps, or tight budgets should care; research or high-stakes workflows that require R1's superior tool calling, faithfulness, and agentic planning may justify the higher cost. The sketch below makes the arithmetic concrete.
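
A minimal cost-model sketch using only the payload rates above; the 50M/50M traffic split is an illustrative assumption, not a measured workload.

```python
# Rates from the payload, in dollars per million tokens (MTok).
RATES = {
    "R1 0528":       {"input": 0.50, "output": 2.15},
    "Llama 4 Scout": {"input": 0.08, "output": 0.30},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month's traffic, given token volumes in millions."""
    r = RATES[model]
    return input_mtok * r["input"] + output_mtok * r["output"]

# Example: 50M input + 50M output tokens per month (the 50/50 split above).
for name in RATES:
    print(f"{name}: ${monthly_cost(name, 50, 50):,.2f}")
# R1 0528: $132.50
# Llama 4 Scout: $19.00   (roughly 7x cheaper)
```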

Real-World Cost Comparison

Task | R1 0528 | Llama 4 Scout
Chat response | $0.0012 | <$0.001
Blog post | $0.0046 | <$0.001
Document batch | $0.117 | $0.017
Pipeline run | $1.18 | $0.166

Bottom Line

Choose R1 0528 if you need best-in-class tool calling, agentic planning, persona consistency, faithfulness, or higher math performance, and you can absorb the ~7.17× higher output-token cost. Specific use cases: production agent orchestration, high-stakes assistant tasks, and multilingual or faithfulness-critical responses. Choose Llama 4 Scout if budget, multimodal inputs, or very large context windows matter: it costs $0.30/MTok for output vs R1's $2.15/MTok, supports text+image inputs, and is the cost-efficient option for high-volume or document-heavy apps.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
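
For readers who want to reproduce something similar, here is an illustrative sketch of a 1–5 LLM-judge scoring call. This is not our actual harness; the judge model id, rubric wording, and OpenAI-compatible client are all assumptions.

```python
# Illustrative only: a generic 1-5 LLM-judge scoring call, not our production setup.
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible judge endpoint

RUBRIC = (
    "You are grading a model's answer to a benchmark task. "
    "Score it from 1 (fails the task) to 5 (flawless). "
    "Reply with a single digit and nothing else."
)

def judge_score(task: str, answer: str, judge_model: str = "gpt-4o") -> int:
    """Return the judge's 1-5 score for one task/answer pair."""
    resp = client.chat.completions.create(
        model=judge_model,  # placeholder judge model id
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Task:\n{task}\n\nAnswer:\n{answer}"},
        ],
        max_tokens=1,
    )
    return int(resp.choices[0].message.content.strip())
```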

Frequently Asked Questions