How We Test
Why Another Benchmark?
Public benchmarks like MMLU and HumanEval test academic capabilities. We test what developers actually use LLMs for: generating JSON, calling tools, compressing text to a character limit, maintaining persona across turns, and recovering from failed plans. Our benchmarks use real-world prompts, not synthetic ones.
12 Test Categories
- Structured Output: JSON schema compliance and format adherence
- Strategic Analysis: Nuanced tradeoff reasoning with real numbers
- Constrained Rewriting: Compression within hard character limits
- Creative Problem Solving: Non-obvious, specific, feasible ideas
- Tool Calling: Function selection, argument accuracy, sequencing
- Faithfulness: Sticks to source material without hallucinating
- Classification: Accurate categorization and routing
- Long Context: Retrieval accuracy at 30K+ tokens
- Safety Calibration: Refuses harmful requests, permits legitimate ones
- Persona Consistency: Maintains character and resists injection
- Agentic Planning: Goal decomposition and failure recovery
- Multilingual: Equivalent quality output in non-English languages
Scoring
Each model response is scored 1-3 by an LLM judge (Claude Sonnet 4.6):
- 1 (Weak): Didn't follow instructions; output is thin or unusable
- 2 (Usable): Followed the structure; reasonable output that still needs human editing
- 3 (Strong): Complete, insightful, hand-to-a-stakeholder quality
Overall grade: avg 1.0-1.6 = Weak, 1.7-2.3 = Usable, 2.4-3.0 = Strong.
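The thresholds above can be sketched in a few lines. This is a minimal illustration of the grade mapping, not the benchmark's actual code; the function name and the one-decimal rounding are assumptions.

```python
def overall_grade(scores: list[int]) -> str:
    """Average per-task judge scores (each 1-3) and bucket into a grade.

    Rounds to one decimal first so averages land cleanly in the
    published 1.0-1.6 / 1.7-2.3 / 2.4-3.0 bands.
    """
    avg = round(sum(scores) / len(scores), 1)
    if avg <= 1.6:
        return "Weak"
    if avg <= 2.3:
        return "Usable"
    return "Strong"
```

For example, scores of 2, 3, 3 average to 2.7 and grade as Strong, while 1, 2, 2 average to 1.7 and grade as Usable.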
Freshness
Pricing is refreshed nightly from OpenRouter and LiteLLM APIs. New models are automatically benchmarked when detected. Quirks (parameter requirements, JSON mode support, tool calling) are probed automatically before benchmarking.
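A nightly pricing refresh along these lines can be sketched against OpenRouter's public models endpoint (`https://openrouter.ai/api/v1/models`), which returns a `data` array of models with per-token `pricing` fields. The function names here are illustrative, and the real pipeline also merges the LiteLLM registry.

```python
import json
import urllib.request


def fetch_models(url: str = "https://openrouter.ai/api/v1/models") -> dict:
    """Fetch the OpenRouter model catalog as parsed JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def parse_pricing(payload: dict) -> dict:
    """Map model id -> (prompt, completion) USD-per-token prices.

    Skips entries without a pricing block; prices arrive as strings.
    """
    return {
        m["id"]: (float(m["pricing"]["prompt"]), float(m["pricing"]["completion"]))
        for m in payload.get("data", [])
        if "pricing" in m
    }
```

Comparing two consecutive nightly snapshots of `parse_pricing(fetch_models())` is enough to detect both price changes and newly listed models to queue for benchmarking.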
Data Sources
- Pricing: OpenRouter API + LiteLLM model registry
- Benchmarks: BYOM engine via OpenRouter
- Quirk detection: Automated probe suite (9 tests per model)
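One probe from such a suite might look like the sketch below: a JSON-mode check that sends a request with `response_format` and verifies the reply parses. `call_model` is a hypothetical stand-in for the real chat-completion client, injected so the probe stays transport-agnostic; the other eight probes would follow the same pattern.

```python
import json


def probe_json_mode(call_model) -> bool:
    """Return True if the endpoint accepts response_format=json_object
    and the reply is parseable JSON; False on error or invalid output."""
    try:
        text = call_model(
            messages=[{"role": "user", "content": 'Return {"ok": true} as JSON.'}],
            response_format={"type": "json_object"},
        )
        json.loads(text)
        return True
    except Exception:
        # Covers both rejected parameters (API error) and malformed output.
        return False
```

Running each probe once per model and caching the boolean results is what lets the benchmark harness pick compatible request parameters before any scored run.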