How We Test
Why Another Benchmark?
Public benchmarks like MMLU and HumanEval test academic capabilities. We test what developers actually use LLMs for: generating JSON, calling tools, compressing text to a character limit, maintaining persona across turns, and recovering from failed plans. Our benchmarks use real-world prompts, not synthetic ones.
12 Test Categories
- Structured Output: JSON schema compliance and format adherence
- Strategic Analysis: Nuanced tradeoff reasoning with real numbers
- Constrained Rewriting: Compression within hard character limits
- Creative Problem Solving: Non-obvious, specific, feasible ideas
- Tool Calling: Function selection, argument accuracy, sequencing
- Faithfulness: Sticks to source material without hallucinating
- Classification: Accurate categorization and routing
- Long Context: Retrieval accuracy at 30K+ tokens
- Safety Calibration: Refuses harmful requests, permits legitimate ones
- Persona Consistency: Maintains character and resists injection
- Agentic Planning: Goal decomposition and failure recovery
- Multilingual: Equivalent quality output in non-English languages
Scoring
Each model response is scored 1-3 by an LLM judge (Claude Sonnet 4.6):
- 1 (Weak): Didn't follow instructions; output is thin or unusable
- 2 (Usable): Followed the structure; reasonable output that still needs human editing
- 3 (Strong): Complete, insightful, hand-to-a-stakeholder quality
Overall grade: avg 1.0-1.6 = Weak, 1.7-2.3 = Usable, 2.4-3.0 = Strong.
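The thresholds above can be sketched in a few lines. This is a minimal illustration of the grade mapping, not the benchmark's actual code; the function name and the one-decimal rounding are assumptions.

```python
def overall_grade(scores: list[int]) -> str:
    """Average per-task judge scores (each 1-3) and bucket into a grade.

    Rounds to one decimal first so averages land cleanly in the
    published 1.0-1.6 / 1.7-2.3 / 2.4-3.0 bands.
    """
    avg = round(sum(scores) / len(scores), 1)
    if avg <= 1.6:
        return "Weak"
    if avg <= 2.3:
        return "Usable"
    return "Strong"
```

For example, scores of 2, 3, 3 average to 2.7 and grade as Strong, while 1, 2, 2 average to 1.7 and grade as Usable.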
Freshness
Pricing is refreshed nightly from OpenRouter and LiteLLM APIs. New models are automatically benchmarked when detected. Quirks (parameter requirements, JSON mode support, tool calling) are probed automatically before benchmarking.
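A nightly pricing refresh along these lines can be sketched against OpenRouter's public models endpoint (`https://openrouter.ai/api/v1/models`), which returns a `data` array of models with per-token `pricing` fields. The function names here are illustrative, and the real pipeline also merges the LiteLLM registry.

```python
import json
import urllib.request


def fetch_models(url: str = "https://openrouter.ai/api/v1/models") -> dict:
    """Fetch the OpenRouter model catalog as parsed JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def parse_pricing(payload: dict) -> dict:
    """Map model id -> (prompt, completion) USD-per-token prices.

    Skips entries without a pricing block; prices arrive as strings.
    """
    return {
        m["id"]: (float(m["pricing"]["prompt"]), float(m["pricing"]["completion"]))
        for m in payload.get("data", [])
        if "pricing" in m
    }
```

Comparing two consecutive nightly snapshots of `parse_pricing(fetch_models())` is enough to detect both price changes and newly listed models to queue for benchmarking.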
Data Sources
- Pricing: OpenRouter API + LiteLLM model registry
- Benchmarks: BYOM engine via OpenRouter
- Quirk detection: Automated probe suite (9 tests per model)
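One probe from such a suite might look like the sketch below: a JSON-mode check that sends a request with `response_format` and verifies the reply parses. `call_model` is a hypothetical stand-in for the real chat-completion client, injected so the probe stays transport-agnostic; the other eight probes would follow the same pattern.

```python
import json


def probe_json_mode(call_model) -> bool:
    """Return True if the endpoint accepts response_format=json_object
    and the reply is parseable JSON; False on error or invalid output."""
    try:
        text = call_model(
            messages=[{"role": "user", "content": 'Return {"ok": true} as JSON.'}],
            response_format={"type": "json_object"},
        )
        json.loads(text)
        return True
    except Exception:
        # Covers both rejected parameters (API error) and malformed output.
        return False
```

Running each probe once per model and caching the boolean results is what lets the benchmark harness pick compatible request parameters before any scored run.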