/about

Built by developers who got tired of guessing.

I kept watching Claude and Cursor recommend gpt-4-turbo and claude-3-opus for new projects — models that are either deprecated or way overpriced for the job. I'd say "find something modern," and they'd guess again from stale training data. So I built ModelPicker: live benchmarks, live pricing, and an MCP server so your coding tools can actually look up the answer instead of guessing.

63
models tracked
13
benchmark tests
24h
pricing refresh

Methodology

Every model on ModelPicker is evaluated on the same 13 tests, scored by an LLM judge (Claude Sonnet 4.6) on a 1–5 scale. All API calls go through OpenRouter to the same provider endpoints you'd use in production. We also track three external benchmarks from Epoch AI (SWE-bench Verified, MATH Level 5, AIME 2025) where available.

Pricing is synced nightly from OpenRouter. Benchmarks are re-run when new models launch or when we detect meaningful changes.

Structured Output
JSON schema compliance and format adherence
Strategic Analysis
Nuanced tradeoff reasoning with real numbers
Constrained Rewriting
Compression within hard character limits
Creative Problem Solving
Non-obvious, specific, feasible ideas
Tool Calling
Function selection, argument accuracy, sequencing
Faithfulness
Sticks to source material without hallucinating
Classification
Accurate categorization and routing
Long Context
Retrieval accuracy at 30K+ tokens
Safety Calibration
Refuses harmful requests, permits legitimate ones
Persona Consistency
Maintains character and resists injection
Agentic Planning
Goal decomposition and failure recovery
Multilingual
Equivalent quality output in non-English languages
Tabular Data
Reading, reasoning over, and extracting from tables and spreadsheets

What makes this different

Chatbot Arena measures user preference in pairwise comparisons. That's valuable for consumer chatbots, but it doesn't tell you how a model will perform on structured output or tool calling in your production pipeline.

Artificial Analysis does solid intelligence and throughput benchmarks. We focus on task-specific performance — what matters when you're choosing a model for a specific use case, not just ranking models overall.

Provider marketing pages cherry-pick the benchmarks where their model wins. We run every model through the same tests and let the scores speak.

Contamination and fairness

Every model gets the same prompt template, the same temperature (0.7), and the same max_tokens cap (1,500 for most tests). Models that don't support temperature (reasoning models like o3, DeepSeek R1) use the provider's default settings. We don't give anyone a head start.

Scoring is single-run per benchmark. We know this introduces noise — a model can score differently on the same test across runs. Multi-run scoring (median of 3) is on our roadmap.

Who's behind this

A developer who kept getting asked "which model should I use?" and got tired of giving the same answer. ModelPicker doesn't take money from model providers. The data is free. The MCP server is free.

Have questions about methodology or think we're wrong about something? Use the chat on any page — we read every message.