The open-weight model ecosystem is genuinely impressive. Meta's Llama 4 family, Qwen's code-specialized variants, Mistral's latest releases, and DeepSeek's coding models have all shipped in the past twelve months — and several of them beat GPT-4-class models on standard coding benchmarks.
The caveat: “local” means different things depending on your hardware. Running a 70B-parameter model requires a serious GPU, while a 7B or 14B model runs comfortably on a consumer laptop.
Why run locally
- Privacy. Your code never leaves your machine. For proprietary codebases or anything under NDA, local inference eliminates the data-leakage vector.
- Cost. A one-time GPU purchase amortizes to effectively zero inference cost at high volume.
- Latency. No network round-trip. First-token latency on a 7B model with an M3 Max is under 50ms (the sketch after this list shows one way to measure it).
- Offline. Flights, hotel networks, air-gapped environments — your coding assistant works regardless.
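To sanity-check that latency claim on your own machine, here is a minimal sketch that times the first streamed token from a local Ollama server via its `/api/generate` streaming endpoint. The `qwen2.5-coder:7b` tag is just an example; substitute whatever model you have pulled.

```python
# Time-to-first-token against a local Ollama server. Assumes Ollama is
# running on its default port (11434) and that the model tag below has
# already been pulled; swap in whatever model you actually use.
import json
import time
import urllib.request

MODEL = "qwen2.5-coder:7b"  # example tag, not a recommendation

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": MODEL, "prompt": "def fib(n):", "stream": True}).encode(),
    headers={"Content-Type": "application/json"},
)

start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    for line in resp:  # streaming mode sends one JSON object per line
        chunk = json.loads(line)
        if chunk.get("response"):  # first non-empty token from the model
            print(f"first token after {(time.perf_counter() - start) * 1000:.0f} ms")
            break
```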
Hardware requirements
Model size is measured in parameters, and the practical limit is your GPU VRAM (or unified memory on Apple Silicon); the sketch after these tiers shows how to estimate a model's footprint:
- 7B–14B models · 8–16GB VRAM. MacBook Pro M2/M3/M4 with 16–24GB, RTX 3060 12GB, RTX 4070. Good for autocomplete and simple refactors.
- 32B models · 20–32GB VRAM. RTX 3090/4090, Mac Studio M2/M3 Ultra. Noticeably better reasoning. Handles multi-file changes.
- 70B+ models · 48GB+ VRAM. RTX 6000 Ada, A100/H100, Mac Studio M3 Ultra 192GB. Near-frontier quality.
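The tiers above follow from simple arithmetic: quantized weights take (bits / 8) bytes per parameter, so billions of parameters map directly to gigabytes, plus headroom for the KV cache and runtime buffers. The sketch below encodes that rule of thumb; the 1.2x overhead factor is my assumption, not a vendor figure.

```python
# Back-of-envelope VRAM estimate: weights take (quant_bits / 8) bytes per
# parameter, so billions of params map directly to GB; the 1.2x factor
# covering KV cache and runtime buffers is an assumption, not a spec.
def approx_vram_gb(params_billion: float, quant_bits: int = 4, overhead: float = 1.2) -> float:
    weights_gb = params_billion * quant_bits / 8  # 7B at 4-bit ≈ 3.5 GB of weights
    return weights_gb * overhead

for size in (7, 14, 32, 70):
    print(f"{size}B at 4-bit: ~{approx_vram_gb(size):.1f} GB")
# Prints ~4.2, ~8.4, ~19.2, ~42.0 GB, roughly matching the tiers above.
```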
| Model | Provider | Avg | Code | Context |
|---|---|---|---|---|
| R1 0528 | DeepSeek | 4.50 | — | 164K |
| Gemini 3 Flash Preview | Google | 4.50 | — | 1M |
| Qwen: Qwen3.6 Plus | Qwen | 4.50 | — | 1M |
| Gemini 3.1 Flash Lite Preview | Google | 4.42 | — | 1M |
| Gemma 4 31B | Google | 4.42 | — | 262K |
| Gemini 3.1 Pro Preview | Google | 4.33 | — | 1M |
| Qwen: Qwen3.5-9B | Qwen | 4.27 | — | 262K |
| Gemini 2.5 Pro | Google | 4.25 | — | 1M |
| DeepSeek V3.2 | DeepSeek | 4.25 | — | 131K |
| Gemma 4 26B A4B | Google | 4.25 | — | 262K |
| Mistral Medium 3.1 | Mistral | 4.25 | — | 131K |
| Qwen: Qwen3.5-35B-A3B | Qwen | 4.20 | — | 262K |
Our pick for local coding
The right answer depends on your hardware (a minimal sketch of querying whichever model you pick follows this list):
- Consumer laptop (16GB RAM, no discrete GPU): Qwen 2.5 Coder 7B Instruct. Runs in 4-bit quantization on Ollama and generates tokens fast enough to feel responsive.
- Mac Studio / Pro with 32–64GB unified memory: DeepSeek-Coder-V2 or Qwen 2.5 Coder 32B. Dedicated code training and larger context windows make a real difference for multi-file edits.
- RTX 4090 or dual-GPU rig (48GB+ VRAM): Llama 4 Scout or DeepSeek-V3 at 4-bit. Near-frontier quality, genuinely useful for agentic coding tasks.
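Whichever pick matches your hardware, querying it programmatically is the same few lines. Here is a minimal sketch against Ollama's `/api/chat` endpoint; the model tag is an example, so swap in your own.

```python
# One-shot coding query against Ollama's /api/chat endpoint, which
# returns the assistant message as JSON when streaming is disabled.
import json
import urllib.request

def ask_local(prompt: str, model: str = "qwen2.5-coder:7b") -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps({
            "model": model,  # swap in the pick that matches your hardware tier
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

print(ask_local("Explain what this does: [x * 2 for x in range(10)]"))
```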
A note on tooling: Ollama is the fastest way to get started on Mac/Linux. LM Studio adds a nicer UI. For editor integration, Continue.dev (VS Code) and Cursor's local model support are the most polished options.
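One reason editor integrations slot in so easily: Ollama also serves an OpenAI-compatible API at `/v1`, so existing OpenAI clients (and plugins built on them) work unchanged. A minimal sketch, assuming the `openai` package is installed and the example model tag below has been pulled:

```python
# Ollama exposes an OpenAI-compatible endpoint at /v1, so the standard
# OpenAI client works against a local model with only a base_url change.
# The api_key is required by the client but ignored by the local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="qwen2.5-coder:7b",  # example tag; any pulled model works
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(reply.choices[0].message.content)
```

Pointing an editor integration at that same local endpoint is the whole trick: the workflow looks like a hosted API, but your code still never leaves the machine.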