
Best local LLMs for coding

Open-weight models have closed the gap with frontier APIs for everyday coding work. If you have a decent GPU — or even just 16GB of RAM — you can run a model that outperforms last year's best commercial offering.

The open-weight model ecosystem is genuinely impressive. Meta's Llama 4 family, Qwen's code-specialized variants, Mistral's latest releases, and DeepSeek's coding models have all shipped in the past twelve months — and several of them beat GPT-4-class models on standard coding benchmarks.

The caveat: “local” means different things. Running a 70B parameter model locally requires a serious GPU. A 7B or 14B model runs comfortably on consumer hardware.

Why run locally

  • Privacy. Your code never leaves your machine. For proprietary codebases or anything under NDA, local inference eliminates the data-leakage vector.
  • Cost. A one-time GPU purchase amortizes to effectively zero inference cost at high volume.
  • Latency. No network round-trip. On an M3 Max, first-token latency for a 7B model is under 50ms.
  • Offline. Flights, hotel networks, air-gapped environments — your coding assistant works regardless.
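To put rough numbers on the cost argument, here is a back-of-the-envelope break-even calculation. The GPU price and API rate below are illustrative assumptions, not quotes:

```python
# Break-even: when does a one-time GPU purchase beat pay-per-token
# API pricing? All prices below are illustrative assumptions.

def break_even_tokens(gpu_cost_usd: float, api_price_per_mtok: float) -> float:
    """Tokens you must generate before the GPU pays for itself."""
    return gpu_cost_usd / api_price_per_mtok * 1_000_000

# Assumed: $1,600 for a used RTX 4090, $10 per million output tokens.
tokens = break_even_tokens(1600, 10.0)
print(f"{tokens / 1e6:.0f}M tokens to break even")  # 160M tokens
```

Electricity and your own time are real costs too, but at heavy daily usage the per-token cost still rounds toward zero.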

Hardware requirements

Model size is measured in parameters, and the practical limit is your GPU VRAM (or unified memory on Apple Silicon):

  1. 7B–14B models · 8–16GB VRAM. MacBook Pro M2/M3/M4 with 16–24GB, RTX 3060 12GB, RTX 4070. Good for autocomplete and simple refactors.
  2. 32B models · 20–32GB VRAM. RTX 3090/4090, Mac Studio M2/M3 Ultra. Noticeably better reasoning. Handles multi-file changes.
  3. 70B+ models · 48GB+ VRAM. RTX 6000 Ada, A100/H100, Mac Studio M3 Ultra 192GB. Near-frontier quality.
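The tiers above follow from simple arithmetic: weight memory is parameters times bytes per parameter, plus headroom for the KV cache and activations. A rough estimator (the 20% overhead factor is an assumption; it grows with context length and varies by runtime):

```python
def est_vram_gb(params_b: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights at the given quantization, plus
    ~20% overhead for KV cache and activations (assumption; varies
    with context length and runtime)."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb * overhead

# A 32B model at 4-bit lands around 19 GB, matching the 20-32GB tier.
print(f"{est_vram_gb(32, 4):.1f} GB")
```

The same formula puts a 7B model at 4-bit around 4 GB (comfortable on a 16GB laptop) and a 70B model at 4-bit around 42 GB, which is why that tier starts at 48GB.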

Live data · models ranked by coding score

| Model | Provider | Code | Ctx |
| --- | --- | --- | --- |
| R1 0528 | DeepSeek | 4.50 | 164K |
| Gemini 3 Flash Preview | Google | 4.50 | 1M |
| Qwen3.6 Plus | Qwen | 4.50 | 1M |
| Gemini 3.1 Flash Lite Preview | Google | 4.42 | 1M |
| Gemma 4 31B | Google | 4.42 | 262K |
| Gemini 3.1 Pro Preview | Google | 4.33 | 1M |
| Qwen3.5-9B | Qwen | 4.27 | 262K |
| Gemini 2.5 Pro | Google | 4.25 | 1M |
| DeepSeek V3.2 | DeepSeek | 4.25 | 131K |
| Gemma 4 26B A4B | Google | 4.25 | 262K |
| Mistral Medium 3.1 | Mistral | 4.25 | 131K |
| Qwen3.5-35B-A3B | Qwen | 4.20 | 262K |

Our pick for local coding

The right answer depends on your hardware:

Consumer laptop (16GB RAM, no discrete GPU): Qwen 2.5 Coder 7B Instruct. Runs in 4-bit quantization on Ollama, generates tokens fast enough to feel responsive.

Mac Studio / Pro with 32–64GB unified memory: DeepSeek-Coder-V2 or Qwen 2.5 Coder 32B. Dedicated code training and larger context windows make a real difference for multi-file edits.

RTX 4090 or dual-GPU rig (48GB+ VRAM): Llama 4 Scout or DeepSeek-V3 at 4-bit. Near-frontier quality — genuinely useful for agentic coding tasks.

A note on tooling: Ollama is the fastest way to get started on Mac/Linux. LM Studio adds a nicer UI. For editor integration, Continue.dev (VS Code) and Cursor's local model support are the most polished options.
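Most of these tools talk to the model over Ollama's OpenAI-compatible endpoint, served at http://localhost:11434/v1 by default. A minimal sketch of a client, assuming you have pulled `qwen2.5-coder:7b`; the request-building helper is pure, so only `ask()` needs a running server:

```python
import json
import urllib.request

# Ollama's default OpenAI-compatible chat endpoint.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat_body(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask(prompt: str, model: str = "qwen2.5-coder:7b") -> str:
    """POST the request to a locally running Ollama instance."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(chat_body(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With `ollama pull qwen2.5-coder:7b` done and the server running:
# print(ask("Write a Python one-liner to reverse a string."))
```

Because the endpoint speaks the OpenAI wire format, any editor plugin that accepts a custom base URL can point at it unchanged.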