The open-weight model ecosystem is genuinely impressive. Meta's Llama 4 family, Qwen's code-specialized variants, Mistral's latest releases, and DeepSeek's coding models have all shipped in the past twelve months — and several of them beat GPT-4-class models on standard coding benchmarks.
The caveat: “local” means different things depending on your hardware. Running a 70B-parameter model requires a serious GPU, while a 7B or 14B model runs comfortably on a consumer laptop.
Why run locally
- Privacy. Your code never leaves your machine. For proprietary codebases or anything under NDA, local inference eliminates the data-leakage vector.
- Cost. A one-time GPU purchase amortizes to effectively zero inference cost at high volume.
- Latency. No network round-trip. First-token latency on a 7B model with an M3 Max is under 50ms (the sketch after this list shows one way to measure it).
- Offline. Flights, hotel networks, air-gapped environments — your coding assistant works regardless.
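To sanity-check that latency claim on your own machine, here is a minimal sketch that times the first streamed token from a local Ollama server via its `/api/generate` streaming endpoint. The `qwen2.5-coder:7b` tag is just an example; substitute whatever model you have pulled.

```python
# Time-to-first-token against a local Ollama server. Assumes Ollama is
# running on its default port (11434) and that the model tag below has
# already been pulled; swap in whatever model you actually use.
import json
import time
import urllib.request

MODEL = "qwen2.5-coder:7b"  # example tag, not a recommendation

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": MODEL, "prompt": "def fib(n):", "stream": True}).encode(),
    headers={"Content-Type": "application/json"},
)

start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    for line in resp:  # streaming mode sends one JSON object per line
        chunk = json.loads(line)
        if chunk.get("response"):  # first non-empty token from the model
            print(f"first token after {(time.perf_counter() - start) * 1000:.0f} ms")
            break
```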
Hardware requirements
Model size is measured in parameters, and the practical limit is your GPU VRAM (or unified memory on Apple Silicon); the sketch after these tiers shows how to estimate a model's footprint:
- 7B–14B models · 8–16GB VRAM. MacBook Pro M2/M3/M4 with 16–24GB, RTX 3060 12GB, RTX 4070. Good for autocomplete and simple refactors.
- 32B models · 20–32GB VRAM. RTX 3090/4090, Mac Studio M2/M3 Ultra. Noticeably better reasoning. Handles multi-file changes.
- 70B+ models · 48GB+ VRAM. RTX 6000 Ada, A100/H100, Mac Studio M3 Ultra 192GB. Near-frontier quality.
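The tiers above follow from simple arithmetic: quantized weights take (bits / 8) bytes per parameter, so billions of parameters map directly to gigabytes, plus headroom for the KV cache and runtime buffers. The sketch below encodes that rule of thumb; the 1.2x overhead factor is my assumption, not a vendor figure.

```python
# Back-of-envelope VRAM estimate: weights take (quant_bits / 8) bytes per
# parameter, so billions of params map directly to GB; the 1.2x factor
# covering KV cache and runtime buffers is an assumption, not a spec.
def approx_vram_gb(params_billion: float, quant_bits: int = 4, overhead: float = 1.2) -> float:
    weights_gb = params_billion * quant_bits / 8  # 7B at 4-bit ≈ 3.5 GB of weights
    return weights_gb * overhead

for size in (7, 14, 32, 70):
    print(f"{size}B at 4-bit: ~{approx_vram_gb(size):.1f} GB")
# Prints ~4.2, ~8.4, ~19.2, ~42.0 GB, roughly matching the tiers above.
```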
| Model | Provider | Avg | Code | Context |
|---|---|---|---|---|
| R1 0528 | DeepSeek | 4.50 | — | 164K |
| Gemini 3 Flash Preview | Google | 4.50 | — | 1M |
| Qwen: Qwen3.6 Plus | Qwen | 4.50 | — | 1M |
| Gemini 3.1 Flash Lite Preview | Google | 4.42 | — | 1M |
| Gemma 4 31B | Google | 4.42 | — | 262K |
| Gemini 3.1 Pro Preview | Google | 4.33 | — | 1M |
| Qwen: Qwen3.5-9B | Qwen | 4.27 | — | 262K |
| Gemini 2.5 Pro | Google | 4.25 | — | 1M |
| DeepSeek V3.2 | DeepSeek | 4.25 | — | 131K |
| Gemma 4 26B A4B | Google | 4.25 | — | 262K |
| Mistral Medium 3.1 | Mistral | 4.25 | — | 131K |
| Qwen: Qwen3.5-35B-A3B | Qwen | 4.20 | — | 262K |
Our pick for local coding
The right answer depends on your hardware (a minimal sketch of querying whichever model you pick follows this list):
- Consumer laptop (16GB RAM, no discrete GPU): Qwen 2.5 Coder 7B Instruct. Runs in 4-bit quantization on Ollama and generates tokens fast enough to feel responsive.
- Mac Studio / Pro with 32–64GB unified memory: DeepSeek-Coder-V2 or Qwen 2.5 Coder 32B. Dedicated code training and larger context windows make a real difference for multi-file edits.
- RTX 4090 or dual-GPU rig (48GB+ VRAM): Llama 4 Scout or DeepSeek-V3 at 4-bit. Near-frontier quality, genuinely useful for agentic coding tasks.
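Whichever pick matches your hardware, querying it programmatically is the same few lines. Here is a minimal sketch against Ollama's `/api/chat` endpoint; the model tag is an example, so swap in your own.

```python
# One-shot coding query against Ollama's /api/chat endpoint, which
# returns the assistant message as JSON when streaming is disabled.
import json
import urllib.request

def ask_local(prompt: str, model: str = "qwen2.5-coder:7b") -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps({
            "model": model,  # swap in the pick that matches your hardware tier
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

print(ask_local("Explain what this does: [x * 2 for x in range(10)]"))
```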
A note on tooling: Ollama is the fastest way to get started on Mac/Linux. LM Studio adds a nicer UI. For editor integration, Continue.dev (VS Code) and Cursor's local model support are the most polished options.
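One reason editor integrations slot in so easily: Ollama also serves an OpenAI-compatible API at `/v1`, so existing OpenAI clients (and plugins built on them) work unchanged. A minimal sketch, assuming the `openai` package is installed and the example model tag below has been pulled:

```python
# Ollama exposes an OpenAI-compatible endpoint at /v1, so the standard
# OpenAI client works against a local model with only a base_url change.
# The api_key is required by the client but ignored by the local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="qwen2.5-coder:7b",  # example tag; any pulled model works
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(reply.choices[0].message.content)
```

Pointing an editor integration at that same local endpoint is the whole trick: the workflow looks like a hosted API, but your code still never leaves the machine.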