Running a coding agent fully on Apple Silicon, no cloud, is now an off-the-shelf stack

A popular Hacker News how-to walked through a fully local coding agent on Apple Silicon. Here's the realistic 2026 stack: runner, model, and harness.

Kyle Howells lost his internet a few times and got tired of it. “I’d had my internet fail a few times recently leaving me stranded without a coding agent,” he wrote in a how-to that climbed Hacker News last week. So he wired one up that needs no cloud at all, running on an M1 Max with 64GB of memory.

That post is the hook, but the bigger story is that the setup it describes stopped being exotic. Two years ago a fully local coding agent on a laptop was a demo you’d watch once and forget. In 2026 it’s three off-the-shelf pieces that snap together: a model runner, an open-weight coding model that fits in your Mac’s memory, and an agent harness that points at a local endpoint instead of a paid API. None of it touches a server you don’t own. Here’s the realistic stack, what each piece does, and where it still loses to the cloud.

The local stack at a glance

Tool	What it does	Why it’s trending	Link
Ollama	One-command model runner	Pulls and serves models with a local OpenAI-compatible API	ollama.com
LM Studio	GUI runner, GGUF and MLX	Runs Apple’s MLX format natively for faster Apple Silicon inference	lmstudio.ai
llama.cpp / MLX	Bare-metal runtimes	Metal acceleration; what the HN post actually used	github.com/ml-explore/mlx-lm
Qwen3-Coder	Open coding model	A small-activation MoE that punches far above its active size	ollama.com/library/qwen3-coder
Devstral	Agentic coding model	Built for harness use, scores 68% on SWE-bench Verified	mistral.ai/news/devstral
Aider / Cline / OpenCode	Agent harnesses	Point at any local endpoint, do real multi-file edits	aider.chat

The runner: Ollama, LM Studio, or raw llama.cpp

The runner loads the model and exposes it over HTTP. Ollama is the low-friction default: ollama pull qwen3-coder:30b-a3b, and you get an OpenAI-compatible endpoint at http://localhost:11434/v1. LM Studio wraps the same job in a GUI and, more usefully, runs both GGUF and Apple’s MLX format in one app, so you can A/B the two on identical hardware. MLX is the one that matters on a Mac: Apple’s ml-explore team built it to use unified memory directly, which is why MLX builds often beat GGUF on the same chip.

Howells skipped the wrappers and ran llama-server from llama.cpp directly, built with Metal. More setup, more control. If you just want it working before lunch, start with Ollama and graduate later.

The model: Qwen3-Coder, Devstral, or gpt-oss

This is where the last year of progress shows up. Qwen3-Coder in its 30B-A3B form is a mixture-of-experts model: 30 billion parameters on disk, but only about 3 billion active per token, so it runs at roughly 30-35 tokens per second on an M4 Pro while needing around 17GB at 4-bit quantization. That MoE trick is the whole reason a coding model this capable fits in consumer memory at all.

Devstral, from Mistral and All Hands AI, was built specifically to drive agent harnesses rather than chat. The Small variant scores 68% on SWE-bench Verified, runs on a 32GB Mac, and ships under Apache 2.0. gpt-oss-20b is OpenAI’s first open-weight release since GPT-2, also Apache 2.0; OpenAI says it runs on 16GB of memory and lands near o3-mini on common benchmarks. Howells himself ran a Gemma variant for speed and reached for Qwen3.6 35B-A3B when he wanted stronger code, clocking it at 55 tokens per second.

The harness: Aider, Cline, Continue, OpenCode

A model alone doesn’t edit your repo. The harness does the agent work: reading files, planning a change, applying diffs, running tests. The good news is they’re all built around the same OpenAI-compatible shape, so any of them points at your local runner. Aider connects to Ollama or any local endpoint and is the terminal-native pick. Cline and Continue live in VS Code; you set the base URL to http://localhost:11434 and pick your model from a dropdown. OpenCode is the newer CLI option, with explicit local-provider examples in its docs. Swapping the cloud key for a localhost URL is the entire migration.

Where this still loses to the cloud

Be honest about the trade. A local 30B model is not Claude or GPT-5, and the gap shows up in three places. Speed: 30 to 55 tokens per second is usable but slower than a hosted frontier model, and a long agent loop feels it. Context: cloud models routinely take 200k-plus token contexts, while a quantized local model running comfortably in 32GB is happier with far less, so it loses the thread on big multi-file refactors. Tool-use reliability: frontier models are better at calling tools correctly turn after turn, and a weaker local model drops or malforms calls more often, which is exactly what breaks an agent mid-task.

So who is this actually for? Three groups. Anyone who can’t send code to a third party, which covers a lot of regulated and contractual work. Anyone who codes where the internet doesn’t reach, the problem that started Howells down this road. And anyone tired of per-token bills, since once the weights are on disk the marginal cost is electricity. We covered the opposite bet, OpenAI pushing Codex deeper into the cloud and onto your Mac’s controls, and the hardware arms race that makes bigger local models thinkable. The agent design questions are the same ones in SQLite’s refusal to merge agentic code, and the speed obsession echoes how Linear stays fast.

Our pick: Ollama plus Qwen3-Coder plus Aider

Start there. Ollama removes the setup tax, Qwen3-Coder is the best capability-per-gigabyte coding model that fits 32GB today, and Aider is the most forgiving harness when a local model fumbles a tool call. Get that combo answering in your own repo first. If you outgrow it, the upgrade path is clear: move to LM Studio for MLX speed, or swap in Devstral when you want a model tuned for the harness rather than the chat box. My read: treat this as your offline and privacy fallback, not a full replacement for Claude on the gnarly refactors. For everything else, it’s good enough now, and that sentence wasn’t true a year ago.

Running a coding agent fully on Apple Silicon, no cloud, is now an off-the-shelf stack

The local stack at a glance

The runner: Ollama, LM Studio, or raw llama.cpp

The model: Qwen3-Coder, Devstral, or gpt-oss

The harness: Aider, Cline, Continue, OpenCode

Where this still loses to the cloud

Our pick: Ollama plus Qwen3-Coder plus Aider

Share this article

Quick reference

Sources

Mentioned in this article