devtake.dev

Running a coding agent fully on Apple Silicon, no cloud, is now an off-the-shelf stack

A popular Hacker News how-to walked through a fully local coding agent on Apple Silicon. Here's the realistic 2026 stack: runner, model, and harness.

Dieter Morelli · · 5 min read · 6 sources
A MacBook Pro beside a Surface Book, both open on a white surface, USB-C ports in view
SimonWaldherr / CC BY-SA 4.0 via Wikimedia Commons · Source

Kyle Howells lost his internet a few times and got tired of it. “I’d had my internet fail a few times recently leaving me stranded without a coding agent,” he wrote in a how-to that climbed Hacker News last week. So he wired one up that needs no cloud at all, running on an M1 Max with 64GB of memory.

That post is the hook, but the bigger story is that the setup it describes stopped being exotic. Two years ago a fully local coding agent on a laptop was a demo you’d watch once and forget. In 2026 it’s three off-the-shelf pieces that snap together: a model runner, an open-weight coding model that fits in your Mac’s memory, and an agent harness that points at a local endpoint instead of a paid API. None of it touches a server you don’t own. Here’s the realistic stack, what each piece does, and where it still loses to the cloud.

The local stack at a glance

ToolWhat it doesWhy it’s trendingLink
OllamaOne-command model runnerPulls and serves models with a local OpenAI-compatible APIollama.com
LM StudioGUI runner, GGUF and MLXRuns Apple’s MLX format natively for faster Apple Silicon inferencelmstudio.ai
llama.cpp / MLXBare-metal runtimesMetal acceleration; what the HN post actually usedgithub.com/ml-explore/mlx-lm
Qwen3-CoderOpen coding modelA small-activation MoE that punches far above its active sizeollama.com/library/qwen3-coder
DevstralAgentic coding modelBuilt for harness use, scores 68% on SWE-bench Verifiedmistral.ai/news/devstral
Aider / Cline / OpenCodeAgent harnessesPoint at any local endpoint, do real multi-file editsaider.chat

The runner: Ollama, LM Studio, or raw llama.cpp

The runner loads the model and exposes it over HTTP. Ollama is the low-friction default: ollama pull qwen3-coder:30b-a3b, and you get an OpenAI-compatible endpoint at http://localhost:11434/v1. LM Studio wraps the same job in a GUI and, more usefully, runs both GGUF and Apple’s MLX format in one app, so you can A/B the two on identical hardware. MLX is the one that matters on a Mac: Apple’s ml-explore team built it to use unified memory directly, which is why MLX builds often beat GGUF on the same chip.

Howells skipped the wrappers and ran llama-server from llama.cpp directly, built with Metal. More setup, more control. If you just want it working before lunch, start with Ollama and graduate later.

The model: Qwen3-Coder, Devstral, or gpt-oss

This is where the last year of progress shows up. Qwen3-Coder in its 30B-A3B form is a mixture-of-experts model: 30 billion parameters on disk, but only about 3 billion active per token, so it runs at roughly 30-35 tokens per second on an M4 Pro while needing around 17GB at 4-bit quantization. That MoE trick is the whole reason a coding model this capable fits in consumer memory at all.

Devstral, from Mistral and All Hands AI, was built specifically to drive agent harnesses rather than chat. The Small variant scores 68% on SWE-bench Verified, runs on a 32GB Mac, and ships under Apache 2.0. gpt-oss-20b is OpenAI’s first open-weight release since GPT-2, also Apache 2.0; OpenAI says it runs on 16GB of memory and lands near o3-mini on common benchmarks. Howells himself ran a Gemma variant for speed and reached for Qwen3.6 35B-A3B when he wanted stronger code, clocking it at 55 tokens per second.

The harness: Aider, Cline, Continue, OpenCode

A model alone doesn’t edit your repo. The harness does the agent work: reading files, planning a change, applying diffs, running tests. The good news is they’re all built around the same OpenAI-compatible shape, so any of them points at your local runner. Aider connects to Ollama or any local endpoint and is the terminal-native pick. Cline and Continue live in VS Code; you set the base URL to http://localhost:11434 and pick your model from a dropdown. OpenCode is the newer CLI option, with explicit local-provider examples in its docs. Swapping the cloud key for a localhost URL is the entire migration.

Where this still loses to the cloud

Be honest about the trade. A local 30B model is not Claude or GPT-5, and the gap shows up in three places. Speed: 30 to 55 tokens per second is usable but slower than a hosted frontier model, and a long agent loop feels it. Context: cloud models routinely take 200k-plus token contexts, while a quantized local model running comfortably in 32GB is happier with far less, so it loses the thread on big multi-file refactors. Tool-use reliability: frontier models are better at calling tools correctly turn after turn, and a weaker local model drops or malforms calls more often, which is exactly what breaks an agent mid-task.

So who is this actually for? Three groups. Anyone who can’t send code to a third party, which covers a lot of regulated and contractual work. Anyone who codes where the internet doesn’t reach, the problem that started Howells down this road. And anyone tired of per-token bills, since once the weights are on disk the marginal cost is electricity. We covered the opposite bet, OpenAI pushing Codex deeper into the cloud and onto your Mac’s controls, and the hardware arms race that makes bigger local models thinkable. The agent design questions are the same ones in SQLite’s refusal to merge agentic code, and the speed obsession echoes how Linear stays fast.

Our pick: Ollama plus Qwen3-Coder plus Aider

Start there. Ollama removes the setup tax, Qwen3-Coder is the best capability-per-gigabyte coding model that fits 32GB today, and Aider is the most forgiving harness when a local model fumbles a tool call. Get that combo answering in your own repo first. If you outgrow it, the upgrade path is clear: move to LM Studio for MLX speed, or swap in Devstral when you want a model tuned for the harness rather than the chat box. My read: treat this as your offline and privacy fallback, not a full replacement for Claude on the gnarly refactors. For everything else, it’s good enough now, and that sentence wasn’t true a year ago.

Share this article

Quick reference

MoE
Mixture-of-experts, a model design that routes each request to a small subset of specialized sub-networks, so a 1.2-trillion-parameter model only fires a fraction of itself per query.
MLX
Apple's array framework for Apple Silicon. It uses the chip's unified memory so the CPU and GPU share one pool, which speeds up local model inference on a Mac.
unified memory
Apple Silicon's single high-bandwidth memory pool shared by CPU, GPU, and Neural Engine, so model weights load once without a separate VRAM copy.

Sources

Mentioned in this article