The 3B Wall: What Apple’s On-Device LLM Can and Can’t Do in Your Shell


Apple quietly shipped a 3B-parameter LLM on every Mac running macOS Tahoe. It sits on the Neural Engine and powers Siri and Writing Tools, and that’s about it. There’s no CLI and no way to call it from your own code outside of Xcode.

Then apfel showed up — a Swift CLI that wraps Apple’s FoundationModels framework and gives the on-device model a command-line interface. It hit the front page of Hacker News, and it got me thinking about possible applications.

macOS’s default shell, zsh, has hooks that fire before you run a command, when a command isn’t found, and after a command fails. They’ve always been there, but the options for filling them were either a deterministic heuristic (Levenshtein-based “did you mean”) or an LLM that was too slow locally or required a cloud roundtrip. A 3B model on the Neural Engine responding in under a second changes the tradeoff.

[Image: hunch, on-device LLM shell commands]

I wired all three hooks to the on-device model. It worked for simple things, then started hallucinating flags. So I benchmarked multiple approaches across 100 prompts to find out what actually helps, built the winning approach into a CLI called hunch, and learned something about where the 3B wall actually is.

The three hooks

zsh has three built-in extension points that fire at specific moments in the command lifecycle. Each one is a natural place to put an LLM call:

Hook	When it fires	What I wired it to
`zle` widget (Ctrl+G)	Before you run a command	Natural language → shell command. Replaces the buffer, you inspect before hitting Enter. Never executes anything.
`command_not_found_handler`	Command isn’t in `$PATH`	`gti push` → `did you mean: git push`. `ip a` → `did you mean: ifconfig`.
`TRAPZERR`	Non-zero exit code	One-line explanation of what went wrong, in dim grey. Skips signals, benign exits (`grep` no-match, `diff`), and commands containing tokens or passwords.
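
The TRAPZERR row describes a filter that decides whether a failure is worth explaining at all. The real filter lives in the zsh plugin; what follows is only a Python sketch of the same decision, under stated assumptions. The names `should_explain`, `BENIGN_EXITS`, and `SECRET_PATTERN` are hypothetical, and the exact exit codes and keywords are guesses at what “benign” and “tokens or passwords” might mean.

```python
import re

# Exit codes that mean "nothing matched" or "files differ" rather than a failure.
# Assumed list for illustration; the article names only grep and diff.
BENIGN_EXITS = {("grep", 1), ("diff", 1)}

# Hypothetical pattern for commands that may carry credentials.
SECRET_PATTERN = re.compile(r"(token|password|passwd|secret|api[_-]?key)", re.IGNORECASE)


def should_explain(command: str, exit_code: int) -> bool:
    """Return True if a failed command is worth sending to the model."""
    if exit_code == 0:
        return False
    # Shells report death-by-signal as 128 + signal number (130 for Ctrl+C).
    if exit_code > 128:
        return False
    words = command.split()
    program = words[0] if words else ""
    if (program, exit_code) in BENIGN_EXITS:
        return False
    # Never forward anything that looks like it contains a credential.
    if SECRET_PATTERN.search(command):
        return False
    return True
```

Checking the secret pattern last keeps the cheap exit-code tests first, and every branch fails closed: when in doubt, the command is never sent to the model.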

The Ctrl+G hook is the main one. Type a description, hit Ctrl+G, the buffer gets replaced:

[Figure: Ctrl+G demo. Type a description, hit Ctrl+G, get the command.]

The key safety property: Ctrl+G never runs anything. It fills the buffer. You always read before you execute. This turns out to be load-bearing, not safety theater, because the model is still wrong about a third of the time (the shipped default scores 66% usable).

The 40% baseline

Simple commands work. Typo correction is reliable. But the moment you ask for anything with specific flags, the model hallucinates.

find files changed in the last hour → find . -mtime +1h

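This is wrong where it counts: + means “more than,” so the command matches files older than an hour instead of newer ones. It is also fragile: -mtime counts days unless you add a unit, and the h unit suffix is a BSD find extension that GNU find rejects. The correct command is find . -mmin -60.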

I ran 100 prompts — 31 simple, 51 flag-heavy, 18 composed — and scored each result. Baseline: 40% usable. 60% wrong.

The model is also oblivious to macOS. “Show my IP address” returns ip a — Linux, doesn’t exist on macOS. It doesn’t know pbcopy, caffeinate, mdfind, pmset, or sips. It reaches for systemctl, lsusb, iwconfig every time.

Some hallucinations are dangerous. Asked for a soft git reset, it generated git reset --hard HEAD~1, which drops the last commit from the branch and destroys any uncommitted changes along with it. This is the case the inspect-before-run design exists for.

What I tried

I tried a few approaches to improve accuracy. They fall into four categories.

[Figure: Approaches overview]

Help the model reason better. Give it reference material so it can look up the right flags. I tried parsing man pages into flag indexes, fetching tldr pages as documentation, grepping man pages for relevant keywords, and hardcoding cheat sheets in the system prompt. The model sees the right flag in the docs but can’t reason about it and apply it.

Let the model self-correct. Run it multiple times and pick the majority answer (self-consistency). Or generate a command then ask “is this correct for macOS? if not, fix it” (self-critique). Neither helps — deterministic output means every run returns the same wrong answer, and the model “fixes” correct commands into wrong ones.

Give it solved problems to copy. Instead of documentation, inject Q/A pairs the model can pattern-match against — “find files larger than 100MB” → find . -size +100M. Static few-shot uses 8 fixed examples. Dynamic few-shot picks the 8 most similar from a bank using token-overlap similarity. Accuracy depends on bank size — 76 hand-crafted examples scored 69% but that was inflated by test-set leakage. I went back to tldr, this time using it as solved examples instead of documentation: 20k+ community-written Q/A pairs for ~3k commands, parsed into a SQLite FTS5 index with additional macOS-specific overrides. This is what hunch ships.
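
To make the dynamic few-shot idea concrete, here is a minimal Python sketch of picking the most similar solved examples and rendering them as a prompt. It assumes Jaccard overlap on lowercase word tokens as the “token-overlap similarity”; the article does not specify the exact metric, and the function names and three-entry bank are purely illustrative.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pick_examples(query, bank, k=8):
    """Return the k (question, command) pairs most similar to the query."""
    ranked = sorted(bank, key=lambda pair: overlap(query, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(query, bank, k=8):
    """Render the chosen pairs as Q/A blocks, ending with the open question."""
    blocks = [f"Q: {q}\nA: {cmd}" for q, cmd in pick_examples(query, bank, k)]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Tiny illustrative bank; the shipped bank is the tldr corpus plus overrides.
BANK = [
    ("find files larger than 100MB", "find . -size +100M"),
    ("find files changed in the last hour", "find . -mmin -60"),
    ("copy file contents to the clipboard", "pbcopy < file.txt"),
]
```

Calling `build_prompt("find files modified in the last hour", BANK, k=2)` puts the `-mmin -60` pair first, so the model has a near-identical pattern to copy instead of a flag description to reason about.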

Add sampling diversity. At temperature 0 the model is deterministic — majority vote across multiple runs returns the same wrong answer. Without the example bank, raising temperature to 0.3 with 3 samples actually makes things worse (39% vs 41%). But on top of the example bank it pushes accuracy to 73%: examples anchor the direction, temperature explores alternatives, majority vote filters outliers. Tradeoff: 1.3s instead of 0.4s.
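
The voting step itself is small. The sketch below assumes a hypothetical `generate(prompt, temperature)` callable standing in for the model call; it is not hunch’s actual code, and the whitespace normalization is an assumption about what should count as “the same” command.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most common candidate; ties go to the earliest one seen."""
    # Collapse whitespace so trivially different spellings vote together.
    normalized = [" ".join(c.split()) for c in candidates if c.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]


def self_consistent(generate, prompt, samples=3, temperature=0.3):
    """Sample the (hypothetical) generate callable several times and vote."""
    outputs = [generate(prompt, temperature) for _ in range(samples)]
    return majority_vote(outputs)
```

At temperature 0 every sample is identical, so the vote is a no-op, which matches the selfconsist result in the benchmark.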

Approach	What it does	Usable	Avg Time**
dynshot-tldr + sc	dynshot-tldr with temp 0.3, 3 samples, majority vote	73%	1.3s
dynshot-tldr	8 similar examples from 21k tldr Q/A pairs (FTS5)	66%	0.4s
dynshot	8 similar examples from 76 hand-crafted Q/A pairs	69%*	0.9s
fewshot	8 static hand-picked examples	43%	1.1s
permissive	bare prompt, relaxed guardrails	41%	0.3s
selfconsist	3 samples, majority vote (temp 0 — deterministic, useless)	41%	1.1s
minimal	bare prompt	40%	0.4s
minimal + sc	bare prompt, temp 0.3, 3 samples, majority vote	39%	1.3s
tldr	tldr page fed as documentation context	38%	1.4s
manindex	man page parsed into flag index	37%	1.5s
verify	generate then self-critique	33%	0.7s

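* Biased: a hold-out test showed the real gain was only +4pp.
** Prototype timings (Python + apfel), except for rows that report hunch CLI times (the dynshot-tldr rows, whose 0.4s and 1.3s match the hunch figures quoted later).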

The full benchmark suite (100 prompts, all approaches, raw results) is in the hunch repo.

The 3B wall

The benchmark results tell a specific story about what this model can and can’t do.

It can’t reason from documentation. I gave it man page flag indexes, tldr pages as context, cheat sheets, targeted doc sections. It finds the right flag name — it sees -mmin — but outputs -mmin 1 instead of -mmin -60. The +n/-n/n semantics are beyond what it can derive from a description. Man page keyword grep fails too: man pages don’t use words like “hour” or “changed,” so naive search returns nothing. None of these approaches beat a bare prompt.
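
For reference, the +n/-n/n convention the model fails to derive fits in a few lines. This Python sketch is illustrative only: `mmin_matches` is a hypothetical name, ages are taken as whole minutes, and the rounding rules that differ between find implementations are ignored.

```python
def mmin_matches(age_minutes: int, arg: str) -> bool:
    """Evaluate a find-style numeric argument against a file age in minutes.

    "+n" means more than n, "-n" means fewer than n, bare "n" means exactly n.
    """
    if arg.startswith("+"):
        return age_minutes > int(arg[1:])
    if arg.startswith("-"):
        return age_minutes < int(arg[1:])
    return age_minutes == int(arg)
```

A file modified 10 minutes ago matches `-60` but not `1`, which is why `-mmin 1` finds almost nothing: it asks for files that are exactly one minute old.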

It can’t self-correct. Self-critique (generate then ask “is this correct?”) dropped accuracy to 33% — worse than baseline. The model “fixes” correct commands into wrong ones. If it can’t reason about flag semantics in the first pass, the second pass uses the same broken reasoning.

It can copy patterns. Give it "find . -mmin -60 finds files in the last hour" as a literal example and it outputs correctly — because it copies, not reasons. That’s why few-shot examples work and documentation doesn’t. And it’s why self-consistency only helps on top of the example bank: without examples, sampling explores variations around a wrong center. With examples, it explores variations around a roughly-right center, and majority vote filters the outliers.

The precise claim: Apple’s 3B on-device model can classify intent and copy patterns but cannot reason over documentation to derive correct usage. The model knows what tool to reach for but doesn’t know how to hold it.

The meta-lesson: the right question for a small on-device model isn’t “can it do this task?” but “can I decompose the task so the model only does the parts it’s strong at?”

hunch

The dynshot-tldr approach worked well enough that I built it into a proper tool. hunch is a small Swift CLI that calls FoundationModels directly, with no apfel dependency and with FTS5 search over the tldr bank built in. It installs three files: the binary (~1MB), a pre-built FTS5 database (4MB, 21k entries), and a zsh plugin.

brew install es617/tap/hunch
source ~/.local/share/hunch/hunch.zsh  # add to ~/.zshrc
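
For a feel of what the pre-built FTS5 database does at query time, here is a small Python sketch using the standard `sqlite3` module. It assumes your Python’s SQLite was compiled with FTS5 (most are), and the table name, column names, and OR-of-words query are assumptions for illustration, not hunch’s actual schema.

```python
import re
import sqlite3


def build_index(pairs):
    """Create an in-memory FTS5 table of (question, command) pairs."""
    db = sqlite3.connect(":memory:")
    # Only the question text is searchable; the command is stored alongside it.
    db.execute("CREATE VIRTUAL TABLE examples USING fts5(question, command UNINDEXED)")
    db.executemany("INSERT INTO examples (question, command) VALUES (?, ?)", pairs)
    return db


def search(db, query, k=8):
    """Return up to k pairs ranked by FTS5's built-in BM25 score."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Quote each word and OR them so a partial match still returns rows.
    match = " OR ".join(f'"{w}"' for w in words)
    rows = db.execute(
        "SELECT question, command FROM examples "
        "WHERE examples MATCH ? ORDER BY rank LIMIT ?",
        (match, k),
    )
    return rows.fetchall()
```

Quoting each word keeps FTS5 query operators from being interpreted in user input, and `ORDER BY rank` returns the best BM25 matches first.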

The default ships fast — single sample, deterministic output, 66% accuracy in 0.4s. If you want to trade speed for accuracy, bumping temperature to 0.3 and taking 3 samples with majority vote gets you to 73% at ~1.3s. Both are configurable via CLI flags or environment variables. Everything runs on the Neural Engine. No cloud, no API keys, no data leaves your Mac.

73% means it gets the commands you actually reach for — “show disk usage sorted by size,” “find large files,” “compress this directory,” “show my git log as a graph.” The 27% it gets wrong tends to be exotic — awk '{s+=$1} END {print s}', comm -12 <(sort f1) <(sort f2), multi-step pipelines with process substitution. Those are the ones you’d double-check anyway. The example bank is extensible — hunch ships with 60 macOS-specific overrides on top of the tldr corpus, and you can add your own.

A couple of things to know. FoundationModels is Tahoe only (macOS 26; Sequoia and earlier don’t have it). Apple’s guardrails are also inconsistent: “kill whatever is using port 3000” returns empty because the word “kill” triggers the safety filter. hunch uses permissiveContentTransformations to reduce these false positives.

What this means

The benchmark tells a bigger story than shell commands. The model is bad at generating syntax but good at classifying intent and picking from options. That’s exactly what tool-calling requires — and FoundationModels supports it.

A 3B model that picks the right tool, passes the right arguments, and summarizes the result is more useful than one that tries to generate correct code from scratch. It is the same decomposition lesson again: let the model do only the parts it’s strong at, and let deterministic code handle the syntax.
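
One way to picture that split is a dispatcher where the model only names a tool and fills in arguments, while ordinary code owns the syntax. The Python sketch below is a stand-in and does not use the FoundationModels tool-calling API; the JSON shape, the `TOOLS` registry, and both tool names are hypothetical.

```python
import json
import shlex

# Hypothetical registry: each tool builds its argv from validated arguments,
# so flag syntax such as "-mmin -60" is never left to the model.
TOOLS = {
    "find_recent": lambda a: ["find", a.get("path", "."), "-mmin", f"-{int(a['minutes'])}"],
    "find_large": lambda a: ["find", a.get("path", "."), "-size", f"+{int(a['megabytes'])}M"],
}


def render_command(model_output: str) -> str:
    """Turn the model's JSON tool choice into a shell-quoted command string."""
    call = json.loads(model_output)
    name = call["tool"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    argv = TOOLS[name](call.get("arguments", {}))
    # shlex.join quotes each argument so paths with spaces stay intact.
    return shlex.join(argv)
```

Given `{"tool": "find_recent", "arguments": {"minutes": 60}}`, the dispatcher renders `find . -mmin -60`. The model classified the intent and supplied a number; it never had to know what the minus sign means.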

And it all runs locally — no roundtrip, no API key, no data leaving the machine. For anything latency-sensitive or privacy-sensitive, that matters. Every Tahoe Mac already has this capability sitting idle on the Neural Engine. hunch is one way to use it. There will be others.