metalstream: running a 31B model on a 16 GB laptop by streaming weights off NVMe

The premise is simple and a little absurd: take Gemma 4 31B - a quantized model that is 18.44 GB on disk - and make it produce tokens on a 16 GB M4 MacBook Air, a machine whose GPU will only let one process wire down 10.67 GB at a time. The model is about 1.7× the size of the hole it has to fit through. metalstream is the substrate that makes it run anyway, by never holding the whole model in memory at once.

The wall: a ceiling you cannot configure away

I did not start by writing a streaming engine. I started by trying the cheap things and proving they fail, because a substrate this invasive needs to be justified, not assumed. Phase 0 asked the most basic question - does a bare lazy=True memory-mapped load just work? The validation model (Gemma 4 E4B, 5.25 GB) ran cleanly at ~33 tok/s, confirming the harness and loader were sound. The 31B hero model OOM'd at the Metal command buffer on first use.

Phase 1 asked whether MLX's memory-tier knobs - set_wired_limit, set_memory_limit, set_cache_limit - could keep it resident. They cannot. The Metal driver on this M4 exposes a hard max_recommended_working_set_size of 10.67 GB; any wired limit above that is silently rejected. A dense 18.44 GB model cannot fit in a 10.67 GB working set no matter how you set the dials. Even a Q3 re-quant would land around 13 GB - still over the cap. Streaming wasn't a nice-to-have; it was the only door left.
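The residency argument is plain arithmetic, and a minimal sketch makes it explicit. The sizes are the ones quoted above; the helper name fits_resident is hypothetical, and the check only compares weights on disk against the cap, so it ignores KV cache and activations and is optimistic if anything.

```python
# Sizes in GB, as quoted in the text. The Q3 figure is the text's rough estimate.
WORKING_SET_CAP_GB = 10.67  # Metal max_recommended_working_set_size on this M4

candidates = {
    "Gemma 4 E4B (validation)": 5.25,
    "Gemma 4 31B Q4 (hero)": 18.44,
    "Gemma 4 31B Q3 (estimated)": 13.0,
}


def fits_resident(size_gb, cap_gb=WORKING_SET_CAP_GB):
    """Hypothetical helper: can the weights alone sit under the working-set cap?"""
    return size_gb <= cap_gb


# Map each model to (fits?, size as a multiple of the cap).
report = {
    name: (fits_resident(size), round(size / WORKING_SET_CAP_GB, 2))
    for name, size in candidates.items()
}

for name, (fits, ratio) in report.items():
    verdict = "fits" if fits else "cannot be resident"
    print(f"{name}: {ratio:.2f}x the cap -> {verdict}")
```

Only the validation model comes in under the cap; both 31B variants exceed it before a single byte of KV cache is counted.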

The dominant fact of the whole project. The 31B model on disk is 1.7× the GPU's single-process working-set ceiling, so no combination of memory-tier knobs can make it resident. The validation model fits comfortably and anchors every benchmark.

The idea: "LLM in a Flash," on Metal

A transformer doesn't need every layer's weights at the same instant. During a forward pass it visits decoder layers in order - layer 0, then 1, then 2, all the way to 59 - and once it has left a layer, those weights are dead weight until the next token. metalstream exploits exactly this: load the weights for layer i when the pass enters it, evict them when the pass leaves, and prefetch layer i+1 in the background while layer i computes. The GPU only ever holds a couple of layers at a time, so the resident footprint is tiny even though the model on disk is enormous.

Per-layer streaming: the active layer is materialized into a small GPU working set, the next layer is prefetched while it computes, and the spent layer is evicted so its slot can be reused. Peak residency is a couple of layers, not the whole model.
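The load, compute, evict, prefetch cycle can be sketched in a few lines of plain Python. This is an illustration of the access pattern only, not metalstream's code: load_layer and compute_layer are hypothetical stand-ins for the NVMe read and the Metal forward pass, and a single background thread plays the role of the prefetcher.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 60  # decoder layers 0..59, per the text


def load_layer(i):
    """Hypothetical stand-in for reading layer i's weights from NVMe."""
    return {"layer": i}


def compute_layer(weights, x):
    """Hypothetical stand-in for the layer's forward computation."""
    return x + weights["layer"]


def streamed_forward(x, num_layers=NUM_LAYERS, prefetch_distance=1):
    pending = {}        # layer index -> background load (in flight or finished)
    next_to_submit = 0  # first layer whose load has not been requested yet
    peak_held = 0       # most layers held at once (active + prefetched)
    with ThreadPoolExecutor(max_workers=1) as io:
        for i in range(num_layers):
            # Keep the loader up to prefetch_distance layers ahead of compute.
            while next_to_submit < num_layers and next_to_submit <= i + prefetch_distance:
                pending[next_to_submit] = io.submit(load_layer, next_to_submit)
                next_to_submit += 1
            # Block until the active layer's weights have arrived.
            weights = pending.pop(i).result()
            peak_held = max(peak_held, 1 + len(pending))
            x = compute_layer(weights, x)
            # Drop the spent layer so its memory can be reused.
            del weights
    return x, peak_held


out, peak = streamed_forward(0)
print(f"output={out}, peak layers held={peak}")
```

With prefetch_distance=1 the loop never holds more than two layers at once, which is the property that keeps the resident footprint small regardless of how large the model is on disk.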

With resident_blocks=2 and prefetch_distance=1, the 31B model that categorically OOM'd in Phases 0 and 1 produces tokens end-to-end. And its resident GPU footprint is almost comically small relative to the cap that was blocking it - a peak Metal pool of 2.35 GB, roughly one-fifth of the 10.67 GB ceiling, with peak RSS of 1.85 GB.

The streamed hero leaves ~78% of the working-set budget on the table. That headroom is what Phase 3 went hunting in - and what turned out not to be the bottleneck at all.

It runs - but the wall just moved

Honesty time: streaming doesn't make the 31B fast. It makes it possible. The measured decode rate is 0.13 tok/s. That's "your overnight job finishes while you sleep," not "you type and it answers." Streaming traded the OOM wall for an SSD-bandwidth-and-layer-load wall - exactly what the "LLM in a Flash" framing predicts. The contrast with the resident validation model is the real story:

The throughput cliff. Fitting in RAM buys interactive speed (32.9 tok/s for the resident E4B); streaming off NVMe buys feasibility (0.13 tok/s for the 31B) at roughly 250× the per-token cost. The honest headline is "it runs at all," not "it runs fast."
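To put the cliff in wall-clock terms, here is the arithmetic as a short sketch. The two rates are the measured ones from the text; the 500-token reply length is an illustrative assumption, and the two rates come from different models, so the ratio describes the gap a user would feel rather than a like-for-like slowdown.

```python
resident_tok_s = 32.9  # Gemma 4 E4B, fully resident (Phase 0)
streamed_tok_s = 0.13  # Gemma 4 31B, streamed (Phase 2)

slowdown = resident_tok_s / streamed_tok_s
seconds_per_token = 1.0 / streamed_tok_s


def wall_minutes(n_tokens, tok_s):
    """Minutes of decode time for n_tokens at a given tokens-per-second rate."""
    return n_tokens / tok_s / 60.0


REPLY_TOKENS = 500  # assumed reply length, for illustration only

print(f"slowdown: {slowdown:.0f}x")
print(f"streamed: {seconds_per_token:.1f} s per token")
print(f"{REPLY_TOKENS} tokens streamed: {wall_minutes(REPLY_TOKENS, streamed_tok_s):.0f} min")
print(f"{REPLY_TOKENS} tokens resident: {wall_minutes(REPLY_TOKENS, resident_tok_s) * 60:.0f} s")
```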

Why a bigger cache doesn't help

With roughly 8 GB of unused pool budget sitting on the table, the obvious next move is to grow resident_blocks from 2 into a real LRU cache and let layer reuse amortize the load cost. Phase 3 swept N ∈ {2, 4, 8, 16} and the result was a flat line - decode stayed at 0.12-0.13 tok/s, with layer_loads = 23,520 reproduced exactly at every residency budget.

The pathology: a forward pass visits 60 decoder layers in order. An LRU cache of size N < 60 always evicts a layer before the next pass comes back around to it - so the cache hit rate is 0% regardless of N. Buying more residency for a strictly sequential access pattern buys nothing. The throughput floor here is on-GPU compute, not storage I/O.
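The 0% hit rate is easy to reproduce with a toy simulation. The sketch below models only the access pattern (layer indices cycling 0..59 through an LRU of capacity N), not the real loader. The pass count of 392 is inferred from 23,520 loads divided by 60 layers, assuming each forward pass loads every layer exactly once.

```python
from collections import OrderedDict

NUM_LAYERS = 60


def simulate_lru(capacity, passes, num_layers=NUM_LAYERS):
    """Count cache hits and layer loads for repeated in-order passes."""
    cache = OrderedDict()  # keys are layer indices, ordered oldest -> newest
    hits = loads = 0
    for _ in range(passes):
        for layer in range(num_layers):
            if layer in cache:
                hits += 1
                cache.move_to_end(layer)       # mark as most recently used
            else:
                loads += 1
                cache[layer] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used
    return hits, loads


PASSES = 23_520 // NUM_LAYERS  # 392, inferred from the reported layer_loads

results = {n: simulate_lru(n, PASSES) for n in (2, 4, 8, 16, 60)}
for n, (hits, loads) in results.items():
    print(f"N={n:2d}: hits={hits:6d} loads={loads:6d}")
```

Every capacity below 60 produces zero hits and the same 23,520 loads; only a cache large enough to hold all 60 layers breaks the pattern, and that is exactly the configuration the working-set cap rules out.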

That's the kind of finding that only falls out of measuring instead of guessing. It also reframes the remaining work: the path forward isn't "tune the cache," it's architectural - KV streaming, quantization tier-mixing, a smarter-than-LRU policy that respects the sequential structure.

Killing the eviction overhead

Phase 3's instrumentation surfaced an embarrassing dominator. Of the hero run's wall time, ~910 s was spent in _evict() - 2.7× the actual load path - almost all of it in a refresh-cache → clear-cache → gc.collect round-trip that MLX needed in order to hand its GPU buffer back to the pool after every single layer eviction. We were paying a garbage-collection tax 23,520 times.
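Dividing those totals through gives a sense of scale. This is back-of-envelope arithmetic on the rounded figures above, assuming one eviction per layer load.

```python
evict_wall_s = 910.0  # approximate time spent in _evict(), from the text
evictions = 23_520    # assumed equal to layer_loads: one eviction per load
evict_to_load = 2.7   # _evict() time as a multiple of the load path

per_evict_ms = evict_wall_s / evictions * 1000.0
load_wall_s = evict_wall_s / evict_to_load

print(f"per eviction: {per_evict_ms:.1f} ms")
print(f"implied load path total: {load_wall_s:.0f} s")
```

Tens of milliseconds per eviction sounds small until it is multiplied by every layer of every forward pass.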

Phase 4 replaced it with a persistent LayerBufferPool: K slots, each owning the canonical array references for one resident layer. Slot reuse is the eviction - no cache refresh, no clear_cache, no gc.collect - and the old-layer rebind is batched with the new-layer install into a single weight-load call.

The buffer pool eliminates the eviction-path overhead by construction: 2× fewer mx.load round-trips and a 19.5% cut in MLX-side wall time, with bit-identical numerical output on the validation model and the K=2 residency bound enforced by the pool itself.
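The "slot reuse is the eviction" idea can be sketched without any MLX at all. What follows is a hypothetical, simplified pool, not the library's LayerBufferPool: the acquire method, the round-robin victim choice, and the loader callback are all assumptions made for illustration, and the batching of rebind and install into one weight-load call is not modeled.

```python
class LayerBufferPool:
    """Hypothetical sketch: K fixed slots; overwriting a slot evicts its old layer."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # each entry: (layer_index, weights) or None
        self.index = {}                  # layer_index -> slot number
        self.next_slot = 0               # round-robin pointer to the next victim
        self.installs = 0                # number of loads performed

    def acquire(self, layer_index, loader):
        """Return weights for layer_index, loading into a reused slot on a miss."""
        if layer_index in self.index:
            return self.slots[self.index[layer_index]][1]
        slot = self.next_slot
        self.next_slot = (slot + 1) % len(self.slots)
        old = self.slots[slot]
        if old is not None:
            del self.index[old[0]]       # forget the old layer; no separate evict step
        self.slots[slot] = (layer_index, loader(layer_index))
        self.index[layer_index] = slot
        self.installs += 1
        return self.slots[slot][1]

    def resident_layers(self):
        return sorted(self.index)


pool = LayerBufferPool(num_slots=2)
max_resident = 0
for layer in range(60):
    weights = pool.acquire(layer, lambda i: f"weights-{i}")
    max_resident = max(max_resident, len(pool.resident_layers()))

print(pool.resident_layers(), pool.installs, max_resident)
```

Because the pool owns a fixed number of slots, the residency bound holds structurally: there is no code path that could leave a third layer alive.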

And here is the honest punchline: the bench wall only dropped 5.9%, and decode stayed flat at 0.12 tok/s. We deleted the substrate's biggest self-inflicted cost and confirmed what Phase 3 already suspected - the remaining wall is on-GPU compute. The dominator we attacked is gone; the next one is the matmuls themselves.

A worked-backwards investigation

What I'm proudest of here isn't the streaming engine - it's the method. Every phase poses one falsifiable question, runs it on the same fingerprinted machine, and commits the JSON for every published number. Negative results are first-class: "tier knobs don't work" and "a bigger cache does nothing" are load-bearing findings, not failures to hide.

Phase	Question	Verdict
0	Does bare lazy-mmap run the 31B?	No - OOMs. E4B validates the harness at 32.9 tok/s.
1	Can memory-tier knobs keep it resident?	No - the 10.67 GB working-set cap is hard; the model is 18.44 GB.
2	Can per-layer streaming host it?	Yes - 0.13 tok/s, 2.35 GB pool. OOM wall → I/O wall.
3	Does a bigger residency cache help?	No - LRU vs sequential access = 0% hits. Compute-bound.
4	Can a buffer pool kill the evict tax?	Yes - −19.5% MLX work; the compute floor remains.

18.44 GB model on disk

10.67 GB Metal cap

2.35 GB peak pool, streamed

16 GB M4 Air it runs on

What it proves

metalstream is a Python library - a drop-in MLX add-on, not a fork - and the headline finding is narrow and real: a 31B Q4 model now produces tokens on a 16 GB Air at all, slowly and honestly, with output that is bit-identical to the non-streaming baseline on the validation model. More importantly, the project maps the terrain: the ceiling is a fixed Metal working-set cap, streaming clears it, the residual cost is GPU compute rather than storage, and the eviction overhead is an implementation tax you can delete.

It is not fast, and it is structurally Apple-only - Metal is the whole point. But it's the substrate that lets a model whose reasoning ceiling used to require a $4k laptop run on the MacBook Air you already own. Paired with aircoder - which is small-context-by-design - you get the two halves of one bet: an agent that survives long tasks, and a runtime that survives oversized models.

metalstream is an MIT-licensed Python package with a reproducible benchmark harness; every published number ships with its result JSON. Test bed: Apple M4 (10c/10c), 16 GB unified memory, macOS 15.7, MLX 0.31.
