Every local-LLM user has hit the same wall. A coding task is going fine for fifteen turns, then the agent forgets which file it was editing, hallucinates an import, loops on evidence it already gathered, or just runs out of memory when its KV cache outgrows your RAM. The model isn't dumb. It is forgetting. aircoder is my attempt to fix that - not with a bigger model, but with a different architecture.
The problem: the prompt is a terrible place to keep state
Cloud agents paper over the forgetting problem with frontier models and 200k-token context windows. Run locally on a laptop and there is no escape hatch: a 7B model with an 8k effective context cannot hold a twelve-step task, all the files it has read, and every test result in its head at once. As the conversation grows, three things happen - all bad:
- Dilution. The instruction that mattered on turn 2 is now buried under ten turns of tool output the model has to re-read every step.
- Drift. Re-reading a long, noisy history each turn, a small model loses the thread - it re-opens files it already saw, or contradicts a decision it made earlier.
- Death. The KV cache grows with the context until the process OOMs. On a 16 GB machine that can happen well before the task is done.
The conventional answer is "summarize the history when it gets long." That helps with dilution but not drift, and it is lossy in exactly the way that bites you later. aircoder takes the harder position: the prompt should never be the system of record in the first place.
The idea: state lives on disk, not in the prompt
aircoder records every observation the agent makes as a typed evidence row in a SQLite ledger. Reading a file produces a file_read row. Running the tests produces a test_result row. Forming a hunch produces a decision row. The model never re-reads raw history - instead, each step assembles a fresh, compact prompt by querying the ledger for the handful of evidence rows that matter right now.
The supervisor recognizes seven kinds of evidence:
| Evidence kind | Produced when… |
|---|---|
| file_read | the agent opens a file or a span of one |
| diagnostic | a compiler/linter message - or an internal parser/inference failure - is captured |
| test_result | a test run reports pass/fail |
| edit_applied | a change is written to the working tree |
| decision | the model commits to a hypothesis or a fix |
| shell_output | an arbitrary command's output is recorded |
| symbol_lookup | a symbol is resolved against the repo index |
Because every row carries a stable id, a short summary, and the verbatim payload, future steps can reference evidence by predicate ("the last failing test", "the decision about token_utils.py") rather than by scrolling back through a transcript. The model's context window stays nearly empty no matter how long the task runs.
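To make "reference evidence by predicate" concrete, here is a minimal sketch that uses an in-memory Vec as a stand-in for the SQLite table. Everything in it is an assumption made for illustration: the Ledger and EvidenceRow types, the method names, and the convention that a failing test's summary starts with "FAIL" are not taken from aircoder's source.

```rust
// Illustrative only: an in-memory stand-in for the SQLite ledger.
// Type, field, and method names are assumptions, not aircoder's API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EvidenceKind {
    FileRead,
    Diagnostic,
    TestResult,
    EditApplied,
    Decision,
    ShellOutput,
    SymbolLookup,
}

#[derive(Debug, Clone)]
struct EvidenceRow {
    id: u64,          // stable id, assigned on insert
    kind: EvidenceKind,
    subject: String,  // a path, command, or symbol
    summary: String,  // the short text a later prompt includes
    content: String,  // verbatim payload, kept out of the prompt
}

#[derive(Debug, Default)]
struct Ledger {
    rows: Vec<EvidenceRow>,
}

impl Ledger {
    /// Append a row and return its stable id.
    fn record(&mut self, kind: EvidenceKind, subject: &str, summary: &str, content: &str) -> u64 {
        let id = self.rows.len() as u64 + 1;
        self.rows.push(EvidenceRow {
            id,
            kind,
            subject: subject.to_string(),
            summary: summary.to_string(),
            content: content.to_string(),
        });
        id
    }

    /// Newest row matching a predicate, e.g. "the last failing test".
    fn latest(&self, pred: impl Fn(&EvidenceRow) -> bool) -> Option<&EvidenceRow> {
        self.rows.iter().rev().find(|row| pred(*row))
    }

    /// Prompt lines for the newest `limit` matching rows, oldest first.
    /// Only ids and summaries are emitted; payloads stay in the ledger.
    fn prompt_lines(&self, pred: impl Fn(&EvidenceRow) -> bool, limit: usize) -> Vec<String> {
        let mut lines: Vec<String> = self
            .rows
            .iter()
            .rev()
            .filter(|row| pred(*row))
            .take(limit)
            .map(|row| format!("[{}] {:?} {}: {}", row.id, row.kind, row.subject, row.summary))
            .collect();
        lines.reverse();
        lines
    }
}
```

With this shape, "the last failing test" becomes ledger.latest(|r| r.kind == EvidenceKind::TestResult && r.summary.starts_with("FAIL")), and the prompt size is capped by limit rather than by how many steps have run.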
The agent loop
One step of aircoder is a tight, four-beat cycle. The supervisor assembles a compact episode, the model answers with exactly one structured action, the supervisor commits the consequences to the ledger as a single transaction, and the next step queries the freshly-updated ledger. There is no growing transcript - every step starts from the durable state, not from a chat log.
The model's whole job each turn is to emit one JSON object on one line - no prose, no markdown fences, no commentary - picked from a small set of action shapes (wrapped across lines here for readability):
{"action":"record_evidence",
"kind":"file_read|diagnostic|test_result|edit_applied|decision|shell_output|symbol_lookup",
"subject":"<short identifier - a path, command, or symbol>",
"summary":"<one or two lines the next step will read>",
"content":"<verbatim payload bytes>"}
{"action":"spawn_child",
"hypothesis":"<one line: what this sub-investigation should explore>"}
{"action":"resolve",
"cites":["<evidence-id>", "..."],
"summary":"<why the task (or sub-task) is now done>"}
That structured-action contract is what makes the loop legible. The agent isn't free-forming a conversation; it's appending rows to a database and walking a plan tree. spawn_child opens a sub-hypothesis (depth-first investigation); resolve closes one and must cite the evidence ids that justify the conclusion - so every resolution is traceable back to the rows that earned it.
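The plan-tree walk described above can be sketched in a few dozen lines. This is a toy model under stated assumptions: the PlanTree, PlanNode, and Rejected names are hypothetical, evidence ids are plain strings, and the rule that a resolve is rejected when it cites nothing or cites an unknown id is my reading of "must cite", not confirmed supervisor behaviour.

```rust
use std::collections::HashSet;

// Hypothetical types for illustration; not aircoder's actual plan-tree schema.
#[derive(Debug)]
struct PlanNode {
    hypothesis: String,
    parent: Option<usize>,
    resolved_by: Option<Vec<String>>, // evidence ids cited at resolution
}

#[derive(Debug)]
struct PlanTree {
    nodes: Vec<PlanNode>,
    current: Option<usize>, // node the next action applies to; None once the root resolves
}

#[derive(Debug, PartialEq)]
enum Rejected {
    NothingOpen,
    NoCitations,
    UnknownEvidence(String),
}

impl PlanTree {
    fn new(task: &str) -> Self {
        let root = PlanNode { hypothesis: task.to_string(), parent: None, resolved_by: None };
        PlanTree { nodes: vec![root], current: Some(0) }
    }

    /// spawn_child: open a sub-hypothesis under the current node and descend into it.
    fn spawn_child(&mut self, hypothesis: &str) -> Result<usize, Rejected> {
        let parent = self.current.ok_or(Rejected::NothingOpen)?;
        self.nodes.push(PlanNode {
            hypothesis: hypothesis.to_string(),
            parent: Some(parent),
            resolved_by: None,
        });
        let id = self.nodes.len() - 1;
        self.current = Some(id);
        Ok(id)
    }

    /// resolve: close the current node, but only with citations that exist in the ledger.
    fn resolve(&mut self, cites: &[String], known_ids: &HashSet<String>) -> Result<(), Rejected> {
        let node = self.current.ok_or(Rejected::NothingOpen)?;
        if cites.is_empty() {
            return Err(Rejected::NoCitations);
        }
        if let Some(missing) = cites.iter().find(|id| !known_ids.contains(*id)) {
            return Err(Rejected::UnknownEvidence(missing.clone()));
        }
        self.nodes[node].resolved_by = Some(cites.to_vec());
        self.current = self.nodes[node].parent; // pop back to the parent hypothesis
        Ok(())
    }
}
```

A rejected resolve leaves current untouched, so the model gets another turn on the same node instead of closing it without support.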
Every step is a checkpoint
A long-horizon agent has to survive SIGINT and kill -9. If your laptop sleeps, the model server hiccups, or you simply Ctrl-C a twenty-minute run, you should be able to resume without redoing work - and without a corrupt half-step poisoning the ledger.
aircoder gets this from a single design rule: everything a step mutates is committed in one SQLite transaction. The episode row, every evidence row it recorded, the plan-tree mutation, and the cursor bump all land inside one BEGIN … COMMIT:
// crates/aircoder-supervisor/src/supervisor.rs
let tx = conn.unchecked_transaction()?;           // BEGIN DEFERRED
self.write_episode_row(&tx, &task, episode)?;     // the model's prompt + action
match action {
    Action::RecordEvidence(ev) => insert_evidence(&tx, ev)?,
    Action::SpawnChild { .. } => insert_plan_node(&tx, ...)?,
    Action::Resolve { .. } => mark_node_resolved(&tx, ...)?,
}
maybe_insert_plan_revision(&tx, action, ...)?;    // structural change + its narrative, atomically
self.cursor.bump_in_tx(&tx, task, step_index)?;   // advance the resume cursor
tx.commit()?;                                     // COMMIT - all of it, or none of it
If the OS kills the process anywhere inside that block, the transaction never reaches COMMIT, so SQLite discards the partial work: frames in the write-ahead log that lack a commit record are ignored the next time the database is opened. The ledger is therefore always at a clean boundary: either "step N is fully committed" or "step N never happened." Resuming just reads the cursor and starts at N+1.
The guarantee, in three invariants: (1) No torn writes - a step is all-or-nothing. (2) Zero rework on resume - the model is never re-asked for a step that already committed. (3) Convergent final state - a run interrupted any number of times produces the same logical ledger as an uninterrupted one. All three are checked by a property test, not just asserted in prose.
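The three invariants are easier to see in a toy model than in prose. The sketch below does not use SQLite at all: it stands in for a transaction by mutating a staged copy and publishing it only at "commit", and it simulates a kill by returning before that point. The Ledger shape, run_step, and run_to_completion are hypothetical names for illustration, and this is not the project's property test.

```rust
// Toy model: a staged copy plays the role of an open transaction.
#[derive(Debug, Clone, PartialEq, Default)]
struct Ledger {
    rows: Vec<String>, // committed evidence, one entry per step in this toy
    cursor: usize,     // index of the next step to run
}

/// Run one step against a staged copy and publish it only on "commit".
/// `crash` simulates the process dying mid-step: the staged copy is dropped.
fn run_step(ledger: &mut Ledger, script: &[&str], crash: bool) -> bool {
    let mut staged = ledger.clone(); // BEGIN
    staged.rows.push(script[staged.cursor].to_string());
    if crash {
        return false; // killed before COMMIT: nothing is published
    }
    staged.cursor += 1;
    *ledger = staged; // COMMIT: row and cursor bump land together
    true
}

/// Drive the script to completion, crashing on the attempt numbers in `crash_on`.
/// Returns the final ledger and the total number of attempts made.
fn run_to_completion(script: &[&str], crash_on: &[usize]) -> (Ledger, usize) {
    let mut ledger = Ledger::default();
    let mut attempts = 0;
    while ledger.cursor < script.len() {
        let crash = crash_on.contains(&attempts);
        attempts += 1;
        // A "resume" is just the next iteration: it reads the cursor and carries on.
        run_step(&mut ledger, script, crash);
    }
    (ledger, attempts)
}
```

Running a three-step script cleanly and again with crashes on attempts 0, 2, and 3 yields equal ledgers. The interrupted run makes three extra attempts, one per crash, and never repeats a step that had already committed.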
When the model misbehaves
Small local models are unreliable narrators. They wrap JSON in markdown fences, emit a chatty preamble before the object, or fail to close a brace. A brittle agent crashes on the first malformed turn. aircoder's inference layer is built to expect this:
- The parser strips markdown fences and takes the last balanced JSON object if the model emits a preamble.
- It accepts cites as either a string or a list-of-strings - the kind of schema flex small models constantly need.
- When parsing fails - or the inference HTTP call itself fails - the supervisor doesn't panic. It synthesizes a diagnostic evidence row (mlx_decider/parser_error or mlx_decider/inference_error) and keeps the loop moving. The failure becomes a grep-able fact in the ledger instead of a dead process.
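The tolerant-parsing rules above can be approximated with a small scanner. This is a sketch of one way to do it, not aircoder's parser: last_balanced_object and Cites are hypothetical names, and instead of stripping fences explicitly it ignores everything outside the last balanced {...} span, which covers fences and preambles alike.

```rust
/// Return the last top-level balanced `{...}` span in `raw`, if any.
/// Braces inside JSON string literals are ignored; an unclosed object yields None.
fn last_balanced_object(raw: &str) -> Option<&str> {
    let mut depth = 0usize;
    let mut start = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    let mut last = None;

    for (i, c) in raw.char_indices() {
        if in_string {
            // Inside a string literal only `\` escapes and the closing quote matter.
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            // Quotes in a chatty preamble (depth 0) are not JSON strings.
            '"' if depth > 0 => in_string = true,
            '{' => {
                if depth == 0 {
                    start = i;
                }
                depth += 1;
            }
            '}' if depth > 0 => {
                depth -= 1;
                if depth == 0 {
                    last = Some(&raw[start..=i]);
                }
            }
            _ => {}
        }
    }
    last
}

/// `cites` as the model may emit it: a bare string or a list of strings.
enum Cites {
    One(String),
    Many(Vec<String>),
}

impl Cites {
    /// Normalize both shapes to a list so downstream code sees one type.
    fn into_vec(self) -> Vec<String> {
        match self {
            Cites::One(id) => vec![id],
            Cites::Many(ids) => ids,
        }
    }
}
```

One limitation of this sketch: a stray unmatched { in the preamble keeps the depth above zero, so nothing is returned. In that case the caller would fall through to the parser_error path described above.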
The agent talks to any OpenAI-compatible chat endpoint - by default a local mlx_lm.server on 127.0.0.1:8765 - so it runs against whatever small model you can fit, with no cloud dependency.
The architecture: nine small crates
aircoder is a Rust workspace split into nine focused crates. The split isn't ceremony - it keeps the model-facing seam (inference), the durable seam (ledger), and the decision logic (supervisor) independently testable.
cargo test --workspace is green at 94 tests, and clippy runs clean with -D warnings.
Does it actually work?
Yes - and the honest version of "yes" matters here. The deterministic, script-mode demo (a recorded action stream) proves the loop end-to-end on any machine with cargo, no model required. But the interesting question is whether a real small local model can drive it. The first time I pointed it at a live mlx_lm.server, a 3B quantized coder model diagnosed a real bug - a JWT helper missing its base64 padding before urlsafe_b64decode - and resolved the task cleanly.
Step one emitted a decision evidence row naming the file and the one-line fix; step two emitted a resolve whose cites array correctly referenced that decision's id. The entire run is a five-row ledger you can inspect after the fact - 2 episodes, 2 evidence rows, a single plan node, 9,644 bytes of inline payload. That is the whole point: the agent's reasoning is a queryable artifact, not a vanished chat log.
The model used was Qwen2.5-Coder-3B-Instruct-4bit, and getting there took four iterations on the task description - small models are sensitive to how the job is framed. That's a real limitation, honestly reported: aircoder makes a small model survive a long task; it doesn't make a small model as smart as a frontier one.
What it is - and what it isn't
aircoder is an argument that the path to capable local agents isn't only "a better model." It's a better memory architecture around the model. By treating the LLM as a stateless decision function and the disk as the system of record, you get three things a longer context window can't buy you: bounded memory, crash-resumability, and a fully auditable trace of why the agent did what it did.
It is not a frontier-coding replacement for tasks where raw reasoning is the bottleneck - it is architecturally different, not architecturally smarter. But for the world where you want an agent that survives a long task on the laptop you already own, externalizing memory to disk turns out to be most of the battle.
aircoder is a nine-crate Rust workspace (Apache-2.0). It pairs naturally with metalstream, a substrate that streams oversized model weights off NVMe - the two are small-context-by-design and small-memory-by-design, respectively.