The colleague who hallucinates (and the fences that keep it honest)

· series: Working with the Machine

One morning in April I almost shipped a fix for a problem that didn’t exist.

I’d sent three agents off to audit the project’s own governance — read the rules, run the checks, report what’s broken. Routine housekeeping. One of them came back certain: four of my rule files were missing a required field, and a check that was supposed to guard exactly that was failing in CI. It wrote this up cleanly. Specific files. Specific field. A failing gate. It even proposed the four commits to fix it.

None of it was true. The four files had the field. The gate passed — twenty-seven of twenty-seven. The agent had run an audit, looked at green, and reported red. Not maliciously. It just produced the shape of a finding that sounded right, the way the tavern-keeper in my first post produced a dead brother who never existed.

I’d already started consolidating its report into a patch plan when something felt off and I went and looked myself.

The thing that saved me was a human glance

Sit with that, because it’s the whole problem.

The only reason a fabricated bug report didn’t turn into four real commits against my codebase is that I happened to be paying attention that morning. I read the agent’s confident paragraph, felt a flicker of “wait, didn’t I just fix that?”, and pulled the file up by hand.

That is not a control. That’s luck wearing the costume of a control. It doesn’t scale past the days I’m sharp, it doesn’t survive me being tired, and it definitely doesn’t survive the thing I actually want — more agents, doing more work, while I’m not watching every keystroke. If my safety net is “Carlos notices,” then every hour the machine works unsupervised is an hour with no net at all.

So the question stopped being “how do I get the AI to stop hallucinating?” (you don’t; that’s like asking the dice to stop rolling) and became something I could actually build against:

You don’t make it honest. You make it checkable.

This is the same move I made when I deleted two months of working code. Back then the lesson was about contracts: an aspirational contract is one you wrote down and hope is true; a mechanical contract is one that breaks the build when it’s violated. I’d been trusting aspirational contracts and calling it engineering.

The collaborator is the same story, one level up. I can’t audit the AI’s intentions — it doesn’t really have any, and even if it did I can’t see them. What I can do is build the workspace so that the dangerous things it might do are either impossible or self-reporting. Not “please be careful.” Structurally unable to be careless without it showing.

An honest colleague you can’t verify is worth less than a careless one you can. Skaldborn is built by an AI I assume is careless, inside fences that make carelessness bounce off.

Here’s what those fences actually are.

What an AI colleague actually is on this project

There isn’t one AI on Skaldborn. There are several, and they’re deliberately not the same actor.

The work is split into roles, each with its own context and its own blinders:

  • The architect plans and writes the decision records. It never implements.
  • The writer implements exactly one scoped slice. It never plans.
  • The reviewer reads the writer’s diff against the contract. It never writes code.
  • The verifier runs the scenario and checks pass/fail. It doesn’t reason about implementation.
  • The steward keeps the trackers and manifests honest.
  • The orchestrator runs the loop — and is forbidden, in writing, from reading source code at all.

That last one sounds insane until you see why. The orchestrator’s job is to decide what gets dispatched to whom. The moment it starts reading source to “just check something,” it’s no longer scoping work — it’s doing the work, in the one context that’s supposed to stay clean enough to catch when a piece of work is mis-scoped. So its rule file says, flatly: contracts, logs, health output, packet history — never source. If it can’t figure out a dispatch from those, the correct output isn’t a guess. It’s the sentence “contracts are underspecified,” and a stop.

I launch these as separate sessions with a one-line script. The role isn’t a vibe — it’s bound at launch, down to which model runs it:

$ start-claude.sh writer-query-api --role writer -p "Implement the bounded slice in the packet ..."

Remote-control name: sklaude-writer-query-api-04417
Claude Code session 'skaldborn-writer-query-api-04417' started.
Role: writer (model: claude-opus-4-8)
Expected handoff: state/coordination/writer-query-api-handoff.md

start-claude.sh spins up a detached session, pins the model to the role, drops a sentinel file that says “this slug is running,” and tells me exactly which file will appear when the session is done. Completion isn’t “the agent said it finished.” Completion is a handoff file existing on disk. The session can crash, run out of context, or wander off; the only thing that counts as done is the artifact landing where the script said it would.
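A minimal sketch of that completion rule, written in Python rather than the shell the real launcher uses. The polling loop, the timeout, and both function names are assumptions for illustration; only the handoff path pattern comes from the log above.

```python
import time
from pathlib import Path


def handoff_path(slug: str, root: Path = Path("state/coordination")) -> Path:
    """Expected handoff location for a session slug (pattern inferred from the log above)."""
    return root / f"{slug}-handoff.md"


def wait_for_handoff(slug: str, root: Path, timeout_s: float, poll_s: float = 1.0) -> bool:
    """Return True once the handoff file exists, False if the timeout passes first.

    Nothing the session says about itself is consulted; only the artifact counts.
    """
    target = handoff_path(slug, root)
    deadline = time.monotonic() + timeout_s
    while True:
        if target.is_file():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A caller would treat a False return as "not done", whatever the session's last message claimed.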

None of this matters, though, without the part that makes the writer’s blinders real.

The fence around the writer

A writer agent is handed a packet — a scoped work order. The packet names the component it’s allowed to touch and, critically, lists allowed_source_roots and forbidden_source_roots: the exact directories it may read and write, and the ones it may not.

I used to hope the agent would respect that. Now a hook enforces it, and the agent’s own good intentions are irrelevant.

Before any writer is dispatched, the orchestrator stages a tiny scope file naming the component contract. Whenever the writer tries to touch a file, a PreToolUse hook (enforce-component-boundary.sh) wakes up, reads that contract’s allowed roots, and checks the path. If the file is outside the fence, the tool call doesn’t happen:

{
  "hookSpecificOutput": {
    "hookEventName": "PreToolUse",
    "permissionDecision": "deny",
    "permissionDecisionReason": "Component boundary violation: <path> is outside allowed roots for component '<name>'."
  }
}

That’s not a log line written after the fact. It’s a refusal before the edit. A writer scoped to the query API physically cannot open a simulation file, no matter how reasonable its plan to do so sounds. And there’s a meaner case the hook handles too: if a writer is dispatched with no scope staged at all — the orchestrator forgot — it doesn’t shrug and allow everything. It denies the very first file operation, with a reason that says exactly what went wrong: component scope missing — the orchestrator must stage scope before calling the writer.

This is the whole philosophy in one design choice: fail closed. A misconfigured dispatch — the exact kind of mistake a busy orchestrator makes at 1am — doesn’t quietly grant an agent the run of the repo. It stops the agent at the door. The unsafe default is the one that can’t happen.
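The decision the hook makes can be sketched roughly like this. It is a Python stand-in, not the contents of enforce-component-boundary.sh: the scope dictionary layout, the "component" key, and the helper names are hypothetical, and returning None is assumed to mean "emit nothing and let the call proceed". It does not resolve symlinks.

```python
import posixpath
from pathlib import PurePosixPath
from typing import Optional


def _deny(reason: str) -> dict:
    """Build a deny payload in the shape shown earlier in the post."""
    return {
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "deny",
            "permissionDecisionReason": reason,
        }
    }


def _inside(path: PurePosixPath, root: str) -> bool:
    """True if path is the root itself or sits somewhere beneath it."""
    root_path = PurePosixPath(posixpath.normpath(root))
    return path == root_path or root_path in path.parents


def decide(path: str, scope: Optional[dict]) -> Optional[dict]:
    """Return a deny payload, or None when the path is inside the fence."""
    # Fail closed: no staged scope, or a scope with no allowed roots, denies everything.
    if not scope or not scope.get("allowed_source_roots"):
        return _deny(
            "Component scope missing: the orchestrator must stage scope "
            "before calling the writer."
        )
    # Collapse ".." segments so "src/query/../simulation" cannot slip past the check.
    target = PurePosixPath(posixpath.normpath(path))
    name = scope.get("component", "<unknown>")
    violation = _deny(
        f"Component boundary violation: {path} is outside allowed roots "
        f"for component '{name}'."
    )
    # Forbidden roots win even when they are nested inside an allowed root.
    if any(_inside(target, root) for root in scope.get("forbidden_source_roots", [])):
        return violation
    if not any(_inside(target, root) for root in scope["allowed_source_roots"]):
        return violation
    return None
```

The ordering is the design choice worth noticing: the missing-scope check comes first, so there is no code path where an unconfigured writer reaches the "allow" branch.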

(People ask why I don’t just give each writer its own git worktree. I tried. Eleven stale worktrees once stranded eighteen commits, and I went looking for them like lost luggage. The boundary hook plus packet scope gives me the isolation without the graveyard. Writers commit straight to main, inside their fence.)

The fence around the investigator

The boundary hook guards writing. But my April near-miss wasn’t a writer. It was an investigator — an agent sent to look around and report. Those are the dangerous ones, because their output isn’t a diff you can review line by line. It’s a confident paragraph, and confident paragraphs are exactly what these models are best at producing whether or not they’re true.

So that incident got its own fence — and it’s my favorite one, because of what it forces.

Now, before any investigation-class agent can be dispatched, the dispatcher has to write a sidecar that says, in advance, how the finding will be proven. If it doesn’t, the hook refuses the dispatch outright — no sidecar staged at state/coordination/pending-subagent-audit.json (or its freeform sibling), no Agent call fires at all.

The audit sidecar isn’t bureaucracy. It makes the claim falsifiable by construction:

{
  "packet_type": "audit",
  "canonical_commands": ["make validate-governance-enforcement-declared"],
  "canonical_files": ["docs/governance/playbooks/..."],
  "expected_evidence_format": "raw_stdout"
}

canonical_commands is the exact command whose output is the evidence. (In this case, a check that walks every governance rule in the repo and asserts each one declares how it’s enforced — a real script, a hook, a CI target — versus being a nice paragraph nobody runs. It’s the gate the April agent claimed was failing.) expected_evidence_format: raw_stdout means the agent’s report has to quote that command’s actual output, not summarize it, not characterize it — paste it. You can’t claim a gate is red if the gate’s own stdout, sitting in your report, says green.

Think about what this would have done in April. The agent that fabricated a failing check could not have been dispatched until its dispatcher had declared, in advance: the proof of this finding is the verbatim output of this command. And the verbatim output was twenty-seven of twenty-seven, passing. The lie and its own disproof would have been in the same report. I’d still have needed a glance to catch it, but now the glance is guaranteed to have something to catch on, every time, whether or not I’m sharp that morning.
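The post does not say whether anything checks the quote mechanically, so treat the following as one possible extension, not a description of the real setup. It is a Python sketch that tests whether a report contains a command's stdout verbatim; the function names and the sample output text in the usage note are invented for illustration.

```python
def _normalize(text: str) -> str:
    """Unify line endings and drop trailing whitespace so cosmetic noise is ignored."""
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n")


def evidence_is_quoted(report: str, raw_stdout: str) -> bool:
    """True only if the command's stdout appears in the report as one contiguous block."""
    needle = _normalize(raw_stdout)
    if not needle:
        # Empty output is no evidence at all; refuse rather than pass vacuously.
        return False
    return needle in _normalize(report)
```

With a hypothetical stdout of "checked 27 rules" followed by "27/27 passed", a report that pastes those two lines passes the check, and a report that only asserts "the gate fails in CI" does not.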

There’s an escape hatch, because not every look-around is a fact-claim. “Find me the usages of this function” isn’t an audit; it’s a search. For those, the dispatcher writes a freeform sidecar with one required field — a plain-English reason — and the hook lets it through. That’s deliberate. The discipline I want isn’t “fill out forms forever.” It’s: if you’re going to assert a fact about my repo, name your proof up front, and if you’re just poking around, say so out loud. The escape hatch is visible in the audit trail too, so “freeform everything” can’t quietly become the lazy default.
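Both sidecar shapes and the refusal can be sketched together. This is a Python approximation under stated assumptions: the audit filename and its fields come from the post, while the freeform filename, the "reason" field name, and the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional

AUDIT_NAME = "pending-subagent-audit.json"
FREEFORM_NAME = "pending-subagent-freeform.json"  # hypothetical: the post does not name it
AUDIT_REQUIRED = ("canonical_commands", "canonical_files", "expected_evidence_format")


def _load(path: Path) -> Optional[dict]:
    """Parse a sidecar; unreadable or malformed files count as absent content."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    return data if isinstance(data, dict) else None


def dispatch_block_reason(coordination_dir: Path) -> Optional[str]:
    """Return why an investigation dispatch must be refused, or None if it may fire."""
    audit = coordination_dir / AUDIT_NAME
    freeform = coordination_dir / FREEFORM_NAME
    if audit.exists():
        data = _load(audit)
        if data is None or data.get("packet_type") != "audit":
            return "audit sidecar is malformed"
        missing = [field for field in AUDIT_REQUIRED if not data.get(field)]
        return f"audit sidecar missing: {', '.join(missing)}" if missing else None
    if freeform.exists():
        data = _load(freeform)
        reason = data.get("reason") if data else None
        if isinstance(reason, str) and reason.strip():
            return None
        return "freeform sidecar needs a plain-English reason"
    # Fail closed: nothing staged means no Agent call.
    return "no sidecar staged"
```

Every branch that is not explicitly valid returns a refusal, so a half-written or corrupted sidecar behaves like a missing one.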

How the fences come together: the loop

Day to day, the orchestrator runs a loop that’s almost boring, which is the point. Boring is what you want from the thing holding the leashes.

It reads the next unblocked packet. It stages the writer’s scope. It dispatches the writer — which can now only touch its fence. When the writer’s done, it dispatches a separate reviewer against the diff and the contract. When the reviewer approves, a verifier runs the regression profile. Only then does the packet close.

No single agent both writes the code and blesses it. No agent that’s reasoning about the system is also the one searching it and reporting facts about it. The investigator that could hallucinate can’t be dispatched without its proof declared up front. The writer that could overreach can’t open a file outside its lines. And the orchestrator coordinating all of it can’t read source to form opinions of its own.

Every one of those is a structural separation, not a polite request. The agents are good colleagues. I just don’t build as though they are.
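The shape of that loop for a single packet, as a Python sketch. The Packet fields, the step signatures, and the status strings are all assumptions; the real orchestrator launches separate sessions, which are stood in for here by plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Packet:
    slug: str       # e.g. "writer-query-api"
    component: str  # the component whose contract scopes the writer


# Each role is a separate callable so no step can quietly do another's job.
Step = Callable[[Packet], bool]


def run_packet(
    packet: Packet,
    stage_scope: Callable[[Packet], None],
    writer: Step,
    reviewer: Step,
    verifier: Step,
) -> str:
    """Run one packet through the stages in order, stopping at the first failure."""
    stage_scope(packet)  # scope is staged before the writer ever runs
    if not writer(packet):
        return "stopped: writer produced no handoff"
    if not reviewer(packet):
        return "stopped: reviewer rejected the diff"
    if not verifier(packet):
        return "stopped: regression profile failed"
    return "closed"
```

The only way to reach "closed" is through all three roles in sequence, which is the separation the section describes.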

What it feels like to work this way

I won’t pretend it’s frictionless. It isn’t.

Every writer dispatch needs its scope staged first. Every fact-finding mission needs its proof named first. There are mornings the boundary hook denies something I genuinely wanted, and I have to stop and ask whether the fence is wrong or my plan is — and embarrassingly often, the fence is right and I was about to let an agent reach across a boundary I’d drawn for a reason.

But here’s the trade I actually made. The friction is paid by me, up front, in small visible amounts — a scope file, a named command. The alternative cost is paid later, invisibly, in fabricated bug reports that become real commits, in an agent quietly rewriting a file three components away, in the slow rot of a codebase edited by something I trusted instead of checked. I’ve paid the second kind. It’s much more expensive, and you never see the bill until it’s overdue.

The fences don’t make the AI smarter. They make its mistakes cheap and loud instead of expensive and silent. That’s the entire deal.

What I’d tell myself in February

If you’re about to hand real write access to an agent — and more of us are, every month — here’s what I wish I’d known before the April morning:

  • Assume carelessness, not malice, and definitely not competence. Design for the agent that confidently reports red on green. It will happen. Plan for it instead of being surprised by it.
  • A human glance is not a control. If your only safety net is that you’ll notice, you have no net the moment you look away — and the whole point of an agent is to look away.
  • Fail closed. A misconfigured dispatch should stop the agent at the door, not silently grant it the keys. Make the unsafe default impossible, not merely discouraged.
  • Separate the roles that could collude. Don’t let one agent both write code and approve it, or both reason about the system and report facts about it. Cheap separations prevent expensive failures.
  • Make every factual claim name its own proof. “The gate is failing” is worthless. “The gate is failing; here is its verbatim output” disproves itself when it’s wrong. Force the second shape.
  • Pay the friction up front, on purpose. A scope file and a named command are small, visible costs. The thing they prevent is a large, invisible one.

The agents do real work on Skaldborn — most of it, by volume. I sleep fine about that, not because I trust them, but because I stopped needing to.

The companion to this one is the technical build guide, “How to let an agent write your code without giving it the keys”: a line-by-line walk through the launch script, the two fail-closed hooks, and the dispatch loop, close enough to the real files that you could stand up your own version of these fences over a weekend. This post is the why; that one’s the how.

If you want to follow along, subscribe via the form at the bottom of any page — one short email when the next post lands. If you want to argue — or just tell me what you’re wiring up — write to devlog@skaldborn.com.

Everything else is the boring engineering of making it true.