<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Swain</title>
  <subtitle>A decision-capture system for solo developers working with AI coding agents</subtitle>
  <link href="https://steeringthemachine.com/feed.xml" rel="self"/>
  <link href="https://steeringthemachine.com/"/>
  <updated>2026-04-08T12:00:00.000Z</updated>
  <id>https://steeringthemachine.com/</id>
  <author>
    <name>Swain</name>
  </author>
  
  <entry>
    <title>Specs Steer Humans. Tests Constrain Agents.</title>
    <link href="https://steeringthemachine.com/posts/steering-the-machine/02-specs-steer-humans-tests-constrain-agents/"/>
    <updated>2026-04-08T12:00:00.000Z</updated>
    <id>https://steeringthemachine.com/posts/steering-the-machine/02-specs-steer-humans-tests-constrain-agents/</id>
    <content type="html">&lt;p&gt;I spent 730 commits and 14 days building &lt;a href=&quot;https://github.com/cristoslc/swain&quot;&gt;swain&lt;/a&gt;, a system for keeping AI agents aligned with what I&#39;d decided. The &lt;a href=&quot;../01-why-swain-exists/&quot;&gt;core loop&lt;/a&gt; is still right. The enforcement mechanism was wrong.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-mermaid&quot;&gt;---
title: Swain&#39;s Core Loop
---
flowchart LR
    intent[&amp;quot;Intent&amp;quot;] --&amp;gt; execution[&amp;quot;Execution&amp;quot;]
    execution --&amp;gt; evidence[&amp;quot;Evidence&amp;quot;]
    evidence --&amp;gt; reconciliation[&amp;quot;Reconciliation&amp;quot;]
    reconciliation --&amp;gt; intent

    intent ~~~ evidence
    execution ~~~ reconciliation
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;You Can&#39;t Steer With Specs&lt;/h2&gt;
&lt;p&gt;The hypothesis seemed sound: agents forget what you decided between sessions, so capture those decisions as artifacts. I built ten artifact types, a dependency graph, a roadmap renderer with Eisenhower quadrants and Gantt charts — all for the intent half of the loop. I automated reconciliation too: a retro skill that gathered session context, synthesized what happened, and classified learnings into new specs and ADRs.&lt;/p&gt;
&lt;p&gt;Agents went off the rails anyway.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/cristoslc/swain/blob/trunk/docs/swain-retro/2026-04-07-vision-006-full-session-retro.md&quot;&gt;VISION-006 session retro&lt;/a&gt; is the clearest example. In one session, the architecture shifted three times:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ports &amp;amp; adapters → plugins.&lt;/strong&gt; The initial design assumed adapters would be core contributions. A few iterations in, I realized this would block community extensions and pivoted to a plugin model.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;In-process → subprocess.&lt;/strong&gt; The agent implemented adapters as in-process Python classes. We had a documented architectural decision saying plugins must run as subprocesses. The agent&#39;s code worked. It passed tests. It was structurally wrong, and only my review caught it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;tmux scraping → HTTP API.&lt;/strong&gt; The agent built a tmux adapter that worked but was fragile. I asked why sessions weren&#39;t showing up in &lt;code&gt;tmux ls&lt;/code&gt;. Turns out &lt;code&gt;opencode run&lt;/code&gt; is single-shot — we needed &lt;code&gt;opencode serve&lt;/code&gt;, a completely different architecture.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
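&lt;p&gt;The subprocess boundary from pivot two can be sketched in a few lines. This is a hypothetical shape, not swain&#39;s actual plugin protocol; the function name and the JSON-over-stdio framing are assumptions for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json
import subprocess

# Hypothetical plugin call: the adapter runs as a separate process and
# talks over stdin/stdout, so it cannot reach into core internals.
def call_plugin(argv, request):
    # Serialize the request, hand it to the subprocess, parse the reply.
    proc = subprocess.run(
        argv,
        input=json.dumps(request),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)

# The in-process version the agent built was a direct method call. It
# worked and passed tests, but nothing stopped the adapter from importing
# core modules: the boundary existed only by convention.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A plugin that crashes or misbehaves fails loudly at the process boundary, which is exactly the structural property the in-process version could not guarantee.&lt;/p&gt;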
&lt;p&gt;The hierarchy was supposed to &lt;em&gt;tell&lt;/em&gt; agents what to build. Telling didn&#39;t work. From the &lt;a href=&quot;https://github.com/cristoslc/swain/blob/trunk/docs/swain-retro/2026-04-07-vision-006-full-session-retro.md&quot;&gt;retro&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;The first half was spent debugging live... The operator said &#39;stop. reset. TDD from architectural plan.&#39; After that, tests drove every change. Every pivot was validated before going live. &lt;strong&gt;The live debugging wasted 60+ minutes; the TDD approach wasted zero.&lt;/strong&gt;&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This wasn&#39;t a one-off. Across the project I kept finding the same failure modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Code that was tested in isolation but never wired into the system — three separate incidents (&lt;a href=&quot;https://github.com/cristoslc/swain/blob/trunk/docs/swain-retro/2026-03-28-release-skill-deletion-incident.md&quot;&gt;1&lt;/a&gt;, &lt;a href=&quot;https://github.com/cristoslc/swain/blob/trunk/docs/swain-retro/2026-03-31-dead-code-in-release.md&quot;&gt;2&lt;/a&gt;, &lt;a href=&quot;https://github.com/cristoslc/swain/blob/trunk/docs/swain-retro/2026-04-01-teardown-rewrite-release.md&quot;&gt;3&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Stale state files that referenced worktrees I&#39;d already deleted.&lt;/li&gt;
&lt;li&gt;Agents that linked new artifacts to their dependencies but never went back to update the things they depended on.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/cristoslc/swain/blob/trunk/docs/swain-retro/2026-03-22-overnight-autonomous-artifact-sweep.md&quot;&gt;overnight artifact sweep&lt;/a&gt; was the one that really got me: nine specs, already fully implemented, still marked Active. The code existed. The features worked. Nobody — human or agent — had gone back to update the spec.&lt;/p&gt;
&lt;p&gt;And the retro skill caught all of it. Every session, it would synthesize the drift and surface candidates for new specs and ADRs. I&#39;d tell swain to integrate them and move on. But the output of reconciliation was always more intent: new specs, updated ADRs, revised constraints. The next session&#39;s agent would read those artifacts, and drift again. The retro would catch that too, produce more specs, and the cycle would repeat. Automated, fast, and never catching up — because the whole system was a spec-production loop, and the steering mechanism was always one session behind the thing it was trying to steer.&lt;/p&gt;
&lt;h2&gt;Tests Run During, Not After&lt;/h2&gt;
&lt;p&gt;Halfway through that VISION-006 session, I stopped debugging live and told the agent to write tests first. The dynamic changed completely. The agent wrote the tests, then wrote code until the tests passed. I wasn&#39;t reviewing code anymore — I was reviewing tests. If a test described the wrong behavior, I&#39;d say so and the agent would rewrite it. Everything else happened in the TDD loop without me.&lt;/p&gt;
&lt;p&gt;This held across every model I used. Opus 4.6, Sonnet 4.6, Qwen3.5:397b, Gemma 4:31b — without tests they all drifted, with tests they all converged. The model didn&#39;t matter.&lt;/p&gt;
&lt;p&gt;But TDD only got me partway there. The in-process adapter implementation had tests and passed them all. It was still architecturally wrong — the tests checked behavior, not structure. I caught that violation myself. TDD kept the code functionally correct; whether it respected the boundaries I&#39;d designed was still on me.&lt;/p&gt;
&lt;h2&gt;More Tests, Not More Specs&lt;/h2&gt;
&lt;p&gt;TDD worked, but only at the behavioral layer. The 80-test suite caught regressions and kept agents converging on functional requirements. Architectural compliance still fell to me: I was the fitness function, reviewing whether agents respected subprocess boundaries and plugin isolation. When I caught a violation, I&#39;d write it into an ADR. Another spec.&lt;/p&gt;
&lt;p&gt;Swain&#39;s first iteration asked: what if we had more kinds of &lt;em&gt;artifacts&lt;/em&gt; to steer agents? Ten types, a dependency graph, automated retros feeding back into the system. The answer was clear — more specs weren&#39;t enough.&lt;/p&gt;
&lt;p&gt;The next iteration asks the opposite question: what if we had more kinds of &lt;em&gt;tests&lt;/em&gt;? Not just unit and integration tests for behavior, but fitness functions for architecture, boundary tests for isolation, protocol tests for communication contracts. Every constraint I wrote into an ADR and hoped agents would read — encode it as a test that runs during execution and fails on violation.&lt;/p&gt;
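&lt;p&gt;One such fitness function can be sketched as a plain pytest-style check. The package layout here — a &lt;code&gt;core/&lt;/code&gt; tree that must never import adapters in-process — is an assumption for illustration, not swain&#39;s real tree:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pathlib

# Hypothetical architectural fitness function: fail the suite if any core
# module imports an adapter directly. The layout is illustrative, not
# swain&#39;s actual package structure.
FORBIDDEN = &#39;import adapters&#39;

def boundary_violations(root):
    # Every core file that pulls an adapter into the core process.
    return sorted(
        str(path)
        for path in pathlib.Path(root).glob(&#39;core/**/*.py&#39;)
        if FORBIDDEN in path.read_text()
    )

def test_core_never_imports_adapters():
    assert boundary_violations(&#39;swain&#39;) == []
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unlike an ADR, a check like this runs on every change and fails on violation — it would have flagged the in-process adapter during execution instead of during my review.&lt;/p&gt;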
&lt;p&gt;Swain v1 built a spec factory. Swain v2 needs a test factory.&lt;/p&gt;
&lt;p&gt;Coming up in this series: how different kinds of tests map to different kinds of drift, what a minimum viable test suite looks like for a real project, and whether specs can be generated from evidence instead of written up front.&lt;/p&gt;
&lt;p&gt;If you&#39;re building with agents and have your own hard-won lessons, I&#39;m at &lt;a href=&quot;https://github.com/cristoslc&quot;&gt;@cristoslc&lt;/a&gt;. The &lt;a href=&quot;https://github.com/cristoslc/swain&quot;&gt;swain codebase&lt;/a&gt; is public — retros in &lt;code&gt;docs/swain-retro/&lt;/code&gt;, tests in &lt;code&gt;spec/&lt;/code&gt;.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Why Swain Exists</title>
    <link href="https://steeringthemachine.com/posts/steering-the-machine/01-why-swain-exists/"/>
    <updated>2026-03-08T12:00:00.000Z</updated>
    <id>https://steeringthemachine.com/posts/steering-the-machine/01-why-swain-exists/</id>
    <content type="html">&lt;p&gt;AI coding agents have a fundamental problem: they&#39;re fast, stateless, and uncritical. They&#39;ll implement whatever you ask — or whatever they infer from incomplete context — with real credentials and real consequences.&lt;/p&gt;
&lt;p&gt;The result isn&#39;t immediate failure. It&#39;s slower: architectural drift that compounds silently. Decisions made in week one are forgotten by week twelve. Constraints get violated under speed pressure. The reasoning behind past choices evaporates when the conversation ends.&lt;/p&gt;
&lt;h2&gt;The core insight&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Artifacts are the single source of truth.&lt;/strong&gt; If it&#39;s not written down, it wasn&#39;t decided.&lt;/p&gt;
&lt;p&gt;Swain captures your decisions as markdown files in git: specs, ADRs, architectural constraints, component boundaries. These artifacts become the context that agents read before acting. When agents make decisions, those get captured too — visible for review, versioned for audit.&lt;/p&gt;
&lt;h2&gt;The methodology&lt;/h2&gt;
&lt;p&gt;Swain structures everything around a four-phase loop:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Intent → Execution → Evidence → Reconciliation&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Intent&lt;/strong&gt;: You declare what should be true. Human-authored specs, constraints, goals.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Execution&lt;/strong&gt;: Agents implement the intent. Swain provides the context but doesn&#39;t control how agents work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evidence&lt;/strong&gt;: Swain observes what actually happened — git history, test results, dependency graphs. Verifiable, not declared.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reconciliation&lt;/strong&gt;: Swain compares intent against evidence and surfaces divergence. Sometimes execution drifted. Sometimes intent was wrong.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The loop is continuous. Most often you fix the code. Sometimes you fix the decision.&lt;/p&gt;
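&lt;p&gt;A reconciliation pass can be sketched as a diff between the two halves of the loop. The field names and the Active status are illustrative, not swain&#39;s actual schema:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Hypothetical reconciliation pass: compare declared spec status (intent)
# against what the evidence shows shipped.
def divergences(specs, shipped):
    # A spec still marked Active whose work already shipped is stale intent;
    # shipped work with no spec at all is undocumented drift.
    stale = [s[&#39;id&#39;] for s in specs if s[&#39;status&#39;] == &#39;Active&#39; and s[&#39;id&#39;] in shipped]
    undocumented = sorted(shipped - {s[&#39;id&#39;] for s in specs})
    return stale, undocumented
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Either side can be wrong: stale intent means you fix the artifact, undocumented drift means you fix the code or write the decision down.&lt;/p&gt;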
&lt;h2&gt;What this gives you&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Session continuity&lt;/strong&gt; — Every session picks up where the last left off. No re-explaining context.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decision preservation&lt;/strong&gt; — What was decided lives in artifacts, not memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bounded risk&lt;/strong&gt; — Agents run in isolated worktrees. Mistakes have limited blast radius.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated integrity&lt;/strong&gt; — Downstream work is checked against prior decisions automatically.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Quick start&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;npx skills add cristoslc/swain
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then run &lt;code&gt;/swain-init&lt;/code&gt; in your first session and ask for the roadmap:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/swain-roadmap
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This shows active epics, decisions waiting on you, implementation-ready work, blockers, and GitHub issues — all in one view.&lt;/p&gt;
&lt;h2&gt;Who this is for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The operator&lt;/strong&gt;: A solo developer who makes decisions and delegates implementation to AI agents. Swain is your decision-support system. You steer; Swain keeps the record.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The agent&lt;/strong&gt;: Any AI that reads markdown. Swain provides alignment context and verifies outcomes. The agent runtime is interchangeable.&lt;/p&gt;
&lt;h2&gt;What&#39;s next&lt;/h2&gt;
&lt;p&gt;Swain is in early development — actively used but expect rough edges. The core is stable: design artifacts, task tracking, sync, and releases.&lt;/p&gt;
&lt;p&gt;Coming soon: drift detection, automated reconciliation reports, better GitHub integration.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Named for the swain in boat&lt;strong&gt;swain&lt;/strong&gt;, the officer who keeps the rigging tight.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/cristoslc/swain&quot;&gt;Read the documentation →&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  
</feed>
