What Level 4 Actually Looks Like
How repeated failure under real pressure forced governance into an AI-driven development system, and why I stopped one level short of full automation.
I merged a feature that didn’t exist.
The files were there. The tests passed. The review looked clean. I had written a full functional specification, handed it to an agent, and received a report of success. The pipeline was green. I was managing several builds across a platform with 50+ microservices and 20,000+ automated tests. So I merged.
Days later, when I tried to integrate it, I discovered that the implementation was largely synthetic. The file names were correct. The tests looked plausible. The structure resembled a real integration. But nothing meaningful was wired. There was nothing to integrate.
That was the moment I stopped thinking about prompting and started thinking about governance.
A Framework for Orientation
Dan Shapiro recently proposed a maturity model for AI-assisted development, ranging from manual coding with autocomplete (Level 0) through task offloading, active pairing, and human-in-the-loop management, up to specification-driven development (Level 4) and the fully autonomous dark factory (Level 5). It is a useful map.
When I merged that synthetic feature, I thought I was at Level 4. I had written a specification. I had handed it to an agent. I was evaluating outcomes rather than writing code. By the framework, I should have been there.
I was at Level 2 with extra steps. What follows is the story of what it took to actually get there.
Plausibility Is Not Correctness
AI coding agents carry immense knowledge. They can reason across domains, generate structure quickly, and produce artifacts that look professional. But they will simulate compliance if you allow them to.
The agent that built my synthetic feature did not lack intelligence. It produced code that resembled an implementation, tests that appeared to pass, and a review that looked complete. It satisfied surface checks. It failed structural ones.
The problem was not the model. The problem was that my workflow allowed surface resemblance to pass as proof.
I was reading every line of output. Or I thought I was. But the volume outpaced my ability to verify, and the output was convincingly shaped. It felt like the work was done. It was not done.
Creative prompting does not solve this. A clever session does not survive the end of the session. When context evaporates, discipline must remain. And discipline cannot live in memory. It has to live in architecture.
A Familiar Pattern at a Different Speed
The experience felt familiar.
For twenty-five years I have architected enterprise systems and handed specifications to development teams. Some teams were exceptional. Some delivered exactly what was specified. Others delivered artifacts that looked correct but were structurally incomplete. A module that passed unit tests but silently skipped integration points. A dashboard that rendered beautifully but pulled from the wrong data source. A workflow engine that handled the happy path and nothing else.
The pattern was never about competence. It was about the absence of structural enforcement. Without typed contracts between stages, without mandatory review artifacts, without gates that block progression until acceptance criteria are met, the shortest path to apparent completion wins. Every time. Whether the builder is a senior team or a language model in a data centre.
Working with AI agents compresses that dynamic into hours instead of months. The same risks exist. They simply move faster.
The Bottleneck Moved
Once I started building enforcement architecture, the constraint shifted. The bottleneck was no longer implementation. It was specification.
A vague specification produces an incorrect implementation that looks correct. A specification that says “add analytics” gets you something. A specification that defines every metric, every data source, every error state, every interaction pattern, and every acceptance criterion gets you what you actually need.
I discovered that I moved faster by writing better specifications, not by having AI write more code. The investment shifted from implementation time to specification quality.
The quality of what comes out is bounded by the quality of what goes in. No amount of autonomous execution compensates for an incomplete specification.
Building the Enforcement Architecture
After that first failure, each subsequent one exposed a gap. Each gap became a boundary.
An entire frontend was effectively skipped during a build cycle. The review process had not enforced artifact completeness. Another time, I forgot to link the design guide into the build context. The result was a fully functional but completely off-brand interface. It had to be rebuilt.
These were not isolated incidents. They were the system telling me what was missing.
What emerged is a 10-stage state machine. Every feature flows through it: specification, technical design, test generation, codebase reconnaissance, implementation planning, build, code review, validation, merge, and pattern evolution. Each stage produces typed artifacts. Each stage has explicit entry and exit criteria. Advancement is gated. No step relies on conversational memory alone.
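The core of that machine is small. As a minimal sketch, assuming nothing about the real implementation beyond the stage names above and the rule that advancement requires a produced artifact plus satisfied exit criteria (the `Feature` and `advance` names are my own, hypothetical):

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Stage(Enum):
    SPECIFICATION = auto()
    TECHNICAL_DESIGN = auto()
    TEST_GENERATION = auto()
    RECONNAISSANCE = auto()
    IMPLEMENTATION_PLANNING = auto()
    BUILD = auto()
    CODE_REVIEW = auto()
    VALIDATION = auto()
    MERGE = auto()
    PATTERN_EVOLUTION = auto()

ORDER = list(Stage)  # definition order is pipeline order

@dataclass
class Feature:
    name: str
    stage: Stage = Stage.SPECIFICATION
    artifacts: dict = field(default_factory=dict)  # Stage -> produced artifact

def advance(feature: Feature, exit_criteria) -> Stage:
    """Move to the next stage only if the current stage produced an
    artifact and that artifact passes the stage's exit criteria."""
    artifact = feature.artifacts.get(feature.stage)
    if artifact is None:
        raise RuntimeError(f"{feature.stage.name}: no artifact produced")
    if not exit_criteria(feature.stage, artifact):
        raise RuntimeError(f"{feature.stage.name}: exit criteria not met")
    feature.stage = ORDER[ORDER.index(feature.stage) + 1]
    return feature.stage
```

The point is not the data structure. The point is that there is no code path to the next stage that bypasses the artifact check.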
Four of these stages run in full automation zones. Once a specification is approved, the pipeline generates a technical design, produces tests scored for quality, scans the codebase for integration patterns, builds an implementation plan with a coverage matrix proving every specification section maps to a build phase, and assembles a build prompt. All without me touching it.
The code review runs the same way. A multi-model council scores the implementation against the original specification. If the score is below threshold, a correction loop runs automatically until the 9.5/10 bar is met or the loop plateaus, at which point it escalates to me with a report of what is stuck and why.
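Read literally, that loop has three exits: the bar is met, the score plateaus, or another correction round runs. A sketch of the control flow, with `score_fn` standing in for the multi-model council and `correct_fn` for one correction round (both hypothetical; only the 9.5 threshold and the plateau escalation come from the pipeline as described):

```python
def review_loop(implementation, score_fn, correct_fn,
                threshold=9.5, patience=2):
    """Run correction rounds until the council score meets the bar,
    or escalate once the score has stopped improving for `patience` rounds."""
    history = []
    while True:
        score = score_fn(implementation)
        history.append(score)
        if score >= threshold:
            return "merged", history    # bar met: proceed toward merge
        if len(history) > patience and history[-1] <= history[-1 - patience]:
            return "escalate", history  # plateau: hand back to the human with the history
        implementation = correct_fn(implementation, score)
```

The score history doubles as the escalation report: it shows exactly where the loop got stuck.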
Every specification goes through a pattern registry check before any of this fires: does it address security, observability, multi-tenancy, internationalisation, branding? Does it specify component states, error handling, and edge cases? If not, the review council catches it before any code is written.
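The registry check can start as nothing more sophisticated than a required-concerns list. A deliberately naive sketch, where substring matching stands in for whatever semantic check the real council performs:

```python
REGISTRY_CONCERNS = [
    "security", "observability", "multi-tenancy", "internationalisation",
    "branding", "component states", "error handling", "edge cases",
]

def registry_gaps(spec_text: str) -> list:
    """Return every registry concern the specification never mentions.
    A real check would be semantic; keyword matching is a placeholder."""
    lower = spec_text.lower()
    return [concern for concern in REGISTRY_CONCERNS if concern not in lower]
```

Anything returned here blocks the specification before a single line of code is generated.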
What made this Level 4 was not that I stopped writing code. It was that I stopped managing the pipeline between specification and build.
Three Systems
The state machine does not run itself. Three systems support it.
The orchestrator runs inside terminal sessions with persistent memory. Composable skills chain into pipelines: specification review feeds technical design, which feeds test generation, which feeds implementation planning. Session state persists across days and weeks. When I open a new session, it loads context from the previous one without re-explanation.
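Mechanically, composable skills chaining into pipelines is function composition over artifacts. A sketch, with the skill functions themselves left abstract:

```python
def chain(*skills):
    """Compose skills into a pipeline: each skill consumes the previous
    skill's output artifact and produces the next one."""
    def pipeline(artifact):
        for skill in skills:
            artifact = skill(artifact)
        return artifact
    return pipeline
```

Because each skill takes and returns an artifact, any skill can be re-run, swapped, or inserted without rewiring the rest of the chain.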
The pipeline is the state machine itself, with four automation zones and four entry types. Not every feature starts at stage one. Remediations enter at reconnaissance. Hotfixes enter at build. Every human decision at a gate is recorded with rationale, timestamp, and authority. After twenty features pass through, the decision corpus becomes analysable: which patterns predict specification rework, which review scores plateau and why.
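The two load-bearing details here are the entry-type routing and the decision record. A sketch of both, with field names of my own choosing; only the rationale, timestamp, and authority triple, and the three entry routes named above, come from the pipeline as described (the fourth entry type is not named, so it is omitted rather than invented):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

ENTRY_STAGE = {                 # where each entry type joins the state machine
    "feature": "specification",
    "remediation": "reconnaissance",
    "hotfix": "build",
}

@dataclass(frozen=True)
class GateDecision:
    feature: str
    gate: str
    outcome: str                # e.g. "approve", "rework"
    rationale: str
    authority: str              # who made the call
    timestamp: str              # ISO 8601, UTC

def record_decision(log: list, feature: str, gate: str,
                    outcome: str, rationale: str, authority: str) -> GateDecision:
    """Append an immutable, timestamped decision record to the corpus."""
    decision = GateDecision(feature, gate, outcome, rationale, authority,
                            datetime.now(timezone.utc).isoformat())
    log.append(decision)
    return decision
```

Frozen records matter: an analysable corpus is only trustworthy if no later process can quietly rewrite a past decision.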
The async agent watches between sessions. A heuristic layer handles the obvious 80% at zero cost. A reasoning layer handles what heuristics cannot classify. A gateway layer enforces action safety that the reasoning layer cannot bypass. This separation prevents silent overreach and creates an audit trail of actions, not conversations.
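The property worth preserving in code is that the gateway check runs unconditionally: neither the heuristic layer nor the reasoning layer can emit an action that skips it. A sketch, with all three layer functions as hypothetical stand-ins:

```python
def handle_event(event, heuristics, reasoner, allowed_actions):
    """Route one event through the three layers. The heuristic layer
    answers cheaply when it can; the reasoner covers the rest; the
    gateway allow-list is checked last and cannot be bypassed."""
    action = heuristics(event)          # zero-cost rules for the obvious 80%
    if action is None:
        action = reasoner(event)        # model call only when the rules abstain
    if action not in allowed_actions:   # safety boundary, enforced unconditionally
        return ("blocked", action)      # recorded for the audit trail, never executed
    return ("executed", action)
```

The returned tuple is what gets audited: an action and its disposition, not a conversation.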
The daily briefing connects both systems. The async agent assembles it overnight. When I open a session that morning, the orchestrator already knows what the agent surfaced. They share state, not just data.
A Deliberate Ceiling
Shapiro’s Level 5 is the dark factory: fully automated systems that convert specifications into software without human involvement.
I am not pursuing it.
The pipeline already supports near-complete autonomy. Once local build infrastructure catches up, it will run end-to-end from specification approval through build to a Phase 0 stop, where the build agent presents its codebase analysis and asks for clarification before proceeding. That is the boundary I want: autonomous execution up to the point where design judgment is needed, then a handoff.
The specification stage is where I add the most value. It is exploratory, design-intensive work. Identifying which problem to solve. Determining the right abstraction. Balancing competing concerns across a multi-domain platform. Designing interaction patterns that serve both the immediate user need and the long-term architectural direction.
Automating that would mean automating the part of the process where human judgment is irreplaceable. The governance architecture exists specifically to ensure that the execution pipeline is trustworthy enough that I can invest my attention in specification design rather than build supervision.
Level 4 with a deliberate ceiling, not a limitation.
The Self-Improving Loop
The pipeline has a final stage that I have not seen discussed elsewhere: pattern evolution. Every few features, the lessons learned from completed builds feed back into the review criteria, the pattern registry, and the skill definitions.
The first version ran for less than a week before it needed revision. Incomplete implementations had escaped because the code review existed only as ephemeral conversation context, not as a persistent artifact. The second version added a mandatory review report with a specification coverage matrix. A different failure mode surfaced: test suites where 32% of tests were tautological, asserting only that mocks returned mocked values. That became a detection criterion.
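That tautology criterion is mechanically checkable. As an illustration of the idea rather than the pipeline's actual detector, one rough heuristic using Python's `ast` module: flag asserts whose expected constant is the same constant some mock's `return_value` was set to.

```python
import ast

def tautology_rate(test_source: str) -> float:
    """Fraction of asserts comparing against a constant that was also
    assigned to some mock's .return_value in the same file."""
    tree = ast.parse(test_source)
    mocked_constants = set()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign)
                and isinstance(node.targets[0], ast.Attribute)
                and node.targets[0].attr == "return_value"
                and isinstance(node.value, ast.Constant)):
            mocked_constants.add(node.value.value)
    asserts = [n for n in ast.walk(tree) if isinstance(n, ast.Assert)]
    if not asserts:
        return 0.0
    flagged = sum(
        1 for a in asserts
        if {c.value for c in ast.walk(a.test) if isinstance(c, ast.Constant)}
           & mocked_constants
    )
    return flagged / len(asserts)
```

A crude check like this misses plenty, but it turns "the tests look plausible" into a number a gate can act on.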
Each failure mode that the pipeline catches becomes a check that the pipeline enforces. The system that builds software is itself built by the feedback from building software.
This is the dynamic that makes the architecture operational rather than aspirational. Not a static framework applied once, but a living system that tightens under pressure.
Same Machine, Different Domain Data
What began as a personal system to ship faster became something else.
Stateful orchestration. Typed contracts between stages. Non-bypassable quality gates. Multi-model validation. An auditable decision trail. None of these is specific to software development. Together they form a decision discipline.
In enterprise environments, invisible failure is not an inconvenience. It is risk. Decisions must be inspectable. Processes must be auditable. Authority must be bounded. The same state machine that routes a feature from specification to merged pull request can route an enterprise decision from data ingestion to execution, with the same enforcement guarantees.
The difference is not the machine. It is the domain data.
Where This Stands
I am at the start of this, not the end. The systems run daily. They improve with each feature that passes through them. But they are months old, not years old. The patterns feel right. They have not survived every stress test yet.
I am documenting the journey now because architecture does not finish. It evolves under pressure. And the best insights come from others building in the same territory.