Patterns of validation

This post covers some techniques to get agents to validate their work. We’ll cover why this is important, a variety of types of checks you can employ, and mechanisms you can use to enforce them.

The validation shift

There’s been a recent emphasis on giving agents tools to validate the code they write. This comes up with any kind of “long horizon” agent task:

The compiler that Anthropic built with Claude ended up needing a CI pipeline to keep tests passing.
OpenAI’s harness engineering post combines mechanical and agent-based checks to make sure agents are sticking to design principles.
Cursor’s cloud agents emphasize producing artifacts as proof that the agent accomplished its goals.
The ralph loop depends on some kind of “back pressure” mechanism to validate work.

The list could go on, but the concept is the same: don’t trust the agent to just do work correctly. Provide a way to check the work.

Why now?

It’s worth asking why this is becoming a hot topic now. A reductive but useful model of “AI capability” is how long agents can work on a task without a human intervening. For a long time, we were building scaffolding to connect tools, e.g., moving from copying and pasting between terminal output and ChatGPT to an agent in your IDE or terminal that could run code. This immediately increased how long agents could work on a task, since they could run the commands for you.

Now though, we’re seeing agents being deployed on hours- or days-long tasks, sometimes in parallel, without a human babysitting every Codex or Claude session. Codex and Claude Cowork even let you automate tasks like continual refactoring or test improvement.

With this rate of change, you really need a system to stop the agents from making a mess of things. This has always been true of large codebases, which need disciplined architectural patterns, tests to prevent regressions and cross-module breakage, a process for dealing with breaking changes, and so on. But the reality is that most codebases never get to that level of complexity. Writing linters to enforce a “layered architecture” design, like OpenAI did for their internal project, makes sense if you have thousands of developers on huge codebases; it’s an exercise in pedantry with a team of 3 people who are reviewing all the code by hand. Agents just let you hit this critical mass of complexity and conflict faster.

The rest of this post covers things I’ve found useful when building largish systems, including migrating services to a new language, building custom plugins for third-party systems, and building tools to orchestrate testing and merging agent-written code.

Techniques for validation

I’ve found a layered approach to validation to be the most effective: start with codebase-level guidelines, add instructions for longer-range tasks, and finish with automatically triggered behaviors.

I’ll talk about each below.

“Feedforward” instructions

The easiest place to start with validation is by prompting better. This is a feedforward style, where you try to prevent the agent from making mistakes in the first place. It’s easy to implement, but it has limits: eventually you want a check that always runs.

CLAUDE.md and AGENTS.md files

Start with the top-level agent instruction files and include directions on how to run tests and format the code. You can do this directly if you have a small set of instructions. But once a codebase gets large, your instruction files should act more like references for finding information. For example, instead of inline commands, point to a Makefile, or have a separate document for working in the codebase.

I try to keep my top-level AGENTS/CLAUDE.md files less than 100 lines, based on the approach in OpenAI’s harness engineering post. The file is a table of contents pointing to structured documentation to enable progressive disclosure.
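A line budget like this is easy to let slip, so it can be turned into a mechanical check of its own. Here is a minimal sketch, assuming the instruction files live at the repository root; the function name and the default file names are my own choices for illustration, not part of any tool.

```python
from pathlib import Path

MAX_LINES = 100  # assumed budget, taken from the "less than 100 lines" rule of thumb


def oversized_instruction_files(repo_root, names=("AGENTS.md", "CLAUDE.md"), max_lines=MAX_LINES):
    """Return (filename, line_count) for each top-level instruction file at or over the budget."""
    offenders = []
    for name in names:
        path = Path(repo_root) / name
        if not path.is_file():
            continue  # a repo may only have one of the two files
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count >= max_lines:
            offenders.append((name, line_count))
    return offenders
```

Running this in CI or a pre-commit step and failing when the list is non-empty keeps the top-level file a table of contents rather than a dumping ground.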

Task-specific instructions

For long-horizon tasks, there’s usually some plan in place. I personally use the Executable Plan style from OpenAI, because it keeps planning documents colocated with the codebase and explicitly allows tracking learnings and surprises as the agent works, kind of like a scratch pad. Agents can then mine those notes to improve plans and find places where they got stuck.

My workflow here is:

Create a plan with the agent (use whatever format you want here).
Have a separate implementation prompt, e.g., an IMPLEMENTATION.md file, that includes rules on validation.
Tell the agent to execute the plan in line with the IMPLEMENTATION.md file.

Mechanical checks

Feedforward control via prompts is a good starting point. It’s definitely worth the token cost to say “run the tests while you work”. But there are a few problems you hit:

Limits in the attention mechanism of models mean that sometimes the tests don’t run.
Reward hacking: models are instructed to make sure tests pass, so they will delete tests or write shallow tests.

The upshot is that you can’t always trust agents to run the validation, or to do a good job when they do. That means you can end up with totally broken code that you’ve wasted tokens and time to build. A feedback approach, where the agents are told what they did wrong, is a good second layer.

You can make agents always run tests, CVE scanners, or linters using hook-based approaches, such as the hooks in Claude Code and Cursor. Most agent harnesses have a similar hook or middleware-style approach. Chunk sidecars have the same pattern. Hooks fire a microbuild on every agent stop event and send the output back to the agent.
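To make the hook idea concrete, here is a minimal sketch of a script a stop hook could invoke. It assumes your harness treats a particular nonzero exit code as “blocked” and passes stderr back to the agent; Claude Code’s hook documentation describes exit code 2 this way, but check your own harness before relying on it. The `make test` and `make lint` commands and the function names are placeholders.

```python
import subprocess
import sys

# Hypothetical check commands; swap in your own test, lint, or scanner invocations.
CHECKS = [
    ["make", "test"],
    ["make", "lint"],
]

# Assumption: the harness reads this exit code as "blocked, show stderr to the agent".
BLOCKING_EXIT_CODE = 2


def run_checks(commands):
    """Run every command; return (all_passed, report), where report holds the failures' output."""
    failures = []
    for cmd in commands:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            failures.append(f"$ {' '.join(cmd)}\ncommand not found")
            continue
        if result.returncode != 0:
            failures.append(
                f"$ {' '.join(cmd)} (exit {result.returncode})\n{result.stdout}{result.stderr}"
            )
    return (not failures, "\n".join(failures))


def hook_exit_code(commands):
    """Print failure output to stderr and return the exit code the hook should use."""
    ok, report = run_checks(commands)
    if ok:
        return 0
    print(report, file=sys.stderr)
    return BLOCKING_EXIT_CODE
```

The hook entry point would then end with `sys.exit(hook_exit_code(CHECKS))`. Only failures are reported, which keeps the feedback the agent sees short.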

Hooks don’t have to just run commands though. Claude expressly supports one-shot prompt hooks as well as running entire subagents. Cursor also supports prompt-based hooks. This is a good way to bake in adversarial code review, or do “intent checking” for implementation against a spec. For testing specifically, you can track things like coverage regressions to detect reward hacking around tests.
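As a rough illustration of the coverage idea, the sketch below compares test statistics from before and after an agent’s change and flags the two cheapest signs of reward hacking: fewer tests and lower coverage. The `SuiteStats` type, the function name, and the half-point tolerance are all assumptions for the example; how you collect the numbers depends on your test runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteStats:
    test_count: int
    line_coverage: float  # percent, 0-100


def reward_hacking_signals(baseline, current, max_coverage_drop=0.5):
    """Return human-readable warnings when the suite shrank or coverage regressed."""
    signals = []
    if current.test_count < baseline.test_count:
        signals.append(
            f"test count fell from {baseline.test_count} to {current.test_count}"
        )
    drop = baseline.line_coverage - current.line_coverage
    if drop > max_coverage_drop:  # small tolerance so harmless noise doesn't trip the check
        signals.append(f"line coverage fell by {drop:.1f} points")
    return signals
```

A non-empty result is a reason to send the change back to the agent or to a reviewer. It won’t catch shallow tests that still execute the same lines, so it complements review rather than replacing it.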

Even if your agent doesn’t support hooks, you can always run validation scripts on a branch, either as part of a ralph-style script or by invoking them manually.
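A ralph-style script with validation as back pressure can be sketched in a few lines. This version takes the agent invocation and the validation step as plain callables, since the actual commands depend on your setup; `run_agent`, `validate`, and the iteration budget are hypothetical names and values.

```python
def validated_loop(run_agent, validate, task, max_iterations=5):
    """Alternate agent runs and validation until validation passes or the budget is spent.

    run_agent(prompt) does one unit of agent work; validate() returns (ok, feedback).
    Returns the iteration that passed, or None if every attempt failed.
    """
    feedback = ""
    for iteration in range(1, max_iterations + 1):
        prompt = task
        if feedback:
            # Feed the previous failure back so the next attempt can respond to it.
            prompt = f"{task}\n\nThe last attempt failed validation:\n{feedback}"
        run_agent(prompt)
        ok, feedback = validate()
        if ok:
            return iteration
    return None
```

The iteration cap matters: without it, a check that can never pass keeps the loop burning tokens indefinitely.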

Where is this going?

This post discussed why there’s a push for validation in agent workflows: increasing autonomy and task length. I also covered feedforward and feedback approaches to validation using prompt- and hook-based strategies.

But where does this all go? I think validation will be with us for a while, but the techniques will have to change.

The need to be clever with prompts will probably diminish over time. Models will get better at remembering to run your tests.

The feedback approach probably does stick around longer term, at least as long as human compliance systems require proof that certain steps were taken in the development of software, e.g., you have to prove you at least checked for CVEs in a lot of compliance policies.

But more generally, some of our deterministic checks are too strict for an agent-led world. Making an agent loop endlessly on a single failed test (that may very well be flaky) is probably not what you want to happen in an agent-driven codebase. Having strict testing at boundaries between systems does make sense though. Having agents check which other services or APIs a change could break, and use that to inform a decision, also makes a ton of sense. That invariably leads to a risk-scoring-based approach instead of a binary pass/fail.
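One way to picture risk scoring is to give each check a weight and discount failures from checks that are known to be flaky. The sketch below is purely illustrative: the weights, the flake-rate discount, and the thresholds are assumptions I made up for the example, not a recommended calibration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    weight: float  # how much a failure of this check matters (hypothetical scale)
    flake_rate: float = 0.0  # historical share of this check's failures that were flaky, 0-1


def risk_score(results):
    """Sum the weights of failed checks, discounting each by how often that check flakes."""
    return sum(r.weight * (1.0 - r.flake_rate) for r in results if not r.passed)


def decide(results, block_at=1.0, review_at=0.3):
    """Map the score to an action instead of a binary pass/fail."""
    score = risk_score(results)
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    return "proceed"
```

With these made-up numbers, a failing contract test at a service boundary (weight 1.0, never flaky) blocks the change on its own, while a single failure in a unit test that flakes 80% of the time lets the agent proceed.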

Some validation can also happen post development. For example, running smoke tests on a canary release is a well understood, but not universally implemented, approach to doing high-velocity releases. Feature flags are more common. Agents let you close the loop on managing the rollout and cleanup, but they create a whole experiment-management problem in the process. To get to a post-deploy validation world, we’ll have to invest much more heavily in observability than most teams do to give “deploy agents” visibility into whether a change is broken. Even further down the line, you can imagine connecting to a product analytics tool to see if a change delivered on business objectives.

In the next post in this series, we’ll talk about the infrastructure needed to make these validation techniques work.