Agentic validation needs different infrastructure

Previously, I described some core approaches to validating agent-written code: feedforward and feedback techniques. Feedforward techniques are about avoiding errors up front, for example by coming up with better prompts and planning strategies. Feedback gives agents a signal that they have actually achieved a task. Feedback is a key part of common agentic patterns like Ralph loops or the /goal commands in Codex and Claude Code: keep working until some known condition passes.
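
To make the "keep working until some known condition passes" pattern concrete, here is a minimal sketch in Python. The names feedback_loop, attempt_fix, and check are hypothetical and only illustrate the shape of the loop; this is not how any particular agent implements it.

```python
import subprocess
from typing import Callable, Sequence, Tuple


def run_check(command: Sequence[str]) -> Tuple[bool, str]:
    """Run one validation command and return (passed, combined output)."""
    result = subprocess.run(list(command), capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def feedback_loop(
    attempt_fix: Callable[[str], None],
    check: Callable[[], Tuple[bool, str]],
    max_iterations: int = 5,
) -> bool:
    """Alternate between an agent step and a check until the check passes.

    attempt_fix is a hypothetical stand-in for "let the agent edit code";
    it receives the output of the previous check (empty on the first turn).
    """
    feedback = ""
    for _ in range(max_iterations):
        attempt_fix(feedback)
        passed, feedback = check()  # the "known condition"
        if passed:
            return True
    return False  # budget exhausted without a passing check
```

In practice check would wrap something like run_check with your test command. The iteration cap is there so a stuck agent eventually hands control back instead of looping forever.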

In this post, I’ll talk about problems that come up when building agentic feedback loops and alternatives we’ve found effective at CircleCI.

Infrastructure is a bottleneck for validation

Your validation checks, be they linters, tests, or even other agents, have to run somewhere. In development, this usually happens on the engineer’s laptop, a remote VM or codespace, or possibly in a remote build system. The same is true when agents write your code. But agents create new problems for the typical local and remote testing approaches. They also create opportunities to test and validate code more efficiently and in an agent-first way.

Local validation as the first pass

Starting out, most teams do validation on the same machine the coding agent runs on. This approach has a lot going for it. Your developer probably already has some environment to work in. They can drive the agent from that environment and can jump into an IDE if they need to take over from the agent. Coding agents know how to run tests, linters, and build commands and figure out how to fix errors. It’s a pragmatic extension of what your engineers were already doing before AI.

But some issues start to pop up when you are running multiple agents for a long period of time:

It can be difficult to run multiple instances of a service on a single machine. Even with enough CPU and RAM, setting up full stacks that all run on localhost is a painful exercise.
Large validation suites (think Playwright tests) consume lots of resources, and this will eventually make your laptop crawl.
Local environments get “crufty” with engineer-specific settings and environmental quirks. One of the reasons for using clean CI environments to run tests is to eliminate cases where code “works on my machine” but not in a known good environment.
For most teams, local validation doesn’t “count” for approval purposes. Even if your agent does all the required checks, you still end up waiting on CI pipelines to clear merge approval checks.

All of these issues are fixable: you can make your agents run in containers on your machine to get reproducibility. Tools like Vercel’s portless can help with the localhost problem. But they usually require you to make changes to your application to make things work and you have to keep the setup consistent across developers.
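
To give a flavor of the localhost problem, here is a rough sketch of handing each agent its own set of ports so several stacks can coexist on one machine. The function and the agent and service names are hypothetical, and this is not how portless works. It only shows the bookkeeping you take on when every stack shares localhost.

```python
import socket
from typing import Dict, List


def allocate_ports(agent_ids: List[str], services: List[str]) -> Dict[str, Dict[str, int]]:
    """Reserve a distinct localhost port for every (agent, service) pair.

    Binding to port 0 asks the OS for any unused port. All sockets stay open
    until the whole plan is built, so no port is handed out twice.
    """
    held = []
    plan: Dict[str, Dict[str, int]] = {}
    try:
        for agent in agent_ids:
            plan[agent] = {}
            for service in services:
                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                held.append(sock)
                sock.bind(("127.0.0.1", 0))
                plan[agent][service] = sock.getsockname()[1]
    finally:
        # Caveat: once released, another process could grab a port before
        # your service binds to it. A real setup needs to handle that race.
        for sock in held:
            sock.close()
    return plan
```

Each service then has to be told which port to listen on and which ports its dependencies use, which is exactly the kind of application change mentioned above.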

The bigger problem with colocating the agent with the validation environment is the coupling it creates. When a developer has full access to the machine the agent runs on and is running multiple agents themselves, they can troubleshoot anytime the agent gets stuck. But agentic coding is increasingly happening in cloud-based agents via automations or background agent systems like Devin. Here environment management gets more complex. You can customize environments with init scripts or custom containers, but you’re ultimately at the mercy of what your agent infrastructure provider supports. As agents work longer on tasks, the chances of failure also go up. If your local disk fills up, or the memory is exhausted, or the agent just accidentally bricks the machine, everything crashes.

It’s much more flexible to make the validation environment a tool that is accessible to the agent and that you can fully customize instead of trying to configure one environment to rule them all. You can swap out the agent or the validation environment independently. You can even have agents hand off work by passing around the environment. Claude’s Managed Agent design or Deepagents pluggable sandboxes are examples of this approach.
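
One way to picture this decoupling is as a small interface the agent calls, with interchangeable environments behind it. The sketch below is an assumption-heavy illustration: ValidationEnvironment, LocalEnvironment, CannedEnvironment, and validate are names I made up, and none of them correspond to the Claude or Deepagents APIs.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class CheckResult:
    passed: bool
    output: str


class ValidationEnvironment(Protocol):
    """Anything that can run a command and report the result."""

    def run(self, command: Sequence[str]) -> CheckResult: ...


class LocalEnvironment:
    """Runs checks as subprocesses on the agent's own machine."""

    def run(self, command: Sequence[str]) -> CheckResult:
        proc = subprocess.run(list(command), capture_output=True, text=True)
        return CheckResult(proc.returncode == 0, proc.stdout + proc.stderr)


class CannedEnvironment:
    """Hypothetical stand-in for a remote sandbox; returns a fixed result."""

    def __init__(self, result: CheckResult) -> None:
        self.result = result

    def run(self, command: Sequence[str]) -> CheckResult:
        return self.result


def validate(env: ValidationEnvironment, commands: Sequence[Sequence[str]]) -> CheckResult:
    """Run checks in order against whichever environment was plugged in."""
    result = CheckResult(passed=True, output="")
    for command in commands:
        result = env.run(command)
        if not result.passed:
            break  # stop at the first failure so the agent gets feedback quickly
    return result


# Usage: the calling code stays the same when the environment changes.
local_result = validate(LocalEnvironment(), [[sys.executable, "-c", "print('ok')"]])
```

Because validate only depends on the run method, you can swap the agent or the environment without touching the other side.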

Use CI as the feedback source

At this point, many teams look around and realize they already have a remote, scalable, and reproducible environment: their CI pipeline. If you have to pass CI anyway, why not just have the agent push to CI and loop off of the feedback from CI? Codex and Claude Code both have skills to loop and babysit a PR until CI passes and reviewers are satisfied. Using CI has a lot going for it: you’re no longer limited by how many instances of your app can run, you know exactly what checks will run, and you have a built-in security barrier where your agent can’t directly access secrets.

This approach works! But like local validation, there are a lot of downsides. First, CI pipelines have some startup overhead. The downside of a “clean” environment is that you start from scratch each time. Next, many CI pipelines do more validation than your agent needs. Because CI functions as a gate, it becomes a magnet for every kind of check your org does: CVE checks, docs generation, performance testing. And these checks run every time, regardless of the code you actually changed. Like with fresh environments, this is the point of automation: you want the consistency to avoid breaking main, but it’s probably overkill for in-flight development.

Third, most CI pipelines are built to catch relatively rare errors. A common approach to setting up CI is to fan out your build steps so they run in parallel. Most CI systems, including CircleCI, don’t stop running fanned-out jobs when one fails, since you want to collect all the feedback so the developer can fix the build. Otherwise, a linting error could mask tests that also break. If you assume that most of the time things will pass, because your engineers definitely ran the tests locally before pushing, this cuts down on the wall clock time of a build. But as you develop, you’re likely to break tests, have code that fails linters, or do something that would break the build in CI. If you use the CI pipeline as your feedback source, you’re going to experience a lot of failures, but it can take a while to get feedback because your pipelines will still be running.
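
The tradeoff is easy to see with some back-of-the-envelope arithmetic. The sketch below compares how long a fanned-out pipeline takes to finish with how long a sequential, fail-fast run takes to surface the first failure. The job names and durations are made up for illustration and are not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    minutes: float
    passes: bool


def fan_out_minutes(jobs: List[Job]) -> float:
    """Parallel jobs with no cancellation: the pipeline ends with the slowest job."""
    return max(job.minutes for job in jobs)


def fail_fast_minutes(jobs: List[Job]) -> float:
    """Sequential jobs that stop at the first failure."""
    elapsed = 0.0
    for job in jobs:
        elapsed += job.minutes
        if not job.passes:
            break
    return elapsed


# Hypothetical pipeline where the cheap lint step fails.
broken = [Job("lint", 1.0, False), Job("unit", 4.0, True), Job("e2e", 15.0, True)]
# The same pipeline when everything passes.
healthy = [Job("lint", 1.0, True), Job("unit", 4.0, True), Job("e2e", 15.0, True)]

broken_fan_out = fan_out_minutes(broken)      # pipeline still runs the slow e2e job
broken_fail_fast = fail_fast_minutes(broken)  # stops right after lint
healthy_fan_out = fan_out_minutes(healthy)
healthy_sequential = fail_fast_minutes(healthy)
```

With these numbers, fan-out wins when everything passes (15 minutes versus 20), while fail-fast wins by a wide margin when an early check breaks (1 minute versus 15). Agents in the middle of a task spend most of their time in the second situation.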

Then there’s the problem of getting feedback to the agent. While CI systems can give your agents build output, it’s less immediate than reading a command’s output directly. You usually have to hop through APIs, reassemble logs, and then diagnose the problem. All of this spends tokens just to gather the requisite information.

Finally, there’s the cost issue. Agents working iteratively can drastically drive up CI usage, which increases costs.

As with local validation, you can change your CI process to make it a better development loop. Caching can help reduce dependency install times. If your build system supports it, you can do incremental builds. To prevent issues with high fan-out, you can create special pipelines that are more “pessimistic” and run a subset of checks sequentially or as a single job. These “dev” pipelines can give agents results in a single file to cut down on the tool calls and output that eat tokens. The downside is that you’re now bolting capabilities onto the CI pipeline that it wasn’t really designed for. For example, you have to be careful about how you set up triggers to avoid running the full pipelines.
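
As a rough sketch of what a “pessimistic” dev job could do, the script below runs checks one at a time, stops at the first failure, and writes everything the agent needs into a single JSON file. The function name, the report layout, and the example checks are all assumptions for illustration, not a CircleCI feature.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Mapping, Optional, Sequence

MAX_OUTPUT_CHARS = 4000  # assumption: keep only the tail of failing output


def run_dev_checks(
    checks: Mapping[str, Sequence[str]],
    report_path: Optional[Path] = None,
) -> Path:
    """Run checks sequentially, stop at the first failure, write one report file."""
    if report_path is None:
        fd, tmp_name = tempfile.mkstemp(prefix="dev-checks-", suffix=".json")
        os.close(fd)
        report_path = Path(tmp_name)

    report = {"passed": True, "checks": []}
    for name, command in checks.items():
        proc = subprocess.run(list(command), capture_output=True, text=True)
        entry = {"name": name, "passed": proc.returncode == 0}
        if not entry["passed"]:
            # Only failing checks carry output, to keep the report small.
            entry["output"] = (proc.stdout + proc.stderr)[-MAX_OUTPUT_CHARS:]
        report["checks"].append(entry)
        if not entry["passed"]:
            report["passed"] = False
            break  # fail fast: later checks are skipped

    report_path.write_text(json.dumps(report, indent=2))
    return report_path


# Hypothetical checks; swap in your real linter and test commands.
example_checks = {
    "lint": [sys.executable, "-c", "print('lint ok')"],
    "unit": [sys.executable, "-c", "import sys; print('1 test failed'); sys.exit(1)"],
    "e2e": [sys.executable, "-c", "print('never runs')"],
}
report_file = run_dev_checks(example_checks)
report = json.loads(report_file.read_text())
```

The agent then reads one small file instead of stitching together logs from several jobs.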

Enter Chunk sidecars + microbuilds

What you really want is the best of both worlds: remote testing environments, with agent-friendly feedback, and a way to attest that certain checks have already run. All without exploding CI costs.

Chunk sidecars are fast remote environments based on a Firecracker microVM runtime. The microVM approach gets us a lot in terms of speed, scalability, and cost effectiveness. Our microVMs run remotely, so you aren’t limited by resources on a single machine. The environment starts very quickly, in tens of milliseconds, which is faster than typical CI jobs. We can also suspend a microVM when it’s not in use and then resume it when your agent needs to test something. Instead of doing a full CI pipeline on fresh VMs and containers on every iteration, you can reuse the same instance for your entire session. This is a much closer approximation of the developer flow. We can also base microVMs off of a known snapshot, so you can do initialization and dependency installation periodically to keep your environment up to date. With the snapshot approach, if any given instance becomes unstable, you can trash it and start a new one from a known good point. This isn’t full hermetic reproducibility, but it is an acceptable tradeoff for agent development.

Chunk sidecars provide the environment; the microbuild provides the feedback. The idea behind a microbuild is pretty simple: run a distilled version of what you’d run in CI. This is similar to having special pipelines for in-flight work, but without having to mess with your CI system. When your agent runs chunk validate, the commands run directly on a Chunk sidecar. Instead of the agent having to poll a CI pipeline, the output is sent straight back to it, and it can iterate until tests pass. Your agent also has access to the Chunk sidecar itself, which is like having an always-on “debug with SSH” option. So your agent can do things like distinguish between legitimate failures in the code and problems with your environment, and of course suggest a fix.

By default, we use hooks to trigger microbuild runs on every agent stop event when there are changes in the git repo. The stop event is frequent enough to keep the agent from writing too much code before surfacing errors, while still leaving room for the agent to autonomously run some checks as it works. Chunk sidecars can also be created programmatically with our CLI, which means your agent could (for example) create many sidecars to shard out tests.
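
The trigger logic can be sketched roughly as follows: on a stop event, look for uncommitted changes and only then kick off validation. This is my own illustration of the idea, not the hook that ships with Chunk. The function names are hypothetical, and I am assuming chunk validate can be run without arguments from the repo root.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def changed_paths(porcelain: str) -> List[str]:
    """Parse `git status --porcelain` output into a list of paths.

    Each line is two status characters, a space, then the path.
    Renames show up as "old -> new" and are kept as-is here.
    """
    return [line[3:] for line in porcelain.splitlines() if len(line) > 3]


def read_git_status(repo: str) -> str:
    """Return the raw porcelain status for the repository."""
    proc = subprocess.run(
        ["git", "-C", repo, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout


def on_agent_stop(
    repo: str,
    validate_command: Sequence[str] = ("chunk", "validate"),  # assumed invocation
    status: Optional[str] = None,
) -> Optional[int]:
    """Run validation only when the working tree has changes.

    Returns the validation exit code, or None when validation was skipped.
    """
    if status is None:
        status = read_git_status(repo)
    if not changed_paths(status):
        print("no changes; skipping validation", file=sys.stderr)
        return None
    return subprocess.run(list(validate_command), cwd=repo).returncode
```

Skipping the run when nothing changed keeps stop events cheap when the agent was only reading code or answering a question.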

Microbuilds are also made to be incremental. One annoyance with a CI pipeline is having to create a commit, push, then wait for a result. With the microbuild we do incremental syncs (using git’s patch mechanism) from your local checkout to the remote Chunk sidecar. This further reduces latency for your agent.
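
For a sense of what a patch-based sync looks like, here is a simplified sketch that uses git diff and git apply between two checkouts on the same filesystem. It is not the Chunk implementation: the remote transport is replaced by a local path, the function names are hypothetical, and it assumes both checkouts sit on the same commit.

```python
import subprocess
from typing import List


def make_patch_command(local_repo: str) -> List[str]:
    """Diff the working tree against HEAD, including binary files.

    Note: untracked files are not part of this diff.
    """
    return ["git", "-C", local_repo, "diff", "HEAD", "--binary"]


def reset_command(remote_checkout: str) -> List[str]:
    """Return the target to a clean HEAD so the patch always applies cleanly."""
    return ["git", "-C", remote_checkout, "reset", "--hard", "HEAD"]


def apply_patch_command(remote_checkout: str) -> List[str]:
    """Apply a patch read from standard input ("-")."""
    return ["git", "-C", remote_checkout, "apply", "--whitespace=nowarn", "-"]


def sync_working_tree(local_repo: str, remote_checkout: str) -> int:
    """Copy uncommitted changes from local_repo to remote_checkout.

    Returns the patch size in bytes, which is usually tiny compared with
    pushing a commit and cloning the repository from scratch.
    """
    patch = subprocess.run(
        make_patch_command(local_repo), capture_output=True, check=True
    ).stdout
    subprocess.run(reset_command(remote_checkout), capture_output=True, check=True)
    if patch:  # git apply rejects empty input, so skip it when nothing changed
        subprocess.run(apply_patch_command(remote_checkout), input=patch, check=True)
    return len(patch)
```

Only the diff travels to the remote side, so the agent does not need to create a commit or push a branch to get its edits validated.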

Results

We’ve been using the Chunk sidecar and microbuild pattern to build out the Chunk sidecars at CircleCI.

So far, the results have been promising:

Token efficiency - agents identifying errors from microbuild output use tokens 3x more efficiently than when reading CI logs directly. We measured this by looking at the percentage of “useful” tokens: fetching CI logs requires additional tool calls, and the logs often contain output, such as installation logs, that doesn’t help fix a failure.
Cost effectiveness - the Chunk sidecar environments are 10-20x more cost effective per core than the CI pipeline. Most of this improvement comes from the sidecars having prewarmed environments, which cuts down on setup time, and from being able to fail fast instead of waiting for fan-out jobs to complete.
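
To illustrate what a “useful token” percentage could look like, here is a toy calculation. It is only a sketch of the idea and not the measurement we ran: the noise prefixes and sample log lines are made up, and whitespace-separated words are a crude proxy for real tokenizer output.

```python
from typing import Iterable, Sequence

# Hypothetical markers for lines that don't help diagnose a failure.
NOISE_PREFIXES = ("Collecting", "Downloading", "Installing")


def count_tokens(line: str) -> int:
    """Crude proxy: count whitespace-separated words instead of real tokens."""
    return len(line.split())


def useful_fraction(
    lines: Iterable[str],
    noise_prefixes: Sequence[str] = NOISE_PREFIXES,
) -> float:
    """Share of tokens on lines that are not recognized as noise."""
    total = 0
    useful = 0
    for line in lines:
        tokens = count_tokens(line)
        total += tokens
        if not line.startswith(tuple(noise_prefixes)):
            useful += tokens
    return useful / total if total else 0.0


# Made-up samples: a CI log with install chatter and a trimmed-down output.
ci_log = [
    "Collecting requests",
    "Downloading requests-2.0.tar.gz",
    "Installing collected packages",
    "FAILED test_login - AssertionError",
]
trimmed_output = ["FAILED test_login - AssertionError"]

ci_ratio = useful_fraction(ci_log)
trimmed_ratio = useful_fraction(trimmed_output)
```

A real comparison would also count the tokens spent on the extra tool calls needed to fetch the logs in the first place.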

Where is this going?

The current implementation of Chunk sidecars and microbuilds addresses an immediate problem: the traditional infrastructure we use to validate code is under pressure. We’ve dramatically sped up the process of writing code, but if you have thousands of Playwright tests, you need enough compute to run them. Multiply that by the number of agents you have, and the infra costs go up fast.

But looking beyond the immediate concerns, Chunk sidecars are part of a broader trend of agent-first infrastructure. For the last few years, we’ve been fitting AI development into tools built for humans: have the agent run on your laptop, have the agent monitor PRs and CI pipelines, and so on. But as the tooling has improved, and agents can now run for hours or days without human supervision, we’re seeing more purpose-built architectures for agents. We now give agents browsers. With Chunk sidecars, you can give an agent an additional computer to scale out work. You can imagine small agent-focused cloud environments as well.

Check out Chunk sidecars by installing the Chunk CLI and running chunk init in your project. You just need a CircleCI account to get started. Demo here.