Fix flaky tests in your sleep with Chunk

A test fails. You rerun it and it passes. You shrug and move on.

This is how most teams deal with flaky tests. The “rerun until green” approach works in the moment, and rerunning only the failed tests is a useful way to confirm whether a failure is real. But reruns don’t fix the underlying issue. Over time, they burn CI resources and can hide real instability in your code.

On the other hand, fixing flaky tests can mean hours of work. It’s tedious debugging that no one wants to prioritize.

What if that work could happen automatically? What if your CI system could analyze failures, pinpoint the cause, and prepare verified fixes while you sleep?

With Chunk, the autonomous validation agent from CircleCI, that’s exactly what happens. Chunk analyzes your test history, identifies flaky tests, determines their root causes, and opens PRs with working fixes. It’s automated maintenance that keeps your pipeline reliable while you focus on shipping great code (or finally getting some well-deserved rest).

This guide walks you through the full setup: enabling Chunk, configuring its schedule, preparing your test environment, and refining results over time. By the end, you’ll have Chunk running on autopilot, quietly stabilizing your test suite in the background.

What you’ll need

Before you start, make sure you have:

A CircleCI account
A project running tests in CircleCI with the store_test_results step configured (this is how CircleCI identifies flaky tests)
An API key from OpenAI or Anthropic

Chunk uses a bring-your-own-key (BYOK) model, so your code stays with your chosen provider and never touches CircleCI’s systems.
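
The store_test_results prerequisite matters because flaky-test detection depends on a history of per-test outcomes. As a rough illustration, here is a minimal Python sketch of one common definition of flakiness: a test counts as flaky if it both passed and failed on the same commit. The definition, the RunRecord shape, and the find_flaky_tests helper are assumptions made for this example; they are not CircleCI's implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class RunRecord(NamedTuple):
    """One recorded test outcome (hypothetical record shape)."""
    name: str
    commit: str
    passed: bool


def find_flaky_tests(runs: list[RunRecord]) -> set[str]:
    """Return names of tests that both passed and failed on the same commit."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for run in runs:
        outcomes[(run.name, run.commit)].add(run.passed)
    # A test is flagged only if one commit saw both a pass and a failure.
    return {name for (name, _commit), seen in outcomes.items() if seen == {True, False}}


history = [
    RunRecord("test_login", "abc123", False),
    RunRecord("test_login", "abc123", True),   # same commit, different outcome
    RunRecord("test_signup", "abc123", False),
    RunRecord("test_signup", "def456", True),  # outcome changed along with the code
]
print(find_flaky_tests(history))  # {'test_login'}
```

Note that test_signup is not flagged: its result changed between commits, which points to a code change rather than flakiness.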

Getting Chunk up and running

Before Chunk can start fixing flaky tests, you need to enable it and connect the services it needs. The setup flow in CircleCI will guide you through these steps:

Turn on Chunk: Head to the CircleCI web app, navigate to your organization, and find Chunk Tasks in the sidebar. Click Get Started and then Continue when prompted.
Verify the GitHub App: You should see a passed icon indicating the GitHub App is already installed for your organization. If not, use the Install CircleCI GitHub App button to install it (you will need admin privileges to complete this step). If you’re currently using OAuth, you can safely install the GitHub App without disrupting any existing pipelines or needing to migrate anything.
Select your AI Model provider: Choose either Anthropic or OpenAI.
Enter your API key: Add your API key for your chosen model provider.
Click Next: Completing the setup creates a context called circleci-agents where your API key is stored.

Assigning a task

Once Chunk is set up, you’ll assign your first “Fix flaky tests” task. This is where you configure when Chunk runs and how it operates.

Chunk task setup

First, select the project you want Chunk to analyze. The project should already be running in CircleCI with an active test suite and the store_test_results step configured.

Then configure your schedule and operational limits:

Run frequency — Choose when Chunk analyzes your tests:

Daily: Runs Sunday through Thursday at 22:00 UTC — perfect for overnight fixes during the workweek
Weekly: Runs Sunday at 22:00 UTC (this is the default)
Monthly: Runs on the first of each month at 22:00 UTC for minimal disruption
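
If you want to know when the next run will land in your own time zone, the schedule above can be sketched in a few lines of Python. This is an illustration of the three options as described here, not Chunk's scheduler; the next_run function and its frequency strings are hypothetical names.

```python
from datetime import datetime, time, timedelta, timezone

RUN_TIME_UTC = time(22, 0)
# Python's weekday(): Monday=0 ... Sunday=6
DAILY_WEEKDAYS = {6, 0, 1, 2, 3}  # Sunday through Thursday
WEEKLY_WEEKDAYS = {6}             # Sunday


def next_run(now: datetime, frequency: str) -> datetime:
    """Return the first scheduled run strictly after `now` (must be timezone-aware)."""
    if frequency not in {"daily", "weekly", "monthly"}:
        raise ValueError(f"unknown frequency: {frequency}")
    now = now.astimezone(timezone.utc)
    candidate = datetime.combine(now.date(), RUN_TIME_UTC, tzinfo=timezone.utc)
    if candidate <= now:
        candidate += timedelta(days=1)
    # Step forward one day at a time until the day matches the schedule.
    while True:
        if frequency == "daily" and candidate.weekday() in DAILY_WEEKDAYS:
            return candidate
        if frequency == "weekly" and candidate.weekday() in WEEKLY_WEEKDAYS:
            return candidate
        if frequency == "monthly" and candidate.day == 1:
            return candidate
        candidate += timedelta(days=1)


friday_noon = datetime(2025, 1, 3, 12, 0, tzinfo=timezone.utc)
print(next_run(friday_noon, "daily"))  # 2025-01-05 22:00:00+00:00
```

Calling .astimezone() with no arguments on the result converts it to your machine's local time zone.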

Operational limits — Control how much work Chunk takes on:

Maximum tests to fix per run: How many flaky tests Chunk tackles in a single run. Lower this if you want to start slow and see how Chunk performs, or keep it higher to work through your backlog faster.
Number of solutions to try per test: How many different fix approaches Chunk attempts if the first one doesn’t work. Start with 1 to keep things simple, and increase it if Chunk’s initial fixes aren’t resolving the flakiness.
Number of validation runs per test: How many times Chunk runs each test to verify the fix is stable. More runs mean higher confidence the flakiness is truly resolved, but also longer processing time.
Maximum number of concurrent open PRs: How many pull requests Chunk can have open at once. Dial this down if your team gets overwhelmed reviewing multiple PRs, or keep it unlimited to move quickly through fixes.
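
To get a feel for the validation-runs setting, a back-of-the-envelope calculation helps. The sketch below assumes a test that is still flaky fails independently on each run with a fixed probability; under that assumption, the chance it passes every validation run by luck is (1 - p)^n. This is illustrative arithmetic only, not a description of how Chunk scores fixes, and both function names are hypothetical.

```python
import math


def false_pass_probability(failure_rate: float, validation_runs: int) -> float:
    """Chance that a still-flaky test passes every validation run by luck."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1 (exclusive)")
    return (1.0 - failure_rate) ** validation_runs


def runs_needed(failure_rate: float, max_false_pass: float) -> int:
    """Smallest number of runs that keeps the false-pass chance at or below the target."""
    if not 0.0 < failure_rate < 1.0 or not 0.0 < max_false_pass < 1.0:
        raise ValueError("both arguments must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(max_false_pass) / math.log(1.0 - failure_rate))


# A test that fails 10% of the time still passes 10 runs in a row fairly often.
print(round(false_pass_probability(0.1, 10), 3))  # 0.349
# Runs needed to push that chance below 5%.
print(runs_needed(0.1, 0.05))  # 29
```

The takeaway is that tests which fail rarely need many more validation runs than tests which fail often before a clean streak means much.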

The right settings depend on your team’s review capacity and how large your flaky test backlog is. If you’re not sure where to start, the defaults work well for most teams. You can always adjust these numbers based on results.

Click Start Task to complete the setup. Chunk starts running immediately and follows the schedule you configured.

Setting up your test environment

Chunk runs best when it knows exactly how your tests should start up. Defining your environment gives it the context it needs to install dependencies, connect services, and execute your tests consistently. You can provide those details in a .circleci/cci-agent-setup.yml file on your default branch.

The file needs a workflow (you can name it anything) with a single job named cci-agent-setup. Here’s what this looks like for a Python project with Postgres:

version: 2.1

workflows:
  main:
    jobs:
      - cci-agent-setup

jobs:
  cci-agent-setup:
    docker:
      - image: cimg/python:3.12
      - image: cimg/postgres:15.3
    steps:
      - checkout
      - run:
          name: Install Dependencies
          command: |
            pip install -r requirements.txt
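      - run:
          name: Setup Database
          # Service containers are reached over TCP on localhost, so pass -h.
          # Assumes the psql client is available in the primary image.
          command: |
            psql -h localhost -U postgres -c "CREATE DATABASE test_db;"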

Focus on environment preparation only: installing dependencies, setting up databases, configuring services. Don’t include the actual test execution commands. Chunk figures those out by analyzing your main CircleCI config and any instruction files you provide. The cci-agent-setup.yml configuration supports orbs and custom resource classes.

You can test your environment setup by navigating to Organization Settings → Chunk Tasks, clicking the three-dot menu, and selecting Chunk Environment. Run your cci-agent-setup.yml file on a branch to verify it works, iterate on any issues, then merge to your default branch once it’s working.

(Optional) Documenting your test conventions

Every team has unique testing patterns, preferred utilities, or code style conventions. Chunk can learn from these if you document them. Providing this context helps ensure that its proposed fixes align with your standards.

You can add this guidance in one of the following ways:

Create a claude.md or agents.md file in your repository root with general instructions for running tests.
Create a .circleci/fix-flaky-test.md file for flaky-test-specific guidance.

Here’s an example .circleci/fix-flaky-test.md file:

## Command restrictions

- You MUST NOT use the `sleep()` command or `setTimeout()` for delays in any scripts
- You MUST NOT use `eval()` as it poses security risks
- Avoid using shell wildcards in destructive operations (e.g., `rm -rf *`)

## Code style preferences

- Prefer functional components over class components in React
- Use TypeScript `type` definitions instead of `interface` (this project enforces this via ESLint)
- Favor explicit error handling over try-catch-all patterns
- Use async/await syntax over Promise chains for readability

## Security considerations

- Always flag use of `dangerouslySetInnerHTML` in React components
- Highlight any potential SQL injection vulnerabilities
- Point out hardcoded credentials or API keys
- Flag any use of `eval()` or `Function()` constructors

## Documentation standards

- Complex algorithms MUST include explanatory comments

Chunk will automatically detect these files and apply their rules when generating fixes. If you skip this step, it will still produce working patches, but they may not fully match your team’s preferred patterns.

Reviewing Chunk’s work

When Chunk identifies and fixes a flaky test, it opens a pull request with the proposed fix — complete with a clear explanation of what caused the issue, what was changed, and how the fix was verified.

Each PR includes:

Run summary: A concise overview of the issue Chunk detected and the fix it generated.
Root cause: A plain-language explanation of what made the test flaky (for example, non-deterministic timers, random data, or overly strict assertions).
Proposed fix: The specific code changes Chunk made to stabilize the test.
Verification: A summary of the test reruns showing whether the flake was eliminated.

When available, the PR may also include evidence from past CI runs or failure logs that support the analysis.
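
To make the “non-deterministic timers” root cause concrete, here is a small hypothetical example of the kind of change such a fix might involve. The greeting function and its test are invented for illustration and do not come from a real Chunk PR; the technique shown is plain clock injection.

```python
from datetime import datetime, timezone
from typing import Callable


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


def greeting(clock: Callable[[], datetime] = utc_now) -> str:
    """Return a greeting based on the current UTC hour."""
    return "Good morning" if clock().hour < 12 else "Good afternoon"


# Flaky version (kept as a comment so it never runs):
#     assert greeting() == "Good morning"   # passes only before 12:00 UTC


def test_greeting_stable() -> None:
    # Stable version: the test pins the clock to a fixed instant.
    fixed = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)
    assert greeting(clock=lambda: fixed) == "Good morning"


test_greeting_stable()
```

The production code keeps its default clock, so only the test changes behavior. That keeps the diff small and easy to review.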

You can also keep track of Chunk’s work directly in the CircleCI web app. The Chunk Tasks page provides a real-time view of every test the agent has analyzed, showing its current status and result.

Chunk task page

Each row represents a test under analysis, with the project name, completion time, and whether a pull request was opened. This makes it easy to see at a glance which tests have been fixed and which ones still need attention.

Clicking any task opens a detailed report.

Chunk details

The Run Summary section at the top consolidates the most important details: the detected root cause, the applied fix, and verification results from reruns. Below that, you’ll find two tabs:

Code diff: Shows the exact code changes Chunk proposed, side-by-side with your original test. This view makes it easy to review or merge the fix directly from the CircleCI UI.
Logs: Provides the agent’s full reasoning, analysis steps, and validation output. Use this tab to understand how Chunk reached its conclusions, especially if a fix didn’t fully verify or if you’re debugging a failed attempt.

Even when no PR is created (for example, if Chunk couldn’t confidently verify a fix), the results still appear in the dashboard. You can review the reasoning, adjust your configuration, and rerun the analysis later.

(Optional) Running ad hoc tasks

Sometimes you might want Chunk to handle a one-off change outside its regular maintenance cycle, such as cleaning up deprecated code or standardizing parts of your test suite.

In these cases, you can run an ad hoc task. Navigate to Organization Settings → Chunk Settings, select the three-dot menu, and click Submit Ad Hoc Task. Choose an existing branch and describe the task you want Chunk to complete (for example, “update all deprecated API calls to the latest client library”). Chunk will make the change and push it to that branch.

Ad hoc tasks use the same environment defined in your .circleci/cci-agent-setup.yml file, so they run with the same context and dependencies as your scheduled maintenance runs. You can review the results in the Chunk Tasks dashboard alongside your flaky test fixes.

Improving over time

After your first scheduled run, check the Chunk Tasks page to see what Chunk processed. The logs show where it succeeded and where it ran into issues. Use this feedback to make your configuration more effective.

If Chunk had trouble running tests, add missing dependencies to cci-agent-setup.yml or clarify test commands in your instruction files. If fixes don’t match your team’s patterns, add more specific guidance to your documentation files.

As you review Chunk’s PRs, look for patterns. Repeated timing issues might mean you need better async handling patterns documented. Frequent random data problems might call for test data factories. These patterns reveal opportunities to prevent new flaky tests from being introduced.
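
If random data turns out to be a recurring cause, a test data factory is one way to address it. The sketch below is a minimal, hypothetical example (the User and UserFactory names are invented): it draws values from a privately seeded random generator, so every run builds the same data while tests can still override the fields they care about.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    age: int


class UserFactory:
    """Builds reproducible test users from a fixed seed (hypothetical helper)."""

    def __init__(self, seed: int = 1234) -> None:
        # A private generator leaves the global random state untouched.
        self._rng = random.Random(seed)
        self._count = 0

    def build(self, **overrides) -> User:
        self._count += 1
        fields = {"name": f"user{self._count}", "age": self._rng.randint(18, 90)}
        fields.update(overrides)  # tests pin only the fields they assert on
        return User(**fields)


factory = UserFactory(seed=42)
adult = factory.build(age=30)
assert adult.age == 30
# Two factories with the same seed produce identical users.
assert UserFactory(seed=7).build() == UserFactory(seed=7).build()
```

Documenting a helper like this in your instruction files gives Chunk a preferred pattern to reach for instead of ad hoc random values.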

Once you’re comfortable with Chunk’s output, adjust your operational limits based on your team’s review capacity. More concurrent PRs means faster progress. If PRs pile up too quickly, dial back the concurrency.

What’s next

Once Chunk runs reliably on schedule, it becomes infrastructure. Your team stops thinking about flaky test debugging as developer work and starts thinking about it as automated maintenance that happens in the background.

No more “rerun until green” or late-night debugging. With Chunk fixing flaky tests in the background, you can start your day with verified PRs waiting for review.

Ready to get started? Sign up for early access to Chunk. For troubleshooting, updates, and community support as you’re setting up, check the CircleCI Community Forum discussion on Chunk Tasks.