The power of LLMs to solve real-world problems is undeniable, but unfortunately, in some cases, only theoretical. What’s stopping us from getting the most out of OpenAI’s text completion capabilities in production apps? One common problem is the inability to confidently guard against bad outputs in production the way we’re used to doing with non-AI test suites.

Let’s go one step deeper. There is no equivalent of code coverage for an LLM. There is also the fact that AI models are probabilistic, which means they can randomly and unpredictably produce different outputs for a given input. (If you’ve used APIs like OpenAI’s, you might have noticed that this can happen even with temperature set to 0.)

In this tutorial, you’ll learn about writing automated tests for LLM application components to help you confidently develop and release AI-powered applications. We’ll start with an existing application, experiment with a prompt change, and show that automated tests can be used as part of continuous integration (CI) to accept good changes and reject bad ones, similar to ordinary unit tests. We’ll use AIConfig to manage, define, and test our LLM application. AIConfig stores prompts and model settings in a JSON config, which simplifies prompt editing, application development, and testing.
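For orientation, an AIConfig file is JSON along these lines. This is an illustrative sketch, not the sample app’s actual config: the prompt name, template, and model settings here are hypothetical.

```json
{
  "name": "book_db_assistant",
  "schema_version": "latest",
  "prompts": [
    {
      "name": "function_call_prompt",
      "input": "{{user_query}}",
      "metadata": {
        "model": {
          "name": "gpt-3.5-turbo",
          "settings": { "temperature": 0 }
        }
      }
    }
  ]
}
```

Because prompts and model settings live in one versioned file, a prompt change shows up in code review and CI just like any other code change.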


To follow along with this tutorial, you’ll need the following tools:

  • A GitHub account and a CircleCI account connected to it
  • Python installed locally (the CI pipeline uses Python 3.12)
  • An OpenAI API key

You can access all the files used in this tutorial in our sample repository.

Our example LLM application

Our example is a command-line application for answering natural language questions about a book database.

App demo

The app uses LLMs for two steps: (1) to infer the correct database API call from the user query, and (2) to generate natural language output based on the API response and original user query. We’ll do a little prompt engineering to make the outputs more user-friendly, define a few quality guardrail metrics, and run automated tests over those metrics to prevent merging a bad prompt change.

The application works as follows:

  1. Takes a natural language query as input
  2. Uses LLM function calling to infer the correct book DB API function call
  3. Runs that function, then combines the resulting data with the original user query
  4. Passes the result of 3 into another LLM call to generate the final output
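The steps above can be sketched as a short pipeline. The function names and stubbed responses below are hypothetical stand-ins; the real app drives the two model calls through the AIConfig SDK.

```python
import json

def llm_choose_function(user_query: str) -> dict:
    # Step 2 (stubbed): the model picks a DB function and JSON arguments
    # via function calling.
    return {"name": "search", "arguments": json.dumps({"name": user_query})}

def search(name: str) -> list[dict]:
    # Step 3 (stubbed): the inferred call runs against the book DB.
    return [{"id": "isbn123", "name": name}]

def llm_summarize(user_query: str, api_result: list[dict]) -> str:
    # Step 4 (stubbed): the model turns the raw API data plus the original
    # query into a friendly answer.
    titles = ", ".join(book["name"] for book in api_result)
    return f"For {user_query!r}, I found: {titles}"

def answer(user_query: str) -> str:
    call = llm_choose_function(user_query)          # step 2
    data = search(**json.loads(call["arguments"]))  # step 3
    return llm_summarize(user_query, data)          # step 4

print(answer("To Kill a Mockingbird"))
```

Everything the tests need to control sits behind those two LLM calls, which is what makes the pipeline testable.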

We’ll configure the prompts in AIConfig format and use the AIConfig SDK to run our prompts in a few lines of code. We’ll start by writing test cases and success conditions for our LLM prompts using the AIConfig evaluation module. (You are writing your tests first, right?)

The AIConfig SDK (used in the app itself) and evaluation module run the configured prompts in exactly the same way, so we know that we’re testing the same LLM functionality that our app runs. We’ll specify success conditions as automated tests using pytest and run them as part of an automated CI pipeline on CircleCI.

You can review the tests in the test file in the sample repository. Here’s a quick rundown:

  • test_function_accuracy() tests the accuracy of function calls generated from user queries by comparing them to expected function calls, with a minimum accuracy threshold of 0.9.
  • test_book_db_api() tests various functions of the book database API by ensuring the results match expected values.
  • test_threshold_reasoning_string_match() tests for expected substrings in generated responses, with a minimum threshold of 0.4.
  • test_threshold_book_recall() tests whether the application correctly recalls book names from the database based on the criteria provided in the user query, with a minimum recall rate of 0.75.
  • test_e2e_correctness_1() and test_e2e_correctness_2() are end-to-end tests verifying the application’s overall response accuracy to specific queries about book popularity and listing books by genre.

Because the LLM output is unpredictable, we use custom metrics (is_correct_function and book_recall) to evaluate test outcomes against provided thresholds. By combining standard string-matching tests with more complex LLM evaluations, we ensure that the application not only functions correctly but also meets a high standard of accuracy and completeness in its responses.
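As a sketch of the thresholded style, here is what a recall-based test can look like. The metric and sample data below are illustrative, not the repo’s actual implementation.

```python
def book_recall(expected: set[str], generated: str) -> float:
    # Fraction of expected book names that appear in the model's output.
    hits = sum(1 for name in expected if name in generated)
    return hits / len(expected)

# Illustrative (query expectations, model output) pairs.
cases = [
    ({"Dune", "Hyperion"}, "You might enjoy Dune and Hyperion."),
    ({"Dune", "Hyperion"}, "You might enjoy Dune."),  # partial recall is OK
]

def test_threshold_book_recall():
    # The suite fails only if *aggregate* recall drops below the threshold,
    # so a single flaky output doesn't block a good change.
    avg = sum(book_recall(expected, out) for expected, out in cases) / len(cases)
    assert avg >= 0.75  # threshold, not an exact match

test_threshold_book_recall()
```

This is the key difference from ordinary unit tests: individual cases may fail, and the assertion is on an aggregate statistic rather than on every output.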

Project setup

To create the project on your local machine, fork the example application, navigate to the project directory, and install requirements:

pip install -r requirements.txt

Create a CircleCI Context

Our application requires an API key for OpenAI’s hosted models. You can securely store the API key using CircleCI’s Contexts feature. Here we set up a context for a single job. Please refer to CircleCI’s Guide for Using Contexts for more advanced usage.

In the CircleCI UI, go to Organization Settings > Contexts.

Add context

Then select Create a Context. In the context form, enter cci-last-mile-example as the name.

Create context

Next, you’ll need to add your API key to the context. Edit the cci-last-mile-example context and click Add Environment Variable. Enter: OPENAI_API_KEY as the name. Paste your OpenAI API key as the value.

Add env var

Click Add Environment Variable to finish adding your context variable.

Set up a CircleCI project

Now we’ll set up our project to build on CircleCI.

From the project dashboard in the CircleCI web app, find your forked repository and click Set Up Project. When prompted for a branch enter main and click Set Up Project.

Note that if you authenticated through the CircleCI GitHub App instead of through the legacy OAuth flow, you will need to follow slightly different steps to set up your project.

Our sample repository includes a CircleCI config file that defines the series of steps we want to automate in our CI workflow.

version: 2.1

orbs:
  python: circleci/python@2.1.1

# Define a job to be invoked later in a workflow.
jobs:
  build-app:
    docker:
      - image: cimg/python:3.12.1
    steps:
      - checkout
      - run:
          name: Install
          command: pip install -r requirements.txt
      - run:
          name: Run assistant evals.
          command: python -m pytest --junitxml results.xml
      - store_test_results:
          path: results.xml

# Orchestrate jobs using workflows
workflows:
  build-app-workflow:
    jobs:
      - build-app:
          context: cci-last-mile-example
In this file we define a simple workflow that contains just one job, build-app. This job uses CircleCI’s Python convenience image to check out our application code, install our dependencies, run our test file, and store the results of those tests for additional analysis.

Trigger a build for your project

With your project set up and config file in place, you can trigger a pipeline run by making a change in your project repository.

Create a feature branch for experimenting with prompt changes by running the following commands:

git checkout -b circleci-tests

git push origin circleci-tests

This will trigger a failing build on our function accuracy tests. You should see a failure like the one below.

CI fail

Notice that the test is failing because we expected a query for To Kill a Mockingbird to use the search function, and instead it used the get function.

To address this issue, we can tweak our prompt using the AIConfig editor to give the model more guidance for calling functions.

Improve LLM performance using the AI config file

To get an idea of how AIConfig works, you can start at the system prompt in the JSON file.

You can run the test suite locally with this command:

python -m pytest
This should fail with the same error you received in the CI pipeline. Using the information from your failing tests, you can update your application prompt for more accurate performance.

You can edit the configuration file using the AIConfig editor. Start it by running this command:

aiconfig edit --aiconfig-path book_db_function_calling.aiconfig.json

First, update the search function’s description to:

search queries books by their name and returns a list of book names and their ids.

This will give the model more context about how to interpret inputs as parameters when selecting the function.

Update search function description

Next update the get function description to:

get returns a book's detailed information based on the id of the book. Note that this does not accept names, and only IDs, which you can get by using search.

We’ll also change the parameter name to id to make it clearer how to interpret ISBN codes.
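Under the hood, those descriptions end up in the function definitions sent to the model. Roughly, the updated get function looks like the following. This is a hedged sketch of the OpenAI function-calling schema, not copied from the repo, and the parameter description is illustrative:

```json
{
  "name": "get",
  "description": "get returns a book's detailed information based on the id of the book. Note that this does not accept names, and only IDs, which you can get by using search.",
  "parameters": {
    "type": "object",
    "properties": {
      "id": {
        "type": "string",
        "description": "The ID of the book, such as an ISBN code"
      }
    },
    "required": ["id"]
  }
}
```

The model only sees these names and descriptions, so small wording changes here are the main lever for steering which function it picks.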

Change parameter to ID

Click Save in the editor to save your changes.

Next, update the function calling test to expect the new id parameter:


async def test_function_accuracy():
    test_pairs: list[tuple[dict[str, str], JSON]] = [
        (
            {"user_query": "ID isbn123"},
            {"arguments": '{\n  "id": "isbn123"\n}', "name": "get"},
        ),
        # ... other test pairs unchanged ...
    ]

Finally, because you changed the function parameter in the AI config, you also need to update the get function in the application code to use the id parameter instead of the book parameter.

def call_function(name: str, args: str) -> Book | None | list[Book]:
    args_dict = json.loads(args)
    match name:
        case "list":
            return list_by_genre(args_dict["genre"])
        case "search":
            return search(args_dict["name"])
        case "get":
            return get(args_dict["id"])
        case _:
            raise ValueError(f"Unknown function: {name}")

Validate your changes

You can test out the change locally by running the same command as before:

python -m pytest
The tests should all pass now.

Local tests pass

If you want to double check, you can also manually run the app and give it the queries used in the test cases. You’ll get debug logging in the console as well as additional AIConfig library logs in aiconfig.log.

Finally, commit your changes and push to the feature branch to run the full CI pipeline.

git add .

git commit -m "Fix failing test."

git push

These tests should pass.

CI success

If you were able to get everything working, congrats! Tightly constraining the behavior of an AI/machine learning (ML) model is not easy, especially as you add more test cases. This is why we used thresholded metrics instead of requiring that every test case passes. In this way, language models can be evaluated a little more like other ML models: using well-understood metrics like accuracy for a classifier, or recall for a retriever. Ideally, we achieve a combination of the best of traditional software testing and ML model evaluation.

Conveniently, we can configure all of our automated tests and evaluations using the same tools (namely, pytest and CircleCI). This lets us monitor for individual failing test cases in addition to the thresholded LLM statistics we set, without needing new surfaces or processes.


In this tutorial, we used automated testing and CI procedures similar to traditional unit tests to control the quality of our LLM-powered question-answering app. We saw how a weak prompt in our AIConfig regressed output quality, let the CI tests catch it, and then fixed the prompt so the tests passed. Like a regular unit test does for procedural code, this prevents a bad prompt change from merging.

Using this general approach, we hope you’ll feel more confident making changes to your LLM applications, or adopting them for the first time if you haven’t yet. It should be pleasant, not painful, to experiment with your prompts and other model configuration without worrying about breaking prod. If you want to dive deeper, head on over to the AIConfig website or GitHub.