Automate LLM evaluation testing with the CircleCI Evals orb
CircleCI supports running automatic large language model (LLM) evaluations and evaluation testing using your preferred GenAI application evaluation framework. By declaring the necessary commands in your pipeline configuration, you can run your LLM evaluations and test the results within your CircleCI pipeline.
With the CircleCI Evals orb, you can remove the manual work involved in triggering LLM evaluations, reviewing evaluation results, and determining whether the results meet set expectations.
To learn more about LLM evaluations, read our article on methods for testing LLM-enabled applications through evaluations.
Concepts
- LLM evaluations: A method to assess the performance and accuracy of AI models and algorithms. Evaluations measure metrics such as task fidelity, consistency, relevance and coherence, or tone and style.
- Evaluation results: The set of quantitative (often numeric) results output by running an evaluation.
- Metrics: One or more units of data used as inputs for testing.
- Evaluation test: A test that determines whether input metrics meet specified conditions.
- Test suite: A set of related test cases that are executed together. The test suite is a .json file with a name assigned by the user.
- Test case: A test scenario that checks a single concept, so it is clear what has failed when the test does not pass. Each test case has:
  - Name: The name assigned to the test case, usually descriptive of the scenario.
  - Assertion: A statement that checks whether a specified condition is true. If it is false, the test fails.
  - Result: Each test case has a result of pass or fail.
- Evaluation test result:
  - Failure indicates that a proposed change resulted in a degradation of model performance. The job stops running, and the pipeline fails.
  - Success indicates that model performance has met the set criteria, and the job continues to run.
Prerequisites
- A CircleCI account connected to your code. You can sign up for free.
- A CircleCI project with a workflow configured to build your code.
- CircleCI contexts relevant for your workflow, including credentials for tools such as pre-trained model providers like OpenAI.
- Existing LLM evaluations, built with an open source library or with third-party tools such as Braintrust or LangSmith.
1. Add the CircleCI Evals orb
Use CircleCI version 2.1 at the top of your .circleci/config.yml file.
version: 2.1
Add the orbs stanza below your version, invoking the orb:
orbs:
  evals: circleci/evals@2.0.0
2. Add and define an evaluation job
Define a new job that runs your LLM evaluations, saves the evaluation results, and tests them. You must include a step that saves your evaluation results in order to run evaluation testing.
jobs:
  eval-test:
    docker:
      - image: cimg/python:stable
    steps:
      # Add an evaluation step
      - run: <your eval step> # Run evals and output results for the evaluation test
      - evals/test: # Invoke evaluation test
          name: Run Evaluation Test
          assertions: <your path to .json evaluation test suite> # Replace with your path
          metrics: <your path to .json evaluation results> # Replace with your path
          results: <your path to store JUnit XML evaluation test results file> # Replace with your path
3. Create evaluation test suite file
- Create a .json file. You can choose the file location, but we suggest placing it in the .circleci directory of your code repository, as shown below.
- You can customize the filename. We suggest using eval-test.json, since this file represents the test suite for testing evaluation results.
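For example, following these suggestions, the test suite sits alongside your pipeline configuration:

.circleci/
├── config.yml
└── eval-test.json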
4. Define test cases
- In the .json file you just created, define your test cases.
- Each test case is composed of a name and an assertion.
- For test case names, we suggest assigning a name based on the test scenario. For example, use the name of a metric, such as correctness or toxicity.
- Assertions are expressed using the Common Expression Language (CEL). For more information, see the specification. All values from the results .json file are available as global variables. Try out CEL here.
Test case examples
Given metrics from eval-results.json:

{
  "correctness": 0.99,
  "toxicity": 0.2,
  "labels": [
    "CORRECT",
    "ACCURATE",
    "NOT HARMFUL"
  ]
}
Below are example test cases demonstrating Common Expression Language (CEL) usage. Each test case name is based on an input metric, such as correctness or toxicity.

{
  "correctness": "correctness > 0.9",
  "toxicity": "toxicity < 0.01",
  "labels": "labels[0] == \"CORRECT\""
}
Below is a template demonstrating that a test case is composed of a name and an assertion:

{
  "yourTestCaseName": "<your assertion>"
}
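Because assertions are standard CEL expressions evaluated against the metric values, a single test case can also combine several metrics. The following sketch is an assumption based on general CEL syntax (the quality-gate and label-count names are illustrative, not defined by the orb), using the eval-results.json metrics shown above:

{
  "quality-gate": "correctness > 0.9 && toxicity < 0.01",
  "label-count": "size(labels) == 3"
}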
5. Add your evaluation job to a workflow
Define a new workflow or use an existing one, and add your newly defined job to it. Under the job, specify the relevant contexts needed to run its tasks.
workflows:
  build-test-eval-workflow:
    jobs:
      - eval-test:
          context:
            - <your OpenAI context> # Replace with your context
6. Review results in CircleCI’s web app
Here is an overview of the information you can expect to see in CircleCI’s web app when running a workflow with an evaluation job.
Evaluation results
- Your evaluation step details can display a link to the results on your third-party LLM evaluations provider. If you need to review them, you can navigate to them directly.

Evaluation test results

- The step details display results for all assertions.
- The Tests tab surfaces failed evaluation tests.
Examples
Example pipeline configuration
Here is an example of a pipeline configuration set up with the CircleCI Evals orb and an evaluation job. In the following pipeline configuration example, the eval-test job will:

- Check out the project repository
- Run LLM evaluations and store the results
- Run evaluation testing
version: 2.1

orbs:
  evals: circleci/evals@2.0.0

jobs:
  eval-test:
    docker:
      - image: cimg/python:stable
    steps:
      - checkout # Checkout project repository
      - run: python evals.py > eval-results.json # Run evals and output results for the evaluation test
      - evals/test: # Invoke evaluation test
          name: Run evaluation test
          assertions: .circleci/eval-test.json # Path to evaluation test suite
          metrics: eval-results.json # Path to evaluation results file
          results: eval-test-results.xml # Path to stored test results

workflows:
  eval-test-workflow:
    jobs:
      - eval-test:
          context:
            - openai-4o
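The example above assumes an evals.py script that runs your evaluations and writes the resulting metrics as JSON to stdout. The orb does not provide this script; a minimal sketch, assuming you replace run_evaluations with calls to your own evaluation framework (an open source library, Braintrust, or LangSmith), might look like this:

# evals.py -- hypothetical example; the metric values shown are placeholders.
import json
import sys

def run_evaluations() -> dict:
    # Placeholder: compute metric scores for your model's outputs here.
    # The keys become the global variables available to CEL assertions.
    return {
        "correctness": 0.99,
        "toxicity": 0.2,
        "labels": ["CORRECT", "ACCURATE", "NOT HARMFUL"],
    }

if __name__ == "__main__":
    # Print the metrics as JSON so that `python evals.py > eval-results.json`
    # produces the file passed to the orb's metrics parameter.
    json.dump(run_evaluations(), sys.stdout, indent=2)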
Example evaluation results
Here is an example of an evaluation results file, eval-results.json, provided to CircleCI as metrics.

{
  "correctness": 0.99,
  "helpfulness": 0.95,
  "maliciousness": 0.95,
  "relevance": 0.98,
  "labels": [
    "CORRECT",
    "ACCURATE",
    "NOT HARMFUL"
  ]
}
Example evaluation test
Here is an example of an evaluation test suite. It includes five test cases:
- Correctness: Ensure the correctness metric is above an acceptable threshold, which is 0.9.
- Helpfulness: Ensure the helpfulness metric is above an acceptable threshold, which is 0.9.
- Maliciousness: Ensure the maliciousness metric is above an acceptable threshold, which is 0.9.
- Relevance: Ensure the relevance metric is above an acceptable threshold, which is 0.95.
- Labels: Ensure the first element in the labels array is equal to the specified string.
{
  "correctness": "correctness > 0.9",
  "helpfulness": "helpfulness > 0.9",
  "maliciousness": "maliciousness > 0.9",
  "relevance": "relevance > 0.95",
  "labels": "labels[0] == \"CORRECT\""
}