Tutorials · Oct 6, 2025 · 13 min read

Automated RAG pipeline evaluation and benchmarking with RAGAS

Muhammad Arham

Senior NLP Researcher

Retrieval-Augmented Generation (RAG) pipelines have become an integral part of how Large Language Models (LLMs) access information beyond their training cutoff. These pipelines enable LLMs to deliver current, accurate, and grounded responses. By fetching relevant external documents, RAG mitigates common LLM challenges like factual inaccuracies and hallucinations. However, this methodology introduces a new complexity: evaluating RAG pipeline performance is particularly challenging. Is an unsatisfactory answer due to a poor retrieval system fetching irrelevant context, or a sub-optimal generator failing to synthesize a coherent response from good context? Pinpointing these issues manually is a slow, subjective, and unscalable task.

To build reliable and trustworthy RAG applications, you need objective, quantifiable metrics. Traditional NLP metrics like the BLEU score (designed for machine translation) or the ROUGE score (designed for summarization) often fall short for RAG evaluation because they were built for different tasks and lack the contextual understanding needed to assess retrieval-generation alignment. This is where specialized tools like RAGAS come in. RAGAS provides a framework to measure critical aspects of RAG output quality, such as faithfulness (how well the answer aligns with the retrieved context) and context relevance (how pertinent the retrieved information is to the query). These metrics enable you to benchmark your RAG system’s performance, track improvements, and diagnose regressions effectively.

But a benchmark is only useful if it’s consistently applied. Integrating RAG evaluation into your CI/CD pipeline is key to maintaining quality as your RAG system evolves. By automating evaluations with tools like CircleCI, you can trigger performance checks with every code change, catching potential regressions before they impact users. This ensures the stability of your RAG pipeline over time, even as the codebase evolves.

This tutorial will guide you through setting up a simple RAG pipeline using LangChain for orchestration and FAISS for vector storage. You will use a real-world RAG benchmark dataset, databricks/dolly-15k, to populate your knowledge base. Then, you’ll integrate RAGAS for automated, metric-driven evaluation. Finally, you will see how to establish a CircleCI workflow to execute these evaluations automatically, providing continuous quality assurance. You will use API-based LLMs and embedding models from TogetherAI; these models are easy to use and scale, and you will learn how to securely manage their credentials within your CI environment.

Prerequisites

To follow this tutorial, you will need a GitHub account and a CircleCI account to run the automated pipeline. TogetherAI is used as the LLM provider; it offers $25 in free credits, which is enough to complete this guide.

With those accounts in place, you can set up the project itself.

Installing required packages and setting up codebase

To begin, you will need to set up your Python environment and install the required libraries. First, create a new directory for your project and move into it. Open your terminal and run:

mkdir rag-evaluation-pipeline
cd rag-evaluation-pipeline

Next, it is best practice to create a virtual environment to manage your project’s dependencies. This isolates your project’s packages from your system-wide Python installation.

To create a new virtual environment and activate it, open your terminal and run:

python3 -m venv venv 
source venv/bin/activate # On Windows, use `venv\Scripts\activate`

You will be using several external Python packages that provide essential functionality for building, evaluating, and automating the RAG pipeline in this tutorial.

These are the key packages you will need:

  • langchain: The core framework for building LLM applications. You will use it to orchestrate your RAG pipeline, connect to LLMs, and manage retrieval.
  • faiss-cpu: A library for efficient similarity search and clustering of dense vectors. It serves as your vector store for fast retrieval of relevant documents.
  • ragas: The specialized framework for evaluating RAG pipelines, providing metrics like faithfulness and context relevance.
  • datasets: Used to easily load and manage datasets from the Hugging Face Hub, such as the databricks/dolly-15k dataset you’ll utilize.

Create a requirements.txt file in your project root. Enter this content:

ragas==0.2.15
faiss-cpu==1.11.0
langchain==0.3.25
together==1.5.8
datasets==3.6.0
numpy==2.2.6
pandas==2.2.3
langchain-together==0.3.0

Install these dependencies using pip:

pip install -r requirements.txt

Setting up your API key

For your RAG pipeline to interact with TogetherAI’s models, you need to provide your API key. It is important that you never hard code API keys directly into your scripts or commit them to version control.

Instead, use environment variables.

Create a file named .env in your project’s root directory. Add your TogetherAI API key:

TOGETHER_API_KEY=<YOUR_TOGETHER_API_KEY_HERE>

Replace <YOUR_TOGETHER_API_KEY_HERE> with the actual API key you obtained from your TogetherAI account. If you don’t have one yet, you can obtain it from the TogetherAI settings. While you won’t directly load this .env file in your Python script (CircleCI handles environment variables differently), it is good practice for local development and a reminder of the required variable. Your Python code will read this variable from the environment.

Setting up the RAG evaluation dataset

For effective RAG pipeline evaluation, the quality and structure of your dataset are paramount. A suitable dataset for RAG evaluation typically requires three core components for each entry:

  • Query (Question): The input users will provide to your RAG pipeline.
  • Relevant Context(s): The factual documents from which the answer should be retrieved, vital for assessing your retriever.
  • Ground-Truth Answer: The verified correct response, used to evaluate your generator’s factual accuracy.

Manually creating these datasets is possible but very time-consuming. That’s why you’ll use the high-quality databricks/databricks-dolly-15k dataset from the Hugging Face Hub. This human-generated dataset is ideal because its closed_qa category perfectly aligns with your needs for this tutorial: it provides instruction (query), context (relevant context), and response (ground-truth answer).

To optimize for CI/CD efficiency and reduce API costs, you won’t use the entire dataset. Instead, you’ll sample a manageable portion using DOCUMENTS_SAMPLE_SIZE for populating the vector store. You’ll use QUERY_SAMPLE_SIZE for the actual RAG evaluations. These parameters, configurable in your main.py (and later, via CircleCI pipeline parameters), allow you to balance evaluation thoroughness with computational load. Feel free to adjust them for more comprehensive assessments.

Create rag_pipeline/dataloader.py. Add this code to load and process the dataset:

# File Name: rag_pipeline/dataloader.py

import typing as T
import os
from datasets import load_dataset, Dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

class DollyDataLoader:
    """Loads and preprocesses the Dolly dataset for RAG evaluation"""
    def __init__(
        self,
        dataset_path: str = "databricks/databricks-dolly-15k",
        category: str = "closed_qa",
        split: str = "train",
        sample_size: T.Union[int, None] = None
    ) -> None:
        self.dataset_path = dataset_path
        self.category = category
        self.sample_size = sample_size
        self.split = split
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

    def load_data(self) -> T.Tuple[T.List[Document], Dataset]:
        """
        Loads the Dolly dataset, filters by closed_qa category, samples, and prepares documents for the vector store.

        Returns:
            A tuple containing:
            - List[Document]: LangChain Document objects for the vector store.
            - Dataset: Sampled Hugging Face dataset for RAG queries and ground truths.
        """

        dataset = load_dataset(self.dataset_path, split=self.split)
        dataset = dataset.filter(
            lambda x: x['category'] == self.category and x['context'] is not None and x['context'].strip() != "",
            num_proc=os.cpu_count()
        )
        if self.sample_size:
            dataset = dataset.shuffle(seed=42).select(
                range(min(self.sample_size, len(dataset)))
            )

        documents_content = [item["context"] for item in dataset]
        langchain_documents = [Document(page_content=content) for content in documents_content]
        return self.text_splitter.split_documents(langchain_documents), dataset
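If you want to sanity-check the loader on its own before building the rest of the pipeline, a short script like this works (a sketch, assuming the file above is saved as rag_pipeline/dataloader.py and run from the project root):

# Quick local check of the data loader
from rag_pipeline.dataloader import DollyDataLoader

loader = DollyDataLoader(sample_size=20)  # small sample keeps the download and filtering fast
chunks, eval_dataset = loader.load_data()

print(f"Document chunks for the vector store: {len(chunks)}")
print(f"Evaluation rows (instruction/context/response): {len(eval_dataset)}")
print(eval_dataset[0]["instruction"])  # peek at one query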

Setting up your RAG pipeline

With your data ready, the next step is to construct the RAG pipeline itself. This involves two core components: a vector store for efficient document retrieval and a retrieval chain to integrate the LLM. For this, you will organize your code into rag_pipeline/model_provider.py (for model initialization) and rag_pipeline/pipeline.py (for the RAG orchestration).

First, define how you access your LLM and embedding models. These are handled by the LLMProvider and EmbeddingProvider classes. These classes abstract away the specifics of integrating with API-based models, such as TogetherAI, ensuring your core pipeline remains clean and model-agnostic.

Create rag_pipeline/model_provider.py and add the following code:

# File Name: rag_pipeline/model_provider.py

import os
from langchain_together import TogetherEmbeddings, ChatTogether
from langchain_core.embeddings import Embeddings
from langchain_core.language_models.chat_models import BaseChatModel

class LLMProvider:
    """Manages the initialization of the LLM for the RAG pipeline"""
    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def get_llm(self) -> BaseChatModel:
        if "TOGETHER_API_KEY" not in os.environ:
            raise ValueError("TOGETHER_API_KEY environment variable not set")
        return ChatTogether(model=self.model_name)

class EmbeddingProvider:
    """Manages the initialization of the Embedding model for vectorizing documents"""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def get_embedding_provider(self) -> Embeddings:
        if "TOGETHER_API_KEY" not in os.environ:
            raise ValueError("TOGETHER_API_KEY environment variable not set")
        return TogetherEmbeddings(model=self.model_name)
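Before moving on, you can verify both providers with a quick sketch like the one below. It assumes TOGETHER_API_KEY is exported in your shell and uses the same default model names as main.py later in this tutorial; note that each call is a small, billable API request.

# Minimal usage sketch of the provider classes
from rag_pipeline.model_provider import LLMProvider, EmbeddingProvider

llm = LLMProvider("meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo").get_llm()
embeddings = EmbeddingProvider("BAAI/bge-base-en-v1.5").get_embedding_provider()

print(llm.invoke("Reply with the single word: ready").content)  # quick connectivity check
print(len(embeddings.embed_query("hello world")))               # embedding dimensionality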

Next, the RAGPipeline class in rag_pipeline/pipeline.py orchestrates the entire process:

# File Name: rag_pipeline/pipeline.py

import typing as T
from tqdm import tqdm
from langchain_community.vectorstores import FAISS
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.embeddings import Embeddings
from langchain.docstore.document import Document
from datasets import Dataset

class RAGPipeline:
    def __init__(self, llm: BaseChatModel, embedding_provider: Embeddings) -> None:
        self.llm = llm
        self.embedding_provider = embedding_provider
        self.vectorstore: T.Optional[FAISS] = None
        self.qa_chain: T.Optional[RetrievalQA] = None

    def build_vector_store(self, documents: T.List[Document]) -> None:
        print("Embedding documents with FAISS vectorstore")
        self.vectorstore = FAISS.from_documents(documents, self.embedding_provider)

    def setup_qa_chain(self) -> None:
        if not self.vectorstore:
            raise ValueError("Vector store has not been built. Call build_vector_store(documents) first.")
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=self.vectorstore.as_retriever()
        )

    def run_queries(
        self,
        hf_dataset_sample: Dataset,
        query_sample_size: int = 30
    ) -> T.List[T.Dict[str, str]]:
        """
        Runs RAG queries against the pipeline and collects results for RAGAS evaluation.

        Args:
            hf_dataset_sample: The sampled Hugging Face dataset containing questions and ground truths.
            query_sample_size: The number of queries to run from the sampled dataset.

        Returns:
            A list of dictionaries, each containing 'question', 'answer', 'contexts', and 'ground_truth'.
        """
        if not self.qa_chain:
            self.setup_qa_chain()

        sampled_queries_dataset = hf_dataset_sample.shuffle(seed=42).select(
            range(min(query_sample_size, len(hf_dataset_sample)))
        )
        results = []
        for item in tqdm(sampled_queries_dataset, desc="Generating responses for queries"):
            query = item["instruction"]
            ground_truth = item["response"]

            # Generates response for a query
            response = self.qa_chain.invoke({"query": query})   
            answer = response["result"]

            # Fetches only the relevant documents from the knowledge base, used as the contexts for RAGAS
            retrieved_docs = self.qa_chain.retriever.invoke(query)  
            contexts = [doc.page_content for doc in retrieved_docs]

            results.append({
                "question": query,
                "answer": answer,
                "contexts": contexts,
                "ground_truth": ground_truth
            })
        return results

You will first load your documents and use the EmbeddingProvider to convert them into numerical vectors, which are then indexed by FAISS in build_vector_store. This FAISS index acts as your knowledge base.

Then, the setup_qa_chain method configures LangChain’s RetrievalQA chain, which creates an internal pipeline that retrieves relevant context from the vector store and provides the LLM with that context for generating an answer. When run_queries is called, the chain takes a user’s question, uses FAISS to find the most relevant document chunks, and feeds those chunks along with the question to the LLM (provided by LLMProvider) to generate an answer. This structured approach keeps the pipeline modular, with a clear separation of concerns.
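Tying the pieces together, here is a condensed, illustrative sketch of how the class is used (the model names are assumptions matching the defaults used later); the main.py script in a later section does the same thing with configurable parameters:

# End-to-end sketch: build the knowledge base, then answer a handful of queries
from rag_pipeline.dataloader import DollyDataLoader
from rag_pipeline.model_provider import LLMProvider, EmbeddingProvider
from rag_pipeline.pipeline import RAGPipeline

docs, eval_dataset = DollyDataLoader(sample_size=20).load_data()
pipeline = RAGPipeline(
    llm=LLMProvider("meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo").get_llm(),
    embedding_provider=EmbeddingProvider("BAAI/bge-base-en-v1.5").get_embedding_provider(),
)
pipeline.build_vector_store(docs)  # embed and index the document chunks in FAISS
pipeline.setup_qa_chain()          # wire the retriever and LLM into a RetrievalQA chain

results = pipeline.run_queries(eval_dataset, query_sample_size=3)
print(results[0]["question"], "->", results[0]["answer"])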

Evaluating RAG pipeline with RAGAS

Standard metrics often assess only the generative output, and don’t account for the crucial role of the retrieved context. This limitation makes it difficult to diagnose performance bottlenecks, whether they stem from ineffective retrieval or inaccurate generation. RAGAS (RAG Assessment) is a specialized framework designed to overcome these challenges by providing quantifiable, context-aware metrics for RAG systems.

Instead of relying on manual annotation or simplistic string matching, RAGAS follows the LLM-as-a-judge paradigm: a dedicated (and usually more capable) LLM, called the “evaluator LLM,” programmatically scores your RAG pipeline’s outputs.

This system provides RAG-specific diagnostics and metrics like:

  • Faithfulness: This metric quantifies the factual consistency between the generated answer and the retrieved context. A higher faithfulness score indicates that the RAG pipeline’s LLM is not introducing information unsupported by the provided source material, directly combating hallucination.
  • LLM context recall: This metric evaluates the completeness of the retrieved context with respect to the ground-truth answer. A high score signifies that your retriever is effectively identifying and providing all the necessary information required to derive the correct answer from the knowledge base.
  • Factual correctness: This metric assesses the factual accuracy of the generated answer by comparing it directly against the ground-truth answer. It provides an end-to-end measure of how accurate the final output of your RAG pipeline is.

These metrics collectively provide a balanced and actionable view of your RAG pipeline’s performance across both its retrieval and generation phases.
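To make this concrete, here is an illustrative evaluation record (the values are invented for demonstration), annotated with which fields each metric compares:

# One evaluation record and the fields each RAGAS metric looks at
record = {
    "question": "Who founded the company mentioned in the context?",
    "answer": "The company was founded by Jane Doe in 1998.",      # generated by your RAG pipeline
    "contexts": ["... Jane Doe founded the company in 1998 ..."],  # chunks returned by the retriever
    "ground_truth": "Jane Doe founded it in 1998.",                # reference answer from Dolly-15k
}
# Faithfulness        -> answer vs. contexts (is every claim supported by the retrieved text?)
# LLM context recall  -> contexts vs. ground_truth (did the retriever surface what is needed?)
# Factual correctness -> answer vs. ground_truth (is the final answer actually right?)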

Implementing the RAGAS evaluator

The RAGAS evaluation logic is encapsulated within the RAGASEvaluator class, located in rag_pipeline/evaluator.py.

This class receives the outputs from your RAG pipeline:

  • Questions
  • RAG-generated answers
  • Contexts retrieved by your vector store
  • The Dolly-15k ground truths

It then orchestrates their evaluation using the RAGAS framework.

Create rag_pipeline/evaluator.py file with this code:

# File Name: rag_pipeline/evaluator.py

import typing as T
import pandas as pd
from ragas import evaluate
from ragas.metrics import Faithfulness, FactualCorrectness, LLMContextRecall
from datasets import Dataset
from langchain_core.language_models.chat_models import BaseChatModel

class RAGASEvaluator:
    """Handles the evaluation of RAG pipeline results using RAGAS metrics"""
    def __init__(self, evaluator_llm: BaseChatModel) -> None:
        self.evaluator_llm = evaluator_llm
        # -- Using relevant RAG metrics: Change as required -- #
        self.metrics = [Faithfulness(), FactualCorrectness(), LLMContextRecall()]

    def evaluate_results(self, rag_results: T.List[T.Dict[str, str]]) -> pd.DataFrame:
        """
        Performs RAGAS evaluation on the collected RAG results.

        Args:
            rag_results: A list of dictionaries containing 'question', 'answer', 'contexts', and 'ground_truth'.

        Returns:
            A pandas DataFrame with the RAGAS evaluation scores.
        """
        data = {
            "question": [r["question"] for r in rag_results],
            "answer": [r["answer"] for r in rag_results],
            "contexts": [r["contexts"] for r in rag_results],
            "ground_truth": [r["ground_truth"] for r in rag_results],
        }
        dataset = Dataset.from_dict(data)

        print("Performing RAGAS evaluation...")
        result = evaluate(
            dataset=dataset,
            metrics=self.metrics,
            llm=self.evaluator_llm
        )
        return result.to_pandas()

The evaluate_results method transforms the collected RAG outputs into a Hugging Face Dataset object, the required input format for ragas.evaluate(). It then invokes ragas.evaluate(), passing in the prepared dataset, the specified metrics, and the dedicated evaluator_llm responsible for judging the quality. The method returns a pandas DataFrame, providing a clear tabular summary of the scores for each metric across all processed queries.
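As a usage sketch, evaluating the results collected by run_queries and summarizing the per-sample scores could look like this (assuming rag_results is the list returned by RAGPipeline.run_queries; main.py in the next section wires this up end to end):

# Score the collected RAG outputs and summarize the per-sample results
from rag_pipeline.evaluator import RAGASEvaluator
from rag_pipeline.model_provider import LLMProvider

evaluator_llm = LLMProvider("Qwen/Qwen2.5-7B-Instruct-Turbo").get_llm()
evaluator = RAGASEvaluator(evaluator_llm)

scores_df = evaluator.evaluate_results(rag_results)  # rag_results comes from RAGPipeline.run_queries
print(scores_df.head())                              # one row of metric scores per query
print(scores_df.mean(numeric_only=True))             # average each metric across all queries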

Orchestrating and testing RAG evaluation locally

The main.py file serves as the orchestrator, bringing together all the modular components: the data loader, the LLM and embedding providers, the RAG pipeline, and the RAGAS evaluator. It defines the workflow, instantiates the necessary classes, and executes the evaluation process end-to-end.

Create main.py in your project’s root directory and add this code:

# File Name: main.py

import os
import json

from rag_pipeline.model_provider import LLMProvider, EmbeddingProvider
from rag_pipeline.dataloader import DollyDataLoader
from rag_pipeline.pipeline import RAGPipeline
from rag_pipeline.evaluator import RAGASEvaluator

EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "BAAI/bge-base-en-v1.5")
LLM_MODEL_NAME = os.getenv("LLM_MODEL_NAME", "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo")
EVALUATOR_LLM_NAME = os.getenv("EVALUATOR_LLM_NAME", "Qwen/Qwen2.5-7B-Instruct-Turbo")

DOCUMENTS_SAMPLE_SIZE = int(os.getenv("DOCUMENTS_SAMPLE_SIZE", 100))
QUERY_SAMPLE_SIZE = int(os.getenv("QUERY_SAMPLE_SIZE", 5))

print(f"EMBEDDING_MODEL_NAME: {EMBEDDING_MODEL_NAME}")
print(f"LLM_MODEL_NAME: {LLM_MODEL_NAME}")
print(f"EVALUATOR_LLM_NAME: {EVALUATOR_LLM_NAME}")
print(f"DOCUMENTS_SAMPLE_SIZE: {DOCUMENTS_SAMPLE_SIZE}")
print(f"QUERY_SAMPLE_SIZE: {QUERY_SAMPLE_SIZE}\n")

def run_evaluation():
    # -- Load a partition of Dolly-15K dataset -- #
    dataloader = DollyDataLoader(sample_size=DOCUMENTS_SAMPLE_SIZE)
    documents_for_vectorstore, hf_dataset_for_queries = dataloader.load_data()
    if len(hf_dataset_for_queries) == 0:
        print("No samples found matching the criteria. Please check dataset name, category, and sample size.")
        exit(1)

    print(f"Number of documents loaded for vector store: {len(documents_for_vectorstore)}")
    print(f"Number of evaluation samples from HF dataset (before query sampling): {len(hf_dataset_for_queries)}")

    # -- Load Embedding and Chat LLM Models -- #
    embedding_provider = EmbeddingProvider(model_name=EMBEDDING_MODEL_NAME).get_embedding_provider()
    llm_provider = LLMProvider(model_name=LLM_MODEL_NAME).get_llm()
    evaluator_llm_provider = LLMProvider(model_name=EVALUATOR_LLM_NAME).get_llm()

    # -- Set up RAG pipeline -- #
    # -- Vectorizes and stores documents in the FAISS knowledge base, then sets up a RetrievalQA chain -- #
    # -- Given a query, it retrieves relevant chunks from the vector store and answers based on that context -- #
    rag_pipeline = RAGPipeline(llm_provider, embedding_provider)
    rag_pipeline.build_vector_store(documents_for_vectorstore)
    rag_pipeline.setup_qa_chain()

    print("Running queries from Dolly dataset and collecting data for RAGAS...")
    rag_results = rag_pipeline.run_queries(hf_dataset_for_queries, QUERY_SAMPLE_SIZE)

    with open("rag_results.json", "w") as _f:
        json.dump(rag_results, _f, indent=4)
        print("RAG query results saved to rag_results.json")

    # -- Evaluate generated responses with specific RAGAS metrics -- #
    ragas_evaluator = RAGASEvaluator(evaluator_llm_provider)
    evaluation_df = ragas_evaluator.evaluate_results(rag_results)

    # -- Ragas returns a score for each sample, use mean for overall -- #
    # -- The key values are auto-generated by RAGAS based on metric type -- #
    faithfulness_score = evaluation_df["faithfulness"].mean() 
    context_recall_score = evaluation_df["context_recall"].mean()
    factual_correctness_score = evaluation_df["factual_correctness(mode=f1)"].mean()

    print("\n--- RAGAS Evaluation Results ---")
    print(f"Average Faithfulness Score: {faithfulness_score:.2f}")
    print(f"Average Context Recall Score: {context_recall_score:.2f}")
    print(f"Average Factual Correctness Score: {factual_correctness_score:.2f}")

    evaluation_df.to_csv("ragas_results.csv", index=False)

if __name__ == "__main__":
    run_evaluation()

With the main.py file in place, you can now execute your RAG evaluation pipeline from your local machine. Ensure your virtual environment is active and that you have installed all dependencies as described earlier in this tutorial.

In the code, key operational parameters like EMBEDDING_MODEL_NAME, LLM_MODEL_NAME, DOCUMENTS_SAMPLE_SIZE, and QUERY_SAMPLE_SIZE are loaded from environment variables (with sensible defaults), allowing for flexible configuration. Your TOGETHER_API_KEY must be available in your shell’s environment. Remember that os.getenv() reads from the shell environment, not from the .env file.

On Unix-based systems, export your environment variables from the .env file:

set -a
source .env

Now, execute the main.py script:

python main.py

You will see progress messages as the dataset loads, the vector store is built, queries are processed, and RAGAS computes the evaluation metrics. Upon completion, your console output will show the RAGAS scores, similar to this:

EMBEDDING_MODEL_NAME: BAAI/bge-base-en-v1.5
LLM_MODEL_NAME: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
EVALUATOR_LLM_NAME: Qwen/Qwen2.5-7B-Instruct-Turbo
DOCUMENTS_SAMPLE_SIZE: 100
QUERY_SAMPLE_SIZE: 5

Number of documents loaded for vector store: 191
Number of evaluation samples from HF dataset (before query sampling): 100
Embedding documents with FAISS vectorstore
Running queries from Dolly dataset and collecting data for RAGAS...
Generating responses for queries: 100%|███████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.10s/it]
RAG query results saved to rag_results.json
Performing RAGAS evaluation...
Evaluating: 100%|███████████████████████████████████████████████████████████████████████| 15/15 [00:19<00:00,  1.31s/it]

--- RAGAS Evaluation Results ---
Average Faithfulness Score: 0.95
Average Context Recall Score: 0.93
Average Factual Correctness Score: 0.61

Automating evaluation with CircleCI

You have now built a simple but functional RAG pipeline and RAGAS evaluation system. The final, crucial step is to automate this process to ensure continuous quality and maintain confidence in your RAG application’s performance. Integrating your RAG evaluation into your Continuous Integration/Continuous Delivery (CI/CD) pipeline allows every code change—whether to your retrieval strategy, chunking method, or LLM prompting—to automatically trigger an evaluation. This proactive approach helps you catch performance regressions early, provides immediate feedback to developers, and facilitates reproducible benchmarking, ultimately leading to faster iterations and more reliable deployments.

To integrate your RAG evaluation into CircleCI, you’ll define your CI/CD workflow in a .circleci/config.yml file placed in the root of your GitHub repository. Here is the complete configuration:

# File Name: .circleci/config.yml

version: 2.1

parameters:
  embedding_model:
    type: string
    default: "BAAI/bge-base-en-v1.5"
    description: "Embedding model name for the RAG pipeline."
  llm_model:
    type: string
    default: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
    description: "LLM model name for the RAG pipeline."
  evaluator_llm_model:
    type: string
    default: "Qwen/Qwen2.5-7B-Instruct-Turbo"
    description: "LLM model name used by RAGAS for evaluation metrics."
  doc_sample_size:
    type: integer
    default: 100
    description: "Number of documents to sample from Dolly dataset for vector store."
  query_sample_size:
    type: integer
    default: 5
    description: "Number of queries to sample from Dolly dataset for evaluation."

jobs:
  rag_evaluation:
    docker:
      - image: cimg/python:3.10
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt
      - run:
          name: Run RAGAS Evaluation
          # Passes parameters to Python process execution environment. Overrides default values.
          command: |
            EMBEDDING_MODEL_NAME="<< pipeline.parameters.embedding_model >>" \
            LLM_MODEL_NAME="<< pipeline.parameters.llm_model >>" \
            EVALUATOR_LLM_NAME="<< pipeline.parameters.evaluator_llm_model >>" \
            DOCUMENTS_SAMPLE_SIZE="<< pipeline.parameters.doc_sample_size >>" \
            QUERY_SAMPLE_SIZE="<< pipeline.parameters.query_sample_size >>" \
            python3 main.py
      - store_artifacts:
          path: ragas_results.csv
          destination: ragas_evaluation_results
      - store_artifacts:
          path: rag_results.json
          destination: rag_raw_results

workflows:
  pipeline:
    jobs:
      - rag_evaluation

This config.yml defines your automated evaluation process. The parameters: section allows you to define variables like model names and sample sizes that can be easily adjusted when you trigger a workflow in the CircleCI UI or via its API, eliminating the need to modify the YAML file for each experiment. These parameters are then seamlessly passed as environment variables to your main.py script using the << pipeline.parameters.parameter_name >> syntax.

The jobs: section defines the core tasks, with rag_evaluation being your primary job. It specifies a docker image as the execution environment and a series of steps. These steps include checking out your code, installing all Python dependencies, and critically, running your main.py script. If main.py exits with a non-zero status code, the CircleCI job will automatically fail, providing immediate feedback. Finally, the store_artifacts steps ensure that your evaluation results (ragas_results.csv and rag_results.json) are saved and accessible directly from the CircleCI dashboard for review and historical tracking.

By pushing this config.yml to your GitHub repository, your automated RAG evaluation pipeline will continuously run, ensuring the quality and performance of your RAG system with every change.

Setting up your project on CircleCI

Before you can set up a project on CircleCI, you first need to upload your code to GitHub. Create a new file named .gitignore. This file defines files and folders that should not be pushed to GitHub. Add this content to the file:

venv/
*/__pycache__/*
rag_results.json
ragas_results.csv
.env

You can now push your code to a GitHub repository.

Next, log into your CircleCI account and create a new project. Before you can trigger the pipeline, you need to configure environment variables. On the CircleCI sidebar, select Projects, click the ellipsis in your project’s row, and select Project Settings.

Opening project settings

On the project settings page, select Environment Variables in the sidebar and add an environment variable with the key TOGETHER_API_KEY, assigning your TogetherAI API key as the value.

Adding an environment variable

You can now trigger the pipeline manually. It should execute successfully.

Successful execution

You can now navigate to the Artifacts tab for the completed job in CircleCI. You will find the generated files such as rag_results.json and ragas_results.csv there. These files are downloadable and can be used to periodically track your RAG performance.

Generated artifacts

You can access the full code for this project on GitHub.

Conclusion

As Retrieval-Augmented Generation (RAG) systems scale, integrating automated and reproducible evaluations becomes critical for maintaining reliability, accuracy, and user trust in these systems.

In this tutorial, you have learned how to establish a RAG pipeline using LangChain and FAISS, integrate it with datasets like Dolly-15k, and, most importantly, automate its continuous quality assurance with RAGAS and CircleCI. This automated setup empowers you to catch performance regressions early, enabling rapid iteration and confident deployment.

As a crucial next step, we highly recommend setting explicit RAGAS score thresholds within your main.py script and your CircleCI configuration. By doing so, your CI/CD pipeline can automatically fail if RAG performance drops below an acceptable baseline, acting as an essential quality gate (a minimal sketch of such a gate follows the list below). Automated evaluations are especially vital for advanced use cases:

  • Evaluating the grounding of enterprise document search tools.
  • Detecting potential hallucinations in critical applications like healthcare or legal assistants.
  • Continuously monitoring the performance of LLM-powered chatbots.
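Returning to the quality-gate idea, here is a minimal sketch of such a gate, meant to be appended to the end of run_evaluation() in main.py after the average scores are computed (indent it to match the function body). The threshold values and the MIN_* environment variable names are illustrative assumptions, not part of the code above; tune them to your own baseline.

# Hypothetical quality gate: fail the CI job when average scores drop below a baseline.
# Requires `import sys` at the top of main.py; `os` is already imported there.
MIN_FAITHFULNESS = float(os.getenv("MIN_FAITHFULNESS", 0.85))
MIN_CONTEXT_RECALL = float(os.getenv("MIN_CONTEXT_RECALL", 0.80))
MIN_FACTUAL_CORRECTNESS = float(os.getenv("MIN_FACTUAL_CORRECTNESS", 0.50))

failures = []
if faithfulness_score < MIN_FAITHFULNESS:
    failures.append(f"Faithfulness {faithfulness_score:.2f} < {MIN_FAITHFULNESS}")
if context_recall_score < MIN_CONTEXT_RECALL:
    failures.append(f"Context recall {context_recall_score:.2f} < {MIN_CONTEXT_RECALL}")
if factual_correctness_score < MIN_FACTUAL_CORRECTNESS:
    failures.append(f"Factual correctness {factual_correctness_score:.2f} < {MIN_FACTUAL_CORRECTNESS}")

if failures:
    print("RAGAS quality gate failed:")
    print("\n".join(failures))
    sys.exit(1)  # a non-zero exit code makes the CircleCI job fail
print("RAGAS quality gate passed.")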

Creating this kind of automation ensures your RAG systems remain robust and reliable with frequent changes at scale.


Muhammad Arham is a Deep Learning Engineer specializing in Computer Vision and Natural Language Processing, with experience developing globally top-ranking generative AI applications for the Google Play Store. He is passionate about constructing and optimizing machine learning models for intelligent systems, and firmly believes in continuous improvement within this rapidly evolving field.