Tutorials · Jan 5, 2026 · 12 min read

Automating LLM application deployment with BentoML and CircleCI

Muhammad Arham

Senior NLP Researcher

Shipping application code, especially for LLM-based applications, can be a stressful and complex task. These applications demand intricate model management, careful resource allocation, and manual handling of dependency conflicts. Traditionally, preparing such applications for deployment involves integration tests, containerization, and updating image registries: all time-consuming manual steps.

This is where an automated CI pipeline becomes invaluable. Streamlining the deployment process frees developers from the hassle of fixing deployment issues, allowing them to concentrate on core application and business logic. This tutorial will show you how to achieve this level of efficiency using BentoML and CircleCI.

BentoML is an open-source framework to simplify packaging, containerizing, and serving ML applications. It provides an easy-to-use interface that accelerates application release cycles and reduces manual interventions by handling Python environments, allocating resources, and building Docker images. It is a reliable and consistent way to package your application and serve it to your users. Automating the build process with CircleCI pipelines seamlessly integrates your latest application code from version control, automatically updating the deployment registry.

In this tutorial, you will set up a simple LLM-based chat endpoint using BentoML and CircleCI. This example will show you how to implement a similar pipeline to minimize manual deployment efforts so that you can focus on application development.

Prerequisites

For this tutorial, you will need:

  • Python 3.10 or later installed locally
  • An OpenAI API key
  • Docker installed locally and a Docker Hub account
  • A GitHub account
  • A CircleCI account

Setting up your Python project

Start by setting up your local development environment:

  • Create a dedicated project directory
  • Establish a virtual environment to keep your dependencies organized
  • Install libraries
  • Securely configure your OpenAI API key

Using your terminal, create a new project directory and go into it:

mkdir llm-application
cd llm-application

As a best practice, we recommend creating a Python virtual environment. This isolates your project’s dependencies from your system’s global Python installation, which prevents potential conflicts.

For Unix-based systems, run these commands:

python3 -m venv venv
source venv/bin/activate
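
On Windows PowerShell, the equivalent commands are:

python -m venv venv
.\venv\Scripts\Activate.ps1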

Next, define and install your project’s dependencies. Create a requirements.txt file in your project’s root directory. Add this content:

bentoml==1.4.17
openai==1.93.0
pytest==8.4.1
uv==0.7.19
setuptools>=71.0.2,<80.9

Use pip to install these dependencies:

pip install -r requirements.txt

To securely manage your OpenAI API key, create a .env file directly in your project’s root directory. This file will store your sensitive credentials, preventing them from being accidentally committed to your version control system.

OPENAI_API_KEY=<YOUR_API_KEY_HERE>
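
Nothing in the project reads this file automatically; later in the tutorial you will export it from your shell before running the service. If you would rather load it from Python, here is a minimal sketch using the python-dotenv package (an extra dependency, not listed in requirements.txt):

# Optional alternative to exporting .env from the shell.
# Requires `pip install python-dotenv`, which is not in requirements.txt.
from dotenv import load_dotenv

load_dotenv()  # reads .env and populates os.environ, including OPENAI_API_KEY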

Once these files are set up, create a tests directory containing an empty __init__.py file. This tells Python to treat the directory as a package, allowing proper imports between your service module and your test files.

On Linux or macOS, execute these commands:

mkdir tests
touch tests/__init__.py

On Windows PowerShell, execute these commands:

mkdir tests
New-Item -ItemType File -Path tests\__init__.py -Force  

With your environment ready and dependencies installed, you can start building.

Building your LLM service with BentoML

The most important part of a deployment pipeline is the service you are deploying. The service is where you define your application logic, validate input and output, and expose it as an API. Traditionally, you would reach for something like Flask or FastAPI to write HTTP endpoints, then write a separate Dockerfile to containerize your application. That works, but it means maintaining a lot of boilerplate and reinventing the wheel each time. BentoML eliminates this problem because it is designed specifically for deploying machine learning and LLM applications.

Next, define your LLM service using bentoml. Create a new service.py file and add this code:

# File Name: service.py

import os
import bentoml
from pydantic import BaseModel, StringConstraints
from typing import Annotated
from openai import OpenAI

class PromptRequest(BaseModel):
    prompt: Annotated[
        str,
        StringConstraints(min_length=1, max_length=1000)
    ]

class CompletionResponse(BaseModel):
    completion: str

bentoml_image = bentoml.images.Image(
    python_version="3.10",
).requirements_file("./requirements.txt")

@bentoml.service(
    image=bentoml_image,
)
class LLMService:
    def __init__(self):
        self.client = OpenAI()

    @bentoml.api(input_spec=PromptRequest, output_spec=CompletionResponse)
    def generate(self, prompt: str) -> CompletionResponse:
        response = self.client.chat.completions.create(
            model=os.getenv("MODEL_NAME", "gpt-3.5-turbo"),
            messages=[{"role": "user", "content": prompt}]
        )
        completion_text = response.choices[0].message.content.strip()
        return CompletionResponse(completion=completion_text)

Here’s an explanation:

  • The service.py file defines the LLMService class, which wraps the OpenAI API.
  • The PromptRequest and CompletionResponse definitions are Pydantic models that act as schemas for the incoming request and the outgoing response (see the validation sketch after this list).
    • PromptRequest guarantees that every incoming request carries a validated prompt string of bounded length.
    • CompletionResponse guarantees that the service always returns a completion string.
    • Modeling the data this way gives you validation, serialization, and a predictable API contract.
  • The bentoml_image configuration is a critical element that instructs BentoML on the construction of the Docker image for the service.
    • It specifies python_version="3.10" and references the requirements.txt file. This ensures that all requisite Python libraries, including BentoML and the OpenAI client, are bundled into the final Docker image, resulting in a self-contained and portable application artifact.
  • The @bentoml.service(image=bentoml_image) decorator designates the LLMService class as a BentoML service, directing BentoML to utilize the specified image configuration for packaging and containerization.
  • Within this service, the __init__ method initializes the OpenAI client, which serves as the interface to OpenAI’s language models.
  • The generate method, annotated with @bentoml.api(input_spec=PromptRequest, output_spec=CompletionResponse), is central to the service’s functionality. This decorator transforms the Python function into an API endpoint. BentoML automates the underlying web server configuration, routing, and HTTP request/response handling.
    • The input_spec and output_spec ensure that incoming data is validated against the PromptRequest schema and that the outgoing response conforms to CompletionResponse, mitigating common API errors.
    • The generate method’s logic involves processing the user’s prompt, submitting it to the OpenAI API, extracting the generated text, and returning it as a CompletionResponse object.
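
As a quick illustration of that validation, here is a short sketch (separate from the service code) showing how PromptRequest rejects an empty prompt:

# Quick schema check; run separately, for example in a Python REPL.
from pydantic import ValidationError
from service import PromptRequest

PromptRequest(prompt="hello")      # valid: within the 1-1000 character bounds

try:
    PromptRequest(prompt="")       # invalid: violates min_length=1
except ValidationError as err:
    print(err)                     # reports which constraint failed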

You can refer to the BentoML documentation for additional parameters and build options. It also supports async task queues, model management, and parallel request handling, among other advanced features.

Running your BentoML service locally

Once you have written your service.py file, you can run your service locally right away using the BentoML CLI. First, export your environment variables from the .env file by running:

set -a 
source .env

Use the BentoML CLI to execute the service file:

bentoml serve

This starts a server on http://localhost:3000 from the service.py file, and adds logging and debugging features. If the server is started correctly, your terminal should display something similar to this log:

2025-07-06T01:50:05+0500 [INFO] [cli] Loading service from default location 'service.py'
2025-07-06T01:50:06+0500 [INFO] [cli] Loading service from default location 'service.py'
2025-07-06T01:50:06+0500 [INFO] [cli] Starting production HTTP BentoServer from "." listening on http://localhost:3000 (Press CTRL+C to quit)
2025-07-06T01:50:07+0500 [INFO] [:1] Loading service from default location 'service.py'
2025-07-06T01:50:07+0500 [INFO] [:1] Service LLMService initialized

You can send HTTP requests to your service using curl or Postman. To test the service, send a request to the /generate endpoint. Use this curl command:

curl --location 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "hello"
}'

You should expect a response similar to:

{
    "completion": "Hello! How can I assist you today?"
}
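
You can also call the endpoint from Python. Here is a minimal sketch using BentoML's HTTP client, assuming bentoml.SyncHTTPClient is available in the BentoML version pinned above (a plain HTTP library would work just as well):

# Minimal Python client sketch; assumes the service is running on localhost:3000.
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    result = client.generate(prompt="hello")
    print(result)  # the completion returned by the service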

Testing your LLM application

Before deploying your LLM application, you need to make sure it behaves as expected. Automated testing is an essential part of any CI/CD pipeline, helping you catch errors early and maintain confidence as your code evolves.

Because BentoML services are Python classes, testing them is straightforward. You can write unit tests that directly call your service methods, bypassing the HTTP layer. You aren’t required to start a server to run tests, which keeps testing fast and focused.

Set up a simple unit-test for your application. Create a new tests/test_service.py file, and add this code:

# File Name: tests/test_service.py

from service import LLMService, CompletionResponse

def test_generate():
    service = LLMService()
    result = service.generate("Say hello")
    assert isinstance(result, CompletionResponse)
    assert len(result.completion) > 0
    assert "hello" in result.completion.lower()

From the project root directory, you can execute the tests by running:

pytest -p no:warnings

This executes the tests within the tests directory, and displays this output:

===== test session starts =====
platform darwin -- Python 3.10.18, pytest-8.4.1, pluggy-1.6.0
rootdir: /Users/muhammadarham/Drive/CircleCIBlogs/BentoMLDeployment
plugins: anyio-4.9.0
collected 1 item

tests/test_service.py .                                                                          [100%]

====== 1 passed in 2.82s =====
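
Note that this test calls the real OpenAI API, so it needs OPENAI_API_KEY exported and consumes tokens on every run. If you want an offline variant, here is a hedged sketch that stubs out the OpenAI client with unittest.mock; this mock-based test is an addition, not part of the tutorial's test suite:

# File Name: tests/test_service_mocked.py (optional, offline variant)

from unittest.mock import MagicMock, patch

from service import LLMService, CompletionResponse

def test_generate_mocked():
    # Build a fake OpenAI response object with the attributes generate() reads.
    fake_response = MagicMock()
    fake_response.choices[0].message.content = " Hello there! "

    # Replace the OpenAI client so no real API call is made.
    with patch("service.OpenAI") as mock_openai:
        mock_openai.return_value.chat.completions.create.return_value = fake_response
        service = LLMService()
        result = service.generate("Say hello")

    assert isinstance(result, CompletionResponse)
    assert result.completion == "Hello there!"  # .strip() removes the padding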

Building and containerizing your application

Once your service is defined, tested, and running locally, the next step is to package it for deployment. In BentoML, this process is split into two simple but powerful steps: build and containerize.

Execute this command:

bentoml build

BentoML scans your project and creates a Bento: an immutable, versioned bundle that contains everything your application needs, including your Python files, dependencies, models, and metadata. Bentos are self-contained artifacts that can be versioned, tested, and deployed consistently across environments. This ensures that what you tested locally is exactly what runs in production, avoiding the “it works on my machine” problem.

The name of the Bento is derived from the name of the service class you defined in service.py. For example, if your service class is named LLMService, BentoML creates a Bento bundle with the name llm_service (lowercase by convention). This makes it easy to identify and version your application bundles directly based on your service definition.
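
You can confirm the Bento was created, and check its exact tag, by listing the Bentos stored locally:

bentoml list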

To turn this Bento into a Docker image, run:

bentoml containerize llm_service:latest

This builds a production-ready Docker image based on your service definition and configuration. Internally, BentoML uses the metadata defined in service.py and the generated Bento to construct the image, including the specified base image, your package dependencies, system packages, and service entry points.

Now, to complete the pipeline, push the Docker image to the remote registry you use for deployment. For this tutorial, use the public registry on Docker Hub. If you do not already have a Docker Hub account, create one at hub.docker.com and generate an access token. You will need your username and the access token to push images.

Log into Docker Hub from your terminal:

docker login

Tag your image:

docker tag llm_service:latest <your_dockerhub_username>/llm_service:latest

Push the image to Docker Hub:

docker push <your_dockerhub_username>/llm_service:latest

Because the Docker image is built directly from your application’s service definition, you do not need to maintain a separate Dockerfile or deployment script. If you want to change the base image, add labels, or adjust container behavior, you can do it through your BentoML configuration. This keeps everything application-centric and easy to modify.
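
For example, here is a hypothetical variation of the bentoml_image definition in service.py. The base_image argument and system_packages method shown here are assumptions, so verify them against the Image API documented for your BentoML version before relying on them:

# Hypothetical variation of the image definition in service.py.
# base_image and system_packages are assumptions; check the BentoML Image API
# for your pinned version before relying on these exact names.
bentoml_image = bentoml.images.Image(
    base_image="python:3.10-slim",
).requirements_file("./requirements.txt").system_packages("curl")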

This clean workflow, from Python code to versioned Bento to containerized image, is one of BentoML’s key advantages. It eliminates the need to manage deployment artifacts separately, keeps your configuration close to your application code, and ensures reproducibility and consistency in every environment.

Automating testing and deployment with CircleCI

Once your service is ready and tested locally, the next step is to automate the entire build and deployment process. This ensures that every time you push changes to your repository, your application is tested, built, and deployed without any manual steps. Automation not only saves time, it reduces the chance of human error and keeps your deployment reproducible and consistent.

For this tutorial, you will use CircleCI, a cloud-based continuous integration and delivery (CI/CD) platform that integrates seamlessly with your code repository and Docker Hub.

Here is a config.yml file you can place in your project under .circleci/config.yml:

# File Name: .circleci/config.yml

version: 2.1

executors:
  python-docker:
    docker:
      - image: cimg/python:3.10

jobs:
  test_service:
    executor: python-docker
    steps:
      - checkout
      - run: python -m pip install --upgrade pip
      - run: pip install -r requirements.txt
      - run: pytest tests/

  build_and_deploy_bento:
    executor: python-docker
    steps:
      - checkout
      - setup_remote_docker
      - run: python -m pip install --upgrade pip
      - run: pip install -r requirements.txt
      - run: bentoml build
      - run:
          name: Login to Docker Hub
          command: |
            echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin
      - run:
          name: Build Docker image
          command: |
            bentoml containerize llm_service:latest -t $DOCKER_USERNAME/llm_service:latest
      - run:
          name: Push Docker image
          command: |
            docker push $DOCKER_USERNAME/llm_service:latest

workflows:
  version: 2
  test_build_deploy:
    jobs:
      - test_service
      - build_and_deploy_bento:
          requires:
            - test_service

This file defines your entire pipeline. At a high level, the pipeline consists of two jobs:

  • test_service installs the project dependencies and runs the unit tests with pytest. It validates new code changes before they move further through the deployment workflow; if any test fails, the pipeline stops at this stage, in line with sound continuous integration practice.
  • build_and_deploy_bento performs the core deployment work. It builds the Bento bundle, containerizes it into a Docker image, authenticates with Docker Hub, tags the image, and pushes it to your Docker Hub repository.

The workflows section orchestrates the sequence of these jobs, enforcing that the build_and_deploy_bento job is initiated only upon the successful completion of test_service. This sequential dependency is a fundamental tenet of CI/CD, ensuring that only validated code proceeds to deployment.
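
If you only want to publish images from your default branch, you can add a CircleCI branch filter to the deploy job. Here is a sketch of the adjusted workflows section, assuming your default branch is named main:

# Optional: restrict the deploy job to the main branch (branch name is an assumption).
workflows:
  version: 2
  test_build_deploy:
    jobs:
      - test_service
      - build_and_deploy_bento:
          requires:
            - test_service
          filters:
            branches:
              only: main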

By defining this pipeline, you automate the entire deployment process. Your Docker image on Docker Hub is now always up to date with your repository, ready to be deployed anywhere with a simple docker run.

Once your Docker image is pushed to Docker Hub, it becomes the single source of truth for your application. You can deploy it anywhere Docker runs: a VM, a Kubernetes cluster, or a serverless container service like AWS Fargate or Google Cloud Run. Pull the latest image from your Docker Hub repository and start a container.
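
For example, on any machine with Docker installed, pulling and running the published image might look like this (BentoML containers listen on port 3000 by default, and the service still needs your OpenAI key at runtime):

docker pull <your_dockerhub_username>/llm_service:latest
docker run --rm -p 3000:3000 -e OPENAI_API_KEY=<YOUR_API_KEY_HERE> \
    <your_dockerhub_username>/llm_service:latest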

Many modern deployment platforms, such as Kubernetes with an imagePullPolicy, or CI-integrated tools like ArgoCD or Flux, can automatically watch your repository for new image tags and restart the service whenever a new image is available. This ensures that your production environment stays in sync with your latest tested and deployed code, giving you a seamless, automated deployment pipeline.

Setting up your project on CircleCI

Before you can set up a project on CircleCI, you first need to upload your code to GitHub. Create a new file named .gitignore in the project root directory. The content you will add defines files and folders that should not be pushed to GitHub. Add this content:

.env
__pycache__
.DS_Store
venv
.pytest_cache

You can now push your code to a GitHub repository.

Next, log into your CircleCI account and create a new project. Before you can trigger the pipeline, you need to configure environment variables. On the CircleCI sidebar, click Projects. Click the ellipsis in your projects row, then click Project Settings.

Opening project settings

On the project settings page, click Environment Variables and add these environment variables:

  • OPENAI_API_KEY: your OpenAI API key
  • DOCKER_USERNAME: your Docker Hub username
  • DOCKER_PASSWORD: your Docker Hub access token

Adding environment variables

You can now trigger the pipeline manually. It should execute successfully.

CircleCI successful execution

Conclusion

The pipeline you created in this guide forms the basis of a reliable and efficient way to deploy LLM services. By combining BentoML with CircleCI, you established a clear separation between application logic and the automated testing, packaging, and deployment handled by the pipeline. As your team and application scale, this separation pays off, especially when deploying production-grade LLM services that may see significant traffic.

You can feel confident using this pipeline “as-is” to deploy internal tools, chatbots, or APIs powered by LLMs. Every version goes through testing and is deployed consistently as a container. When your use cases become more complex, you can extend the pipeline with more advanced deployment strategies, such as blue/green or canary deployments, which minimize downtime and risk while rolling out updates. You can also add rollback capabilities for failed deployments, or use A/B testing to run different versions of the same model in production.

For large-scale deployments, this pattern will fit into any orchestration platform, like Kubernetes or managed platforms that can pull updated images and manage service restarts. You can also create enrichment steps in the pipeline to scan images for vulnerabilities, enforce code quality, or trigger downstream workflows.

This CI/CD workflow provides a starting point, and is reusable for future projects. You can access the full code for this project on GitHub.


Muhammad Arham is a Deep Learning Engineer specializing in Computer Vision and Natural Language Processing, with experience developing globally top-ranking generative AI applications for the Google Play Store. He is passionate about constructing and optimizing machine learning models for intelligent systems, and firmly believes in continuous improvement within this rapidly evolving field.