With automation and CI/CD practices, the entire AI workflow can be run and monitored efficiently, often by a single expert. Still, running AI/ML on GPU instances has its challenges. This tutorial shows you how to meet those challenges using the control and flexibility of CircleCI runners combined with Scaleway, a powerful cloud ecosystem for building, training, and deploying applications at scale. We will also demonstrate the cost effectiveness of ephemeral runners that consume only the resources required for AI/ML work.

Prerequisites

This tutorial builds on an MLOps pipeline example first introduced in our blog series on CI and CD for machine learning. We recommend reading that series first to gain a better understanding of the project we will be working with.

The repository differs slightly from the previous examples so that it works with Scaleway as the cloud provider and Pulumi for provisioning.

If you don’t have them yet, you will need to create CircleCI, Scaleway, and Pulumi accounts.

Scaleway offers 100 EUR of credit for newly created accounts, which will come in handy when trying this out. The credit is also valid for certain GPU instances, including the one the demo project uses, provided you verify your identity and payment method following their instructions.

For CircleCI we will be using the free tier. Same for Pulumi.

For CircleCI, make sure you are an admin in the organization you are using for the project. Configuring new runner namespaces will require admin access.

Note: This is an advanced tutorial, not aimed at beginners; it assumes familiarity with CircleCI, CI/CD concepts, and infrastructure as code.

High-level project flow

As the pipeline starts, we first provision our environment and create a new runner resource class in CircleCI. This is how CircleCI will communicate with our Scaleway infrastructure.

We will then use Pulumi to provision two GPU instances on Scaleway:

  • One to host our CircleCI runner and act as our CI/CD agent for training and deploying the model
  • Another to act as a model server, to which we will deploy our trained models

Note: In real-world production scenarios, you would likely not provision the model serving instance from the same pipeline, as you would need it to be permanent rather than ephemeral.

After everything is provisioned, we will install the required dependencies, then train, test, and deploy our model, all executed on the newly provisioned CircleCI runner. These jobs are covered thoroughly in the CI/CD for ML blog post series, so we will only glance at them here.

Finally, the resources are cleaned up, and the newly created CircleCI runner is removed so the pipeline can run again.

Walkthrough and project setup

We recommend you fork the sample repository and continue from there.

Once you have a fork of the project in your GitHub account, you can set it up on CircleCI as a new project.

This guide will show you how to get started, and then walk you through various files that comprise the pipeline.

Preparing environment variables

We need a number of secrets and environment variables set up before the pipeline can be run. The secrets are split logically into four contexts.

First, create a new CircleCI API key and store it in a CircleCI context named circleci-api as CIRCLECI_CLI_TOKEN. This is used to provision new runners from within the pipeline.

Next, create a new Pulumi access token and store it in a context named pulumi as PULUMI_ACCESS_TOKEN.

Next, create a Scaleway API key. This generates two values: an access key and a secret key. Create a scaleway context and store them as SCW_ACCESS_KEY and SCW_SECRET_KEY, respectively.

Finally, create a context ml-scaleway-demo and populate it with three environment variables:

  1. DEPLOY_SERVER_USERNAME (we use demo)
  2. DEPLOY_SERVER_PASSWORD (we use demodemo)
  3. DEPLOY_SERVER_PATH as /var/models

Pulumi project and stack setup

Pulumi is the tool that will help us provision infrastructure. It offers an SDK-based approach to infrastructure provisioning in many different programming languages, which allows us to use Python, the same language as the rest of the AI/ML scripts.

In Pulumi you will need to create a new project and stack. A stack is an isolated, independently configurable instance of a Pulumi project, often one per environment. Our project is located in the org zmarkan-demos and is named cci-ml-runner. It contains one stack, cci-runner-linux.

The Pulumi files are located in the pulumi directory; you might want to modify them with your preferred project and stack names, as well as your Scaleway configuration.
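If you prefer to register the stack from your local machine before the first pipeline run, a minimal sketch looks like this (the names are ours, so substitute your own org and stack; install the Python dependencies from requirements.txt first if you also want to run a preview):

cd pulumi
pulumi login                                    # uses PULUMI_ACCESS_TOKEN if it is set
pulumi stack init zmarkan-demos/cci-runner-linux
pulumi preview                                  # optional sanity check of the planned resources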

Scaleway project setup

In Scaleway we also created a new project named cci-ml-runner. Go to Project Settings, copy the Project ID (it should be in UUID format), and paste it into the file pulumi/Pulumi.cci-runner-linux.yaml where you see scaleway:project_id. Leave the rest of the file unchanged; we need this specific region and zone combination to access the GPU resources.

config:
 scaleway:project_id: YOUR_PROJECT_ID_UUID
 scaleway:region: fr-par
 scaleway:zone: fr-par-2

To use the Scaleway instances from your local command line, you might also want to add your SSH key to the Scaleway project. Without it you won’t be able to SSH into any of the instances to debug or inspect them.
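For example, once the stack is up, you could reach the runner instance like this (a sketch only, assuming the default root user and that your key is registered with the project):

# Grab the runner instance IP from the Pulumi outputs and SSH into it
ssh root@"$(pulumi stack output cci_runner_ip --cwd pulumi --stack zmarkan-demos/cci-ml-runner/cci-runner-linux)"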

Setting up the CircleCI pipeline

In our pipeline we have one workflow of interest: build-deploy. It contains every job in the pipeline, from provisioning the runner and infrastructure through to tearing everything down again.

Let’s look at the first job: provision_runner.

Provision Runner

This job does most of the heavy lifting for provisioning cloud infrastructure and configuring the runner. It does it all in an automated way so that instances are truly ephemeral and consume only the resources required for the rest of the AI/ML work.

The job runs in a standard CircleCI Docker executor:

jobs:
 provision_runner:
   docker:
     - image: cimg/python:3.11

We first install the CircleCI CLI, which we will use to create the new runner resource class. The CLI authenticates using the CIRCLECI_CLI_TOKEN environment variable we created earlier.

- run:
          name: Install CircleCI CLI
          command: |
            # Make CircleCI CLI available at /usr/local/bin/circleci
            curl -fLSs https://raw.githubusercontent.com/CircleCI-Public/circleci-cli/main/install.sh | sudo bash

Then we provision a new runner resource class using the CLI and pass it to the cloud-init script used by Pulumi for provisioning:

- run:
         name: Provision new runner and prepare cloud-init file
         command: |
           runner_token_response=$(/usr/local/bin/circleci runner resource-class create zans-cci-org/scaleway-linux-<<pipeline.number>> "Autoprovisioned Linux runner on Scaleway" --generate-token)
           export runner_token=$(echo $runner_token_response | grep "auth_token:" | awk '{print $3}')
           sed "s/RUNNER_TOKEN/${runner_token}/g" pulumi/runner_cloud_init_base.yml > pulumi/runner_cloud_init.yml

To break it down line by line:

runner_token_response=$(/usr/local/bin/circleci runner resource-class create zans-cci-org/scaleway-linux-<<pipeline.number>> "Autoprovisioned Linux runner on Scaleway" --generate-token)

This runs the circleci runner resource-class create command to create your runner’s resource class. In the example, we use the zans-cci-org namespace, but you should create your own.

Note: Remember that once a namespace is created for your organization, it cannot be changed, and it must be used for all runner resource classes you create from then on. For eternity.

The full resource class name is zans-cci-org/scaleway-linux-<<pipeline.number>>, which uses the pipeline.number pipeline value to inject a numeric value into the name so that it is unique for each run.

We also give it a description: “Autoprovisioned Linux runner on Scaleway”. This can be changed to anything you like.

Finally, the --generate-token flag ensures that we create a new token for our runner’s resource class.

The whole command is wrapped in a command substitution, and the result is stored in the runner_token_response variable to be used in the next line of the script.

export runner_token=$(echo $runner_token_response | grep "auth_token:" | awk '{print $3}')

This extracts just the token value into the runner_token variable, by filtering for the auth_token field with grep and picking out the token’s value with awk.

Tip: If you are struggling to figure out the right command to parse a value out of a response, feed the whole response into an LLM and ask it for a command to extract the part you need.

The final line injects the value of the runner resource class token into our cloud-init template to be used by Pulumi.

sed "s/RUNNER_TOKEN/${runner_token}/g" pulumi/runner_cloud_init_base.yml > pulumi/runner_cloud_init.yml

In pulumi/runner_cloud_init_base.yml we have a stub for the token, which we replace using sed. This is a practical way to do templating in CI/CD pipelines.

Next, we run the two Pulumi commands to provision our infrastructure:

     - pulumi/login
     - pulumi/update:
         stack: zmarkan-demos/cci-ml-runner/cci-runner-linux
         working_directory: pulumi

These commands use the Pulumi orb for CircleCI, which is declared at the top of the .circleci/config.yml file:

orbs:
 pulumi: pulumi/pulumi@2.1.0

pulumi/login authenticates with the authentication token we provided as a secret, and pulumi/update runs the actual provisioning for the specified stack.

Make sure to change the stack value to reflect your own organization, project, and stack names.

After the Pulumi update has run, we take the IP address of the newly created model server from the Pulumi output and store it in a .env file as the variable DEPLOY_SERVER_HOSTNAME. That file is persisted to the CircleCI workspace so it can be shared with subsequent jobs.

- run:
         name: Store model server IP to workspace
         command: |
           mkdir workspace
           echo "DEPLOY_SERVER_HOSTNAME=$(pulumi stack output modelserver_ip  --cwd pulumi --stack  zmarkan-demos/cci-ml-runner/cci-runner-linux)" > workspace/.env
     - persist_to_workspace:
         root: workspace
         paths:
           - .env

For that we use the pulumi stack output command, requesting the modelserver_ip output exported by Pulumi. We pass it the correct working directory (--cwd pulumi) and the correct stack (--stack zmarkan-demos/cci-ml-runner/cci-runner-linux). The result is written to workspace/.env, and the file is stored in a workspace using the persist_to_workspace step.

That’s it for the provision_runner job. The whole job looks like this:

jobs:
 provision_runner:
   docker:
     - image: cimg/python:3.11

   steps:
     - checkout
      - run:
          name: Install CircleCI CLI
          command: |
            # Make CircleCI CLI available at /usr/local/bin/circleci
            curl -fLSs https://raw.githubusercontent.com/CircleCI-Public/circleci-cli/main/install.sh | sudo bash

     - run:
         name: Provision new runner and prepare cloud-init file
         command: |
           runner_token_response=$(/usr/local/bin/circleci runner resource-class create zans-cci-org/scaleway-linux-<<pipeline.number>> "Autoprovisioned Linux runner on Scaleway" --generate-token)
           export runner_token=$(echo $runner_token_response | grep "auth_token:" | awk '{print $3}')
           sed "s/RUNNER_TOKEN/${runner_token}/g" pulumi/runner_cloud_init_base.yml > pulumi/runner_cloud_init.yml

     - pulumi/login
     - pulumi/update:
         stack: zmarkan-demos/cci-ml-runner/cci-runner-linux
         working_directory: pulumi
     - run:
         name: Store model server IP to workspace
         command: |
           mkdir workspace
           echo "DEPLOY_SERVER_HOSTNAME=$(pulumi stack output modelserver_ip  --cwd pulumi --stack  zmarkan-demos/cci-ml-runner/cci-runner-linux)" > workspace/.env
     - persist_to_workspace:
         root: workspace
         paths:
           - .env

Pulumi provisioning scripts and Scaleway GPU resource configuration

Pulumi scripts live in the pulumi directory of the project. This is where we configure the resources. The files of note are:

  • Pulumi.yaml: Project and language configuration for the SDK.
  • Pulumi.cci-runner-linux.yaml: Stack-specific configuration, such as the Scaleway project ID, region, and zone. We populated this with your project ID earlier.
  • requirements.txt: Python’s dependency spec for Pulumi.
  • __main__.py: The main Pulumi script where the resources are declared.
  • runner_cloud_init_base.yml: cloud-init template script for the CircleCI runner instance, executed at first boot to bootstrap it.
  • modelserver_cloud_init.yml: cloud-init script for the model server instance.

Scaleway instances

Looking at __main__.py, we have two resources created using the scaleway.InstanceServer function: modelTrainingCCIRunner and tensorflowServer. Let’s break down modelTrainingCCIRunner:

modelTrainingCCIRunner = scaleway.InstanceServer("runnerServerLinuxGPU",
   type="GPU-3070-S",
   image="ubuntu_jammy_gpu_os_12",
   ip_id=runnerPublicIp.id,
   root_volume=scaleway.InstanceServerRootVolumeArgs(
       size_in_gb=80,
       volume_type="b_ssd",
   ),
   user_data={
       "cloud-init": (lambda path: open(path).read())(f"runner_cloud_init.yml"),
   }
)

First we label it runnerServerLinuxGPU, which is how it will be named in the Scaleway dashboard.

Then, we declare the instance type and image to use.

In our case, we are using GPU-3070-S for type and ubuntu_jammy_gpu_os_12 as the image. Scaleway has a large variety of instance types. We use the smallest GPU on offer — GPU-3070-S with 8 vCPUs and 16GB RAM — but you could go all the way to the top of the range H100-2-80G with two H100 GPUs and 480 GB of RAM.

Scaleway’s GPU instances come with tooling such as CUDA and Docker preconfigured, which makes it easy to start executing ML workloads.

We also need to pass in the root volume (in our case an 80 GB block SSD) and a cloud-init script as user_data.

This is the script that we populated with our CircleCI runner token earlier in the pipeline.

root_volume=scaleway.InstanceServerRootVolumeArgs(
       size_in_gb=80,
       volume_type="b_ssd",
   ),
   user_data={
       "cloud-init": (lambda path: open(path).read())(f"runner_cloud_init.yml"),
   }

The other instance we are configuring is our TensorFlow server, which will serve our models in production using the tensorflow/serving Docker container. It is configured as follows:

tensorflowServer = scaleway.InstanceServer("tensorflowServerLinux",
   type="PRO2-M",
   image="ubuntu_jammy",
   ip_id=serverPublicIp.id,
   root_volume=scaleway.InstanceServerRootVolumeArgs(
       size_in_gb=40,
       volume_type="b_ssd",
   ),
   user_data={
       "cloud-init": (lambda path: open(path).read())(f"modelserver_cloud_init.yml")
   }
)

Note that the cloud-init script is different from the one in the example above (we will look at both of them shortly) and that we are using a CPU-based instance, not a GPU one. Ideally, model serving would also be done on a GPU-based instance. However, to keep this tutorial accessible, we are staying on the Scaleway free plan, which is limited to one concurrent GPU instance.

Finally, the IPs and IDs of both created instances are exported. We are only using modelserver_ip for the time being in the pipeline.

pulumi.export("cci_runner_ip", modelTrainingCCIRunner.public_ip)
pulumi.export("cci_runner_id", modelTrainingCCIRunner.id)
pulumi.export("modelserver_id", tensorflowServer.id)
pulumi.export("modelserver_ip", tensorflowServer.public_ip)

Now, let’s look at the two cloud-init scripts that bootstrap the instances. We will start with the runner_cloud_init_base.yml, which configures the CircleCI runner on the GPU instance:

#!/bin/sh
export runner_token="RUNNER_TOKEN"
echo "Runner token $runner_token"

# CircleCI Runner installation
curl -s https://packagecloud.io/install/repositories/circleci/runner/script.deb.sh?any=true | sudo bash

sudo apt-get install -y circleci-runner python3.10-venv

# Add CCI user to Docker
usermod -aG docker circleci

sudo sed -i "s/<< AUTH_TOKEN >>/$runner_token/g" /etc/circleci-runner/circleci-runner-config.yaml

# Prepare and start runner daemon
sudo systemctl enable circleci-runner && sudo systemctl start circleci-runner

Both cloud-init scripts are standard shell scripts that perform the configuration each instance requires at first boot.

This script installs the CircleCI runner, following the instructions for installing Machine Runner 3.0 as of November 2023.

We set up the apt repository using the runner installation script, install the circleci-runner package, and add the newly created circleci user to the docker group.

We then write the runner_token value (injected into this script by the pipeline when the resource class was provisioned) into circleci-runner-config.yaml, which allows the runner to authenticate with CircleCI’s servers.

Finally, we start the runner daemon with sudo systemctl enable circleci-runner && sudo systemctl start circleci-runner.

Note: If you want to do any more runner setup in the cloud-config script, make sure to leave the instruction to start the runner as the last thing in the script. Once systemctl start circleci-runner gets called, the runner is available to any jobs requiring that resource class, and if an instance hasn’t finished setting up, you might encounter unexpected behavior.

Now, let’s look at the other cloud-init script: modelserver_cloud_init.yml.

This script installs Docker Engine, sets up the tensorflow-serving Docker container, and prepares our environment for pushing and deploying new models.

We also create a new SSH user demo, which will be used by the pipeline to push models to the server.

The Docker Engine installation steps are taken from this article on Scaleway’s tutorials site, which you can consult for more details.

To prepare the serving and upload directories we need to first create /var/models and set ownership to the docker group:

# Create the directories and grant permissions so that the user defined in the .env file and docker can read/write to them
sudo mkdir -p /var/models/staging # so that docker will have something to bind to, it will be populated later
sudo mkdir -p /var/models/prod
sudo chown -R $USER:docker /var/models
sudo chmod -R 775 /var/models

Next we create the demo user that will be used for uploading models from the pipeline:

# Create demo user for sftp upload
useradd demo
mkdir /home/demo
chown demo:demo /home/demo
# # set demo user password 
# Warning! hardcoded for demo purpose as this server is short-lived
echo 'demo:demodemo' | chpasswd

# add demo user to docker group
usermod -aG docker demo

# Allow SSH access via password auth
sed -i 's|PasswordAuthentication no|PasswordAuthentication yes|g' /etc/ssh/sshd_config
systemctl restart ssh

You might notice some hard-coded strings for password authentication in here. This is of course only for demo purposes, as the serving server is only live for the duration of the pipeline and its IP is not known outside the project. In a production setting we would keep this server up permanently and enable access through more secure means, such as SSH keys.
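For reference, here is a sketch of what key-based access for the demo user could look like in the same cloud-init script (the public key is a placeholder you would replace with your own):

# Give the demo user an authorized SSH key instead of a password
mkdir -p /home/demo/.ssh
echo 'ssh-ed25519 AAAA...replace-with-your-public-key...' >> /home/demo/.ssh/authorized_keys
chown -R demo:demo /home/demo/.ssh
chmod 700 /home/demo/.ssh
chmod 600 /home/demo/.ssh/authorized_keys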

Finally, download the tensorflow_serving image and run the container:


# Download the TensorFlow Serving Docker image and repo
docker pull tensorflow/serving

# Create a TensorFlow Serving container with the directories configured for use with this example
docker run -d --name tensorflow_serving -p 8501:8501 -v /var/models/prod:/models/my_model -e MODEL_NAME=my_model tensorflow/serving

Now, our model server is ready for the newly trained models to be uploaded!
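Once a model has been deployed to /var/models/prod, you can check that TensorFlow Serving has loaded it by querying its REST API on the port we exposed. This is a quick sanity check, not part of the pipeline itself:

# Replace MODEL_SERVER_IP with the server's public IP (the modelserver_ip Pulumi output)
curl http://MODEL_SERVER_IP:8501/v1/models/my_model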

Using the new runner resource class to run AI/ML workloads in a CircleCI pipeline

After provisioning of cloud infrastructure and runner resources is complete, we can move on to the “fun stuff”. The subsequent jobs — install-build, train, test, package, deploy, and test-deployment — are based on our existing blog post. You can review that tutorial for more detail. This tutorial covers the differences between the two repositories that are specific to building on Scaleway’s GPU infrastructure.

In short, for each of the jobs there is a corresponding Python script in the ml/ directory that performs the tasks for that segment of the pipeline: building the dataset, training, testing, and so on. The jobs use the CircleCI workspace to pass intermediate artifacts between each other.

In all of these jobs we will use the newly created resource class zans-cci-org/scaleway-linux-NUMBER. For ease of reuse, it is declared as an executor in the config.yml:

executors:
 # One freshly baked runner, straight from the boulang... err, pipeline
 scaleway_runner_linux:
   machine: true
   resource_class: zans-cci-org/scaleway-linux-<< pipeline.number >>

The resource class name, zans-cci-org/scaleway-linux-<< pipeline.number >>, corresponds to the number of the pipeline being run. For example, on the 120th run of our pipeline, the resource class name would have 120 as its suffix. This ensures that our runner resources are recreated fresh on every run.

All the jobs declare this executor, which makes them run on our Scaleway instance.

install-build:
   executor: scaleway_runner_linux
   steps:
    …

When the model has been trained and tested, it’s time to package and deploy it. We will use our model server instance for that, passing the newly created instance’s IP to the jobs via the workspace.

We already used the workspace to store our .env file containing DEPLOY_SERVER_HOSTNAME, and now we need to retrieve it in the package job.

We have a populate-env command which grabs it from the workspace and makes it available to subsequent steps in the job as an environment variable via the $BASH_ENV file:

populate-env:
   steps:
     - attach_workspace:
         at: .
     - run:
         name: Restore secrets from workspace and add to environment vars
         # Environment variables must be configured in a CircleCI project or context
         command: |
           cat .env >> $BASH_ENV
           source $BASH_ENV

The ml/4_package.py script then takes the created model and uses SFTP to upload it to the model server’s staging directory.
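The upload itself happens in Python inside that script. Purely as an illustration, an equivalent transfer from a shell would look roughly like this, assuming sshpass is available and the environment variables from our context and workspace are set (the local ./model directory is a placeholder for wherever the packaged model lives):

# Copy the packaged model into the server's staging directory over SSH/SFTP
sshpass -p "$DEPLOY_SERVER_PASSWORD" \
  scp -o StrictHostKeyChecking=no -r ./model \
  "$DEPLOY_SERVER_USERNAME@$DEPLOY_SERVER_HOSTNAME:$DEPLOY_SERVER_PATH/staging/"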

The same step is used in the deploy and test-deployment jobs, which also need access to the server’s IP address for their work.
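To illustrate the kind of call test-deployment can make (the exact checks live in the job’s Python script, and the input shape depends on the model you trained), a TensorFlow Serving prediction request over REST looks like this:

# MODEL_SERVER_IP is the modelserver_ip output; the instances payload is a placeholder
curl -s -X POST "http://MODEL_SERVER_IP:8501/v1/models/my_model:predict" \
  -H 'Content-Type: application/json' \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4]]}'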

Now, let’s have a look at the orchestrated build-deploy workflow with all the jobs:

workflows:
 # This workflow does a full build from scratch and deploys the model
 build-deploy:
   jobs:
     - provision_runner:
         context:
           - pulumi
           - scaleway
           - circleci-api
     - install-build:
         requires:
           - provision_runner
         context:
           - ml-scaleway-demo
     - train:
         requires: 
           - install-build
     - test:
         requires:
           - train
     - package:
         requires:
           - test
         context:
           - ml-scaleway-demo
     # Do not deploy without manual approval - you can inspect the console output from training and make sure you are happy to deploy
     - deploy:
         requires:
           - package
         context:
           - ml-scaleway-demo
     - test-deployment:
         requires:
           - deploy
         context:
           - ml-scaleway-demo
     - approve_destroy:
         type: approval
         requires:
           - test-deployment
     - destroy_runner:
         context:
           - pulumi
           - scaleway
           - circleci-api
         requires:
           - approve_destroy

In the workflow definition we orchestrate all the jobs in our pipeline, specifying the conditions under which each one runs. We also pass in the contexts that hold the right environment variables.

For example, our provision_runner job needs access to multiple environment variables to access Pulumi, Scaleway, and the CircleCI API keys, so we pass all three to it.

     - provision_runner:
         context:
           - pulumi
           - scaleway
           - circleci-api

Similarly, the deploy job needs access to the deployment server credentials and path, which are stored in the ml-scaleway-demo context:

- deploy:
         requires:
           - package
         context:
           - ml-scaleway-demo

The final jobs in the workflow are approve_destroy, which introduces a manual approval step, and destroy_runner, which cleans up the created infrastructure:

- approve_destroy:
         type: approval
         requires:
           - test-deployment
     - destroy_runner:
         context:
           - pulumi
           - scaleway
           - circleci-api
         requires:
           - approve_destroy

Destroy runner and clean up the environment

Let’s look at the process to clean up the created infrastructure in the destroy_runner job:

 destroy_runner:
   docker:
     - image: cimg/python:3.11
   steps:
     - checkout
     - run:
         name: Install CircleCI CLI
         command: |
           # Make CircleCI CLI available at /usr/local/bin/circleci
           curl -fLSs https://raw.githubusercontent.com/CircleCI-Public/circleci-cli/main/install.sh | sudo bash

     - run:
         name: Remove Runner token and Resource class
         command: |
           runner_resource_class="zans-cci-org/scaleway-linux-<< pipeline.number >>"
           token_output=$(circleci runner token ls $runner_resource_class)
           echo $token_output

           # Grab UUID
           runner_token_id=$(echo $token_output | grep -o -E '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}')

           echo $runner_token_id

           /usr/local/bin/circleci runner token delete $runner_token_id
           /usr/local/bin/circleci runner resource-class delete $runner_resource_class

     - pulumi/login
     - pulumi/destroy:
         stack: zmarkan-demos/cci-ml-runner/cci-runner-linux
         working_directory: pulumi

Unsurprisingly, we are again using a CircleCI hosted Docker executor to run it, and we install the CircleCI CLI again.

To destroy the runner resource class, we need to know its name (zans-cci-org/scaleway-linux-<< pipeline.number >> in our case), and we must first identify and delete its token.

token_output=$(circleci runner token ls $runner_resource_class)

           # Grab UUID
           runner_token_id=$(echo $token_output | grep -o -E '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}')

           echo $runner_token_id

           /usr/local/bin/circleci runner token delete $runner_token_id

We get the token list by running circleci runner token ls $runner_resource_class and extract the token ID using grep. The ID is in UUID format, so a regular expression that matches a UUID works well here.

Finally, we pass the token ID to runner token delete.

Once the corresponding token has been deleted, we can delete the resource class as well:

           /usr/local/bin/circleci runner resource-class delete $runner_resource_class

To destroy the corresponding infrastructure, we use the Pulumi orb again and invoke the pulumi/destroy command, which does exactly the opposite of the pulumi/update we ran when provisioning.

      - pulumi/login
      - pulumi/destroy:
          stack: zmarkan-demos/cci-ml-runner/cci-runner-linux
          working_directory: pulumi

Conclusion

This brings us to the end of our tutorial. We covered the intricacies of using CircleCI runners to execute AI/ML workloads on your own infrastructure. In our example we used Scaleway and its optimized GPU compute instances, but the principles covered can be applied to any type of infrastructure. We also showed how to take advantage of the flexibility the cloud offers by using resources only when they are needed. That is common practice for a CI/CD pipeline with a clearly defined beginning and end, but it is less common with more traditional infrastructure. By leveraging as-needed cloud resources, CircleCI can help you manage your AI/ML workflows more efficiently.

CircleCI is the leading CI/CD platform for managing and automating AI workflows. It takes minutes to set up, and you can evaluate it at no cost, with up to 6,000 minutes of free usage per month. Sign up for your free account to get started today.