CD for machine learning: Deploy, monitor, retrain
While there are an increasing number of off-the-shelf machine learning (ML) solutions that promise to adapt to your specific requirements, organizations that are serious about investing in ML for the long term are building their own workflows tailored exactly to their data and the outcomes they expect. To make full use of this investment, ML models must be kept up to date and working from the freshest available data.
The first part of this tutorial showed how an ML workflow can be broken down into steps and automated using CircleCI to create a continuous integration (CI) pipeline that builds, trains, tests, and packages ML models. This tutorial will complete this ML workflow by adding deployment and retraining steps for continuous deployment (CD). It will then explain how each stage in ML workflows can be monitored using CircleCI so that the entire process can be observed by the relevant stakeholders, model updates can be approved, and problems can be resolved quickly using CircleCI’s web console.
What is CD, and what can it do for your ML models and workflows?
Continuous integration and continuous deployment (CI/CD) platforms are used in software development to automate the testing, building, and deployment of committed code, increasing development speed and ensuring product quality. As ML models are software, CI/CD platforms are ideal for increasing the speed of model training and deployment while ensuring that they remain accurate.
MLOps applies these automation techniques to retraining ML models and testing their accuracy. As new data arrives or models are updated, the models are automatically rebuilt or retrained and then tested to ensure that they meet specifications before they are deployed. Scheduled tasks can be run to test data and re-validate models, further reducing the labor required by engineers.
This frees your ML experts from manually running pipelines, testing data, and deploying vetted models. They can focus on building more accurate models instead of manually managing the retraining and testing of existing models as new data is ingested.
Another benefit of automating your ML workflows is future-proofing. It’s not just data that will be regularly refreshed — new ML tools and methodologies are constantly emerging, and automating their adoption and testing helps get them up and running much faster. CircleCI allows you to run your ML pipelines either locally or in managed cloud environments, including GPU-enabled environments. This can greatly speed up ML training tasks and ensures that you have the infrastructure available to use the latest and greatest ML toolchains without having to manage any infrastructure.
What we’ve built so far: automated building, testing, training, and packaging with a CI workflow
The first part of this tutorial broke down a simple ML workflow and mapped out the continuous integration (CI) steps — building and training the ML model and then testing and packaging it for use — and demonstrated automating them using CircleCI. This article covers the continuous deployment (CD) steps — deploying the packaged model, verifying the deployment, and retraining the model as new data arrives — and shows how they can be implemented in a CircleCI automated workflow.
Adding deployment and retraining to your ML workflow
Below, you will see how to add deployment steps to this workflow and create a separate retraining workflow, using the example ML project in the example repository.
This tutorial is focused on how you can break down your ML workflows and automate them with CircleCI, not how to build an ML model. The ML workflow used in the example was taken from TensorFlow’s documentation and then broken down into individual Python scripts for each step.
Prerequisites and set up
In the example ML repository, the 4_package.py script uploads the trained and packaged model to a server via SSH. This was the last step demonstrated in part 1 of this tutorial. Below, we will deploy the packaged model to a TensorFlow Serving server. To keep things simple, we’ll assume that this server is running in a Docker container on the same host that we uploaded the packaged models to in part 1.
Setting up TensorFlow Serving
A Bash script is supplied for spinning up a Docker container running TensorFlow Serving for testing:
bash ./tools/install_server.sh
Note that you will first need to install Docker according to its installation instructions for your platform.
The ML deployment and retraining Python scripts will use the same SSH credentials from part 1 that were used to upload the packaged models. These credentials are stored as CircleCI environment variables and written to the .env configuration file. They will be used to interact with Docker on the deployment server.
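As a rough sketch of how these credentials might be consumed, the snippet below reads the .env file with python-dotenv and opens an SSH session with paramiko. The environment variable names are illustrative assumptions, not necessarily the ones used in the example repository.

# Minimal sketch: load SSH credentials from .env and connect to the deployment server.
# The environment variable names below are assumptions for illustration only.
import os

import paramiko
from dotenv import load_dotenv

load_dotenv()  # Reads the .env file written from CircleCI environment variables

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    hostname=os.environ["DEPLOY_SERVER_HOSTNAME"],
    username=os.environ["DEPLOY_SERVER_USERNAME"],
    password=os.environ["DEPLOY_SERVER_PASSWORD"],
)

# Check that Docker (and the TensorFlow Serving container) is up on the deployment server
_, stdout, _ = client.exec_command("docker ps")
print(stdout.read().decode())
client.close()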
Deploying packaged models
Once a deployment environment has been set up, packaged models can be deployed and used. This stage involves deploying your trained and packaged model to your production ML environment.
In the example repository, the packaged model is uploaded to the directory TensorFlow Serving loads its models from.
The Python code for this step is in ml/5_deploy.py.
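As a rough illustration of what a deployment step like this might do (the actual ml/5_deploy.py may differ), the sketch below uploads the packaged model over SFTP with paramiko and unpacks it into a directory that TensorFlow Serving is assumed to watch. The model name, version number, and paths are assumptions.

# Illustrative deployment sketch: copy the packaged model to the TensorFlow Serving
# model directory on the deployment server. Names and paths are assumptions.
import os

import paramiko
from dotenv import load_dotenv

load_dotenv()

MODEL_NAME = "my_model"                 # Assumed model name
MODEL_VERSION = "1"                     # TensorFlow Serving loads numbered version directories
LOCAL_PACKAGE = "model.tar.gz"          # Assumed output of the package step
REMOTE_DIR = f"/opt/tf_serving/{MODEL_NAME}/{MODEL_VERSION}"  # Assumed serving path

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    hostname=os.environ["DEPLOY_SERVER_HOSTNAME"],
    username=os.environ["DEPLOY_SERVER_USERNAME"],
    password=os.environ["DEPLOY_SERVER_PASSWORD"],
)

# Upload the packaged model, then unpack it where TensorFlow Serving will find it
sftp = client.open_sftp()
sftp.put(LOCAL_PACKAGE, "/tmp/model.tar.gz")
sftp.close()
client.exec_command(f"mkdir -p {REMOTE_DIR} && tar -xzf /tmp/model.tar.gz -C {REMOTE_DIR}")
client.close()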
Testing the deployed model
Ensuring a successful deployment is important to prevent downtime, so this example makes a quick HTTP POST request to TensorFlow Serving to ensure that it receives a response.
If the request is unsuccessful, the resulting error will be thrown by the Python script.
The Python code for this step is in ml/7_test_deployed_model.py.
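The sketch below shows the general shape of such a smoke test, assuming TensorFlow Serving's REST API is exposed on port 8501. The model name and input shape are assumptions, and the actual ml/7_test_deployed_model.py may differ.

# Minimal post-deployment smoke test sketch. TensorFlow Serving's REST API accepts
# a JSON payload with an "instances" list and returns "predictions".
import os

import requests
from dotenv import load_dotenv

load_dotenv()

MODEL_NAME = "my_model"  # Assumed model name
url = f"http://{os.environ['DEPLOY_SERVER_HOSTNAME']}:8501/v1/models/{MODEL_NAME}:predict"

# A single dummy instance; the 28x28 input shape is an assumption about the example model
payload = {"instances": [[[0.0] * 28] * 28]}

response = requests.post(url, json=payload, timeout=10)
response.raise_for_status()  # Throws an exception (failing the CircleCI job) on an HTTP error

print("Deployed model responded:", response.json()["predictions"][0])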
Retraining deployed models with new data
ML is not a “one and done” task. Whether you are analyzing customer data or user behavior or modeling for scientific purposes, when new, high-quality data arrives, you will want to retrain your existing models so that you are not limited to just your initial data set.
Once your new data has been ingested and validated, you need to retest retrained models using known data to ensure they remain accurate.
While our example won’t be able to load any fresh data, we can still provide a file and pipeline for retraining and retesting.
The Python code for retraining and testing the retrained model is in ml/6_retrain.py.
Note: In this script, the testing step is designed to fail! This is so that you can see what a failed job looks like when this script is added to a job in CircleCI.
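For context, a simplified retraining step might look something like the sketch below: it reloads the saved model, retrains it on newly available data (the standard MNIST split stands in here), and raises an exception if accuracy falls below a threshold. This is an assumption-laden illustration, not the contents of ml/6_retrain.py, which is intentionally written to fail its test.

# Simplified retraining sketch. The model path, dataset, and threshold are assumptions.
import tensorflow as tf

ACCURACY_THRESHOLD = 0.95  # Assumed minimum acceptable accuracy

# Reload the previously trained model and the "new" data
# (MNIST stands in for freshly ingested data in this sketch)
model = tf.keras.models.load_model("model")
(x_new, y_new), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_new, x_test = x_new / 255.0, x_test / 255.0

# Retrain on the new data, then re-evaluate against known test data
model.fit(x_new, y_new, epochs=1)
loss, accuracy = model.evaluate(x_test, y_test)

# Raising an exception fails the CircleCI job, halting the workflow before
# an inaccurate retrained model can reach the package and deploy steps
if accuracy < ACCURACY_THRESHOLD:
    raise ValueError(f"Retrained accuracy {accuracy:.3f} is below {ACCURACY_THRESHOLD}")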
Automating ML model deployment with CircleCI CI/CD
As CircleCI pipelines are all defined as code in the CircleCI configuration file, steps, jobs, and workflows can be added and modified and then committed to version control for testing and deployment.
Following from the first part of this tutorial, the deployment and retraining Python scripts can be defined as jobs and then added to workflows.
Adding deployment and retraining to the CircleCI configuration
Below, deploy and test-deployment steps are added to the existing build-deploy workflow to be run after the package step:
# Do not deploy without manual approval - you can inspect the console output from training and make sure you are happy to deploy
- deploy:
    requires:
      - package
- test-deployment:
    requires:
      - deploy
A retrain-deploy workflow has also been defined to include the new scripts. In this example, it is triggered according to a schedule defined using cron syntax. To see this scheduled workflow in action, you will need to create a branch in your Git repository named retrain.
retrain-deploy:
  # Trigger on a schedule or when retrain branch is updated
  triggers:
    - schedule:
        cron: "0 0 * * *" # Daily
        filters:
          branches:
            only:
              - retrain
  jobs:
    - install-build
    - retrain:
        requires:
          - install-build
    # Do not redeploy without manual approval - you can inspect the console output from training and make sure you are happy to deploy the retrained model
    - hold: # A job that will require manual approval in the CircleCI web application.
        requires:
          - retrain
        type: approval # This key-value pair will set your workflow to a status of "On Hold"
    - package:
        requires:
          - hold
    - deploy:
        requires:
          - package
    - test-deployment:
        requires:
          - deploy
In the retrain-deploy pipeline, a hold step has been added between the retrain and package steps. Pipeline execution will pause here until approval to proceed is given in the CircleCI web console, which is highly useful in ML pipelines where the accuracy of a retrained model needs to be verified before it is used.
The on_fail condition is demonstrated within the retrain job that is called in this workflow. This allows you to take specific actions when a job fails:
- run:
    # You could trigger custom notifications here so that the person responsible for a particular job is notified via email, Slack, etc.
    name: Run on fail status
    command: |
      echo "I am the result of the above failed job"
    when: on_fail
By ensuring that your ML scripts are verbose, you can make sure that the user has the information they require to confirm that a model is ready for use. By throwing exceptions when retraining conditions are not met, pipelines can be halted entirely so that problems can be rectified before they reach production.
Scheduling, branches, and manual pipeline execution
As shown in the code above, the retrain-deploy pipeline is run according to a user-defined schedule. This differs from the build-deploy pipeline from part 1, which only runs when a specified branch is updated.
You can also manually trigger a pipeline at any time from the CircleCI web console, rerun failed jobs, or trigger a pipeline to run using the CircleCI API. Using the CircleCI API, you can set up your data ingestion tools to trigger a CircleCI pipeline externally when new data has arrived.
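As a hedged example of what an external trigger might look like, the snippet below calls the CircleCI API v2 pipeline endpoint from Python. The project slug, branch name, and CIRCLECI_TOKEN environment variable are assumptions you would replace with your own values.

# Trigger a CircleCI pipeline from an external system (e.g. a data ingestion job).
# PROJECT_SLUG and the branch name are assumptions; CIRCLECI_TOKEN is a personal API token.
import os

import requests

PROJECT_SLUG = "gh/your-org/your-ml-repo"  # Assumed project slug
url = f"https://circleci.com/api/v2/project/{PROJECT_SLUG}/pipeline"

response = requests.post(
    url,
    headers={"Circle-Token": os.environ["CIRCLECI_TOKEN"]},
    json={"branch": "retrain"},  # Run the workflows filtered to the retrain branch
)
response.raise_for_status()
print("Triggered pipeline number:", response.json()["number"])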
Build it how you want it
CircleCI can do a lot more than we can demonstrate in the space of this tutorial. It provides a flexible platform for accomplishing fully bespoke ML workflows — as each job is a console command, you can script almost any behavior using any language, platform, or library.
The concepts and features implemented by CircleCI provide the tools to build almost any kind of ML workflow. But if there’s something it doesn’t do, it’s simply a matter of scripting that behavior yourself as a reusable command or job and adding it to your CircleCI configuration file.
Using CircleCI to monitor your ML CI/CD pipelines
Once your CircleCI configuration has been committed to your Git repository, CircleCI will execute the workflows defined in it based on the defined filters and schedules. You will be able to see the output of tasks undertaken by CircleCI in the web console.
After clicking on a job, you can see the console output it produced while running. Note that an exit code of zero means everything worked as expected, whereas a non-zero exit code usually indicates failure.
If a job fails, you can rapidly respond and confirm the issue in the CircleCI UI, then rerun only the failed parts of your workflow. Rerunning a failed job saves time, allowing you to resume from the point of failure rather than running the whole workflow again.
Jobs can also be held for approval so that the output logged to CircleCI can be inspected before the pipeline resumes. This allows you to review the results of ML build and test jobs and only progress to packaging and deployment if they meet your requirements.
As your ML requirements and workflows expand, you can offload the increasing number of ML management and monitoring tasks to scripts triggered by CircleCI pipelines. This way, your workload will be significantly reduced, automated tasks will run and complete as data arrives or on a schedule, and you’ll only have to take action when there’s a problem. Automation significantly reduces the amount of time your team spends operating and monitoring your ML systems, freeing you to spend more time building and less time on administrative overheads.
Customizing notifications
By default, notifications about job failures and required approvals are sent to your default CircleCI email address. Additionally, you can configure other notification behaviors, including adding other team members to receive notifications on your pipelines, setting up web notifications, and connecting your CircleCI pipeline to Slack or IRC.
By customizing your notifications, you can make sure that the right person is notified to fix a failed job and that your ML system stays accurate and available.
Responding to problems in production
Once your model is deployed, monitoring and logging will be handled by your ML platform. You can see how this is configured in TensorFlow in this guide.
Using the CircleCI API, you can configure your production monitoring tools to trigger CircleCI pipelines to run so that you can roll back, retrain, or redeploy models to rapidly respond to incidents.
Using cloud-hosted GPU classes and leveraging on-site processing power
Retraining ML models is a compute-intensive task. Leveraging GPU processing power will greatly speed up the process, allowing you to retrain faster or train and test multiple datasets in parallel. CircleCI’s GPU classes make short work of ML tasks and can be added to your CircleCI configuration for instant access to cloud GPU resources without having to set up and maintain any local compute resources.
By combining local runners with cloud GPU resources and CircleCI’s workspace functionality, you can access and prepare your ML training data on-site and then use it in the cloud without having to set up complex infrastructure for granting cloud resources access to your internal data stores. Conversely, if you are concerned about your cloud-compute costs, you can move data from the cloud on-site and execute your ML tasks on local GPUs using CircleCI’s self-hosted runners.
Automated machine learning with CI/CD keeps your data fresh
Organizations slow to adopt ML best practices, especially in CI/CD, are not fully leveraging their data investments. The code behind your ML models must be regularly updated and integrated using CI best practices. Models in production need to be constantly retrained on the freshest high-quality data available to prevent their outcomes from becoming stale.
Using CI/CD tools to automate your ML workflows using MLOps strategies results in a streamlined process in which data is quickly consumed and validated, models are constantly fed fresh information to learn from, and problems are quickly resolved.
CircleCI is the leading CI/CD platform for managing and automating ML workflows. It takes minutes to set up, and you can evaluate it at no cost, with up to 6,000 minutes of free usage per month.