There are many benefits of incorporating CI/CD into your ML pipeline, such as automating the deployment of ML models to production at scale.

The focus of this article is to illustrate how to integrate AWS SageMaker model training and deployment into CircleCI CI/CD pipelines. The structure of this project is a monorepo containing multiple models. The monorepo approach has advantages over the polyrepo approach, including simplified dependency versioning and security vulnerability management.

You can find the code for this tutorial in this GitHub repository.

What is CI/CD?

CI/CD stands for Continuous Integration/Continuous Delivery. Its goal is to maximize developer efficiency by automating the process of shipping code from commit to production. Applied to an ML pipeline, it helps data scientists focus on working with data and building models, rather than on putting models into production and deployment infrastructure. The value of CI/CD becomes more apparent as the system’s complexity increases. A team with limited resources managing multiple models being served to various parts of the organization can save a lot of time using CI/CD automation practices.

Environment variables

The first step is to set up AWS credentials in your project on CircleCI. Go to your project’s settings, click Environment Variables, then click the Add Variable button to enter a name and value for the new environment variable. Later on, you can read these environment variables in your Python scripts using os.environ.

In this sample project, we’ve stored AWS access keys and a SageMaker execution role ARN as environment variables. Note that boto3 automatically picks up the environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY when a boto3 session is created, so don’t rename them to anything else.
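As a rough sketch of what that looks like in practice (the region handling and variable names here are illustrative, not the tutorial’s exact code), a session and the clients used later might be created like this:

import os
import boto3
import sagemaker

# boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment,
# so no credentials need to be passed explicitly
boto_session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"))
s3_client = boto_session.client("s3")
sagemaker_client = boto_session.client("sagemaker")

# SageMaker SDK session built on top of the boto3 session
sagemaker_session = sagemaker.Session(boto_session=boto_session)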

Secrets saved on CircleCI

We store environment variables specific to a CI/CD job by declaring them with the environment key in the config file. This is not strictly necessary for this tutorial, but it may be helpful to know.

environment:
  MODEL_NAME: abalone-model
  MODEL_DESC: abalone model description text
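
In context, the environment key sits under a job definition. A minimal sketch (the job name, Docker image, and step shown here are placeholders, not the project’s actual config):

jobs:
  abalone-model-train:
    docker:
      - image: cimg/python:3.10
    environment:
      MODEL_NAME: abalone-model
      MODEL_DESC: abalone model description text
    steps:
      - checkout
      - run: python abalone_model/train_register.py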

In our Python scripts, we retrieve those environment variables:

model_name = os.environ["MODEL_NAME"]
model_description = os.environ["MODEL_DESC"]
role_arn = os.environ["SAGEMAKER_EXECUTION_ROLE_ARN"]

Models

For this tutorial, we’ve taken two models commonly found in AWS documentation, Abalone and Churn.

The Churn dataset is a synthetic dataset from a telecommunications mobile phone carrier, and a description of the dataset can be found in the SageMaker documentation. We will attempt to predict the “Churn?” variable, which takes a binary true/false value. Because the target variable is binary, this is a binary classification problem, and we will use the XGBoost framework to create a binary classifier model.

The Abalone dataset originates from the UCI data repository, and more details about the original dataset can be found here. We will use physical measurements from the dataset to predict the age of the abalone. Since age can take on a range of numerical values, we will perform regression to make our predictions. Once again, we use the XGBoost framework, but this time we create a regressor model.

Each model is contained in its own folder, and each folder contains three files:

  1. gather_data.py
  2. train_register.py
  3. deploy.py

gather_data.py

This file downloads and preprocesses the data for its model, then uploads the data to S3. We upload the train and validation datasets in separate folders, as is required by SageMaker.
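
The preprocessing details vary by model; as an illustration of just the split step (the file name and 80/20 ratio below are assumptions, not the tutorial’s exact code), train_data and validation_data might be produced like this:

import pandas as pd

# Shuffle the preprocessed data and hold out 20% for validation
data = pd.read_csv("abalone_preprocessed.csv")
data = data.sample(frac=1, random_state=42)
split_index = int(len(data) * 0.8)
train_data = data.iloc[:split_index]
validation_data = data.iloc[split_index:]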

# Upload training and validation data to S3
csv_buffer = io.BytesIO()
train_data.to_csv(csv_buffer, index=False)
s3_client.put_object(Bucket=bucket, Body=csv_buffer.getvalue(), Key=f"{model_name}/train/train.csv")

csv_buffer = io.BytesIO()
validation_data.to_csv(csv_buffer, index=False)
s3_client.put_object(Bucket=bucket, Body=csv_buffer.getvalue(), Key=f"{model_name}/validation/validation.csv")

train_register.py

This file trains our model and registers it with the model registry. It is the first place in our CI/CD pipeline where we actually make use of SageMaker.

When configuring the SageMaker XGBoost Estimator, we can specify the S3 path for the model artifacts using output_path. Because we are executing this code outside of SageMaker notebooks, the Estimator cannot discover that information on its own, so we must explicitly provide it with a SageMaker session and a SageMaker execution role ARN.
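
The image_uri used below points at SageMaker’s built-in XGBoost container. As a hedged sketch of how it might be retrieved (the framework version shown is an assumption):

from sagemaker import image_uris

# Resolve the URI of the managed XGBoost training container in the current region
image_uri = image_uris.retrieve(
    framework="xgboost",
    region=sagemaker_session.boto_region_name,
    version="1.5-1",
)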

# Configure training estimator
xgb_estimator = Estimator(
    base_job_name = model_name,
    image_uri = image_uri,
    instance_type = "ml.m5.large",
    instance_count = 1,
    output_path = model_location,
    sagemaker_session = sagemaker_session,
    role = role_arn,
    hyperparameters = {
        "objective": "reg:linear",
        "max_depth": 5,
        "eta": 0.2,
        "gamma": 4,
        "min_child_weight": 6,
        "subsample": 0.7,
        "verbosity": 2,
        "num_round": 50,
    }
)
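
With the estimator configured, training is kicked off with a fit call against the S3 prefixes that gather_data.py wrote. A minimal sketch, assuming the same bucket and model_name variables as the upload step (the channel construction here is illustrative):

from sagemaker.inputs import TrainingInput

# Training channels matching the train/ and validation/ prefixes uploaded earlier
train_input = TrainingInput(
    s3_data=f"s3://{bucket}/{model_name}/train/",
    content_type="text/csv",
)
validation_input = TrainingInput(
    s3_data=f"s3://{bucket}/{model_name}/validation/",
    content_type="text/csv",
)

xgb_estimator.fit({"train": train_input, "validation": validation_input})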

After training the model, we push the model package to the SageMaker model registry. SageMaker differentiates between a model and a model package. A model is just the object that we deploy to an endpoint to run inference. A model package contains all the artifacts associated with that model, such as model weights, evaluation results, and configuration files. It is model packages, not models, that we push to the model registry.

We want to make use of a model registry so that we can easily refer to trained models in subsequent steps, such as deployment, or when we want to roll back models to previous versions. Notice that we pre-approve the model package so that we can use CircleCI approval jobs to manage model approval.

# Retrieve model artifacts from training job
model_artifacts = xgb_estimator.model_data

# Create pre-approved cross-account model package
create_model_package_input_dict = {
    "ModelPackageGroupName": model_name,
    "ModelPackageDescription": "",
    "ModelApprovalStatus": "Approved",
    "InferenceSpecification": {
        "Containers": [
            {
                "Image": image_uri,
                "ModelDataUrl": model_artifacts
            }
        ],
        "SupportedContentTypes": [ "text/csv" ],
        "SupportedResponseMIMETypes": [ "text/csv" ]
    }
}

create_model_package_response = sagemaker_client.create_model_package(**create_model_package_input_dict)

deploy.py

This file deploys the latest approved model package to the model endpoint, either by creating the endpoint if it does not already exist or updating an existing endpoint.

To get the latest approved model package, we call the SageMaker list_model_packages API, sorting approved packages by descending creation time, and retrieve the ARN of the most recent one:

# Get the latest approved model package of the model group in question
model_package_arn = sagemaker_client.list_model_packages(
    ModelPackageGroupName = model_name,
    ModelApprovalStatus = "Approved",
    SortBy = "CreationTime",
    SortOrder = "Descending"
)['ModelPackageSummaryList'][0]['ModelPackageArn']

Then we create a model out of the model package:

# Create the model
timed_model_name = f"{model_name}-{current_time}"
container_list = [{"ModelPackageName": model_package_arn}]

create_model_response = sagemaker_client.create_model(
    ModelName = timed_model_name,
    ExecutionRoleArn = role_arn,
    Containers = container_list
)

And create an endpoint config using that model:

# Create endpoint config
create_endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName = timed_model_name,
    ProductionVariants = [
        {
            "InstanceType": endpoint_instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": endpoint_instance_count,
            "ModelName": timed_model_name,
            "VariantName": "AllTraffic",
        }
    ]
)

Finally, we update the endpoint with the new config:

create_update_endpoint_response = sagemaker_client.update_endpoint(
    EndpointName = model_name,
    EndpointConfigName = timed_model_name
)
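
On the very first deployment there is no endpoint to update, so the script has to create one instead. A hedged sketch of that create-or-update branch (the existence check shown here is an assumption, not necessarily the tutorial’s exact logic):

from botocore.exceptions import ClientError

# Check whether the endpoint already exists, then create or update accordingly
try:
    sagemaker_client.describe_endpoint(EndpointName=model_name)
    endpoint_exists = True
except ClientError:
    endpoint_exists = False

if endpoint_exists:
    sagemaker_client.update_endpoint(
        EndpointName=model_name,
        EndpointConfigName=timed_model_name,
    )
else:
    sagemaker_client.create_endpoint(
        EndpointName=model_name,
        EndpointConfigName=timed_model_name,
    )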

Dynamic configuration

We’ve taken a monorepo approach, so we need a way to run our CI/CD only for the model that has been changed. Otherwise, when we merge changes to the Abalone model, the Churn model will also retrain and redeploy. This is where CircleCI’s dynamic configurations come in handy. Dynamic config allows us to detect whether changes have been made to a particular folder, and if so, set the value of a pipeline parameter. In turn, the pipeline parameter will determine which workflows will run in our CI/CD pipeline.

There are a few steps required to enable dynamic configs in your project. A step-by-step guide is available in the documentation.

Setup configuration

The first step in making use of dynamic configs is the setup config. In our example repository, it is named config.yml. We use the path-filtering orb to identify which folders contain code changes.

Note that we compare files to those on the main branch and map changes in specific folders to parameter values. For example, if there are changes detected in the abalone_model folder, then the pipeline parameter deploy-abalone will be set to true. We also specify the path of the configuration file to trigger once path filtering and pipeline parameter value updates are complete.

base-revision: main
mapping: |
  abalone_model/.* deploy-abalone true
  churn_model/.* deploy-churn true
config-path: ".circleci/dynamic_config.yml"
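
Putting those pieces together, the setup config is roughly shaped like the following (the orb version and workflow name here are illustrative):

version: 2.1
setup: true

orbs:
  path-filtering: circleci/path-filtering@1.0.0

workflows:
  generate-config:
    jobs:
      - path-filtering/filter:
          base-revision: main
          mapping: |
            abalone_model/.* deploy-abalone true
            churn_model/.* deploy-churn true
          config-path: ".circleci/dynamic_config.yml"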

Continue configuration

With the pipeline parameter values updated from the setup config, we now run the continue config, which in our example repository is named dynamic_config.yml. To make it easier to understand what the config file is doing, let’s focus on the abalone-model workflow.

workflows:
  abalone-model:
    when: << pipeline.parameters.deploy-abalone >>
    jobs:
      - abalone-model-train:
          filters:
            branches:
              ignore:
                - main
      - request-deployment:
          type: approval
          filters:
            branches:
              ignore:
                - main
          requires:
            - abalone-model-train
      - abalone-model-deploy:
          filters:
            branches:
              only:
                - main

This workflow will run only when the pipeline parameter deploy-abalone is true. Next, we run the job abalone-model-train, which executes the train_register.py file.

Then we trigger request-deployment, which is an approval job that requires the user to manually approve on CircleCI before the workflow can proceed. This is the point when a reviewer checks the model evaluation metrics on SageMaker before allowing the model to be deployed to the endpoint. If approval is given, the abalone-model-deploy job executes deploy.py.

Note that the training and approval jobs ignore the main branch, whereas the deploy job happens only on the main branch. This allows new model versions to be trained when the developer is working on updates to the model on a developer branch without triggering any sort of deployment. Then, once the code changes are accepted and merged into main, the deployment job gets triggered without triggering any further retraining of the model.
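
One detail worth calling out: for the when clause above to evaluate, the continue config also has to declare the pipeline parameters it references. A minimal sketch of that declaration at the top of dynamic_config.yml:

version: 2.1

parameters:
  deploy-abalone:
    type: boolean
    default: false
  deploy-churn:
    type: boolean
    default: false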

Pipelines on CircleCI

When code changes are pushed to the Abalone model on a developer branch, dynamic configuration selectively runs only the abalone training pipeline on CircleCI. The request-deployment approval job “gatekeeps” the code changes from being merged into main. Once it is approved, the PR on GitHub can be merged.

Only the training pipeline is run when on a developer branch

Once code changes are merged to the main branch, dynamic configuration selectively runs only the abalone deployment pipeline.

Only the deployment pipeline is run when on the main branch

Conclusion

In this tutorial, we’ve demonstrated how to use CircleCI together with AWS SageMaker to create an end-to-end machine learning pipeline that automates training a model and deploying it to an endpoint for real-time inference.

The tutorial project uses a monorepo setup where each model is contained in its own folder. The project also uses CircleCI’s dynamic configs to adapt each pipeline to the model that experiences code changes.

The code in this tutorial is publicly available in this GitHub repository.

CircleCI can help you and your team reach your machine learning goals faster. Explore the many tools CircleCI offers to support your ML model training.

To get updates on CircleCI’s AI roadmap and early access to new features that accelerate your AI and ML projects, sign up for the waitlist.