Amid an AI boom and developing research, machine learning (ML) models such as OpenAI’s ChatGPT and Midjourney’s generative text-to-image model have radically shifted the natural language processing (NLP) and image processing landscape. With this new and powerful technology, developing and deploying ML models has quickly become the new frontier for software development.
The shift toward an ML-powered technology stack introduces new challenges for developing and deploying performant, cost-effective software. These challenges include managing compute resources, testing and monitoring, and enabling automated deployment.
Modern software development has embraced continuous integration and continuous deployment (CI/CD) to solve similar difficulties with traditional technology stacks. And while CI/CD manages the complexity of ML-powered solutions effectively, accommodating the new ML domain requires adjusting traditional approaches.
Understanding the challenges of ML model development
This article identifies seven key challenges of developing and deploying ML models and how to overcome them with CI/CD. You will explore how CircleCI’s comprehensive platform can jumpstart your ML solutions and prepare them for production.
- Scalability and compute resource management
- Reproducibility and environment consistency
- Testing and validation
- Security and compliance
- Deployment automation
- Monitoring and performance analysis
- Continuous training
Challenge 1: Scalability and compute resource management
One of the main challenges that ML developers face is the intensive compute required to build and train large-scale ML models. Indeed, training large language models (LLMs) like those behind ChatGPT typically consumes billions of input words and costs millions of dollars in computational resources.
Because of the scale needed to train and develop these models, analysts have proposed cloud computing to meet the computational demand. However, using GPU or CPU resources from popular cloud services — such as Amazon Web Services (AWS) and Google Cloud Platform (GCP) — for extended training tasks is costly. Moreover, cloud providers’ “unlimited” scaling offerings can lead to runaway resource usage and associated costs.
CircleCI’s features enable scaling while controlling and monitoring costs. For training models in the cloud, CircleCI offers several tiers of GPU resource classes with transparent pricing models. Alternatively, self-hosted runners enable CI/CD jobs to run on a private cloud or on-premises for more flexibility. With this extra versatility, you can configure self-hosted runners to scale automatically or execute jobs concurrently.
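To make the cost-control idea concrete, here is a minimal sketch of a pre-flight budget check that a pipeline step could run before starting a long training job. The `TrainingJob` type, the hourly rate, and the budget figures are hypothetical values chosen for illustration; they are not CircleCI types or prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingJob:
    """Hypothetical description of a training job (not a CircleCI type)."""
    gpu_count: int
    hourly_rate_usd: float  # assumed price per GPU-hour, illustration only
    max_hours: float        # hard cap on how long the job may run


def projected_cost(job: TrainingJob) -> float:
    """Worst-case cost if the job runs all the way to its time cap."""
    return job.gpu_count * job.hourly_rate_usd * job.max_hours


def within_budget(job: TrainingJob, budget_usd: float) -> bool:
    """Return True only if the worst-case cost fits inside the budget."""
    return projected_cost(job) <= budget_usd


job = TrainingJob(gpu_count=4, hourly_rate_usd=2.5, max_hours=12)
print(projected_cost(job))      # 120.0
print(within_budget(job, 100))  # False
```

Checking the worst case up front, rather than the expected case, is a deliberate choice here: it guards against the runaway usage described above even when a job runs until its time limit.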
Challenge 2: Reproducibility and environment consistency
Another critical aspect of managing ML model deployment is maintaining consistency and reproducibility in the build environment. These properties prevent unexpected errors when restarting CI/CD jobs or migrating from one build platform to another. Consequently, you can avoid costly build errors in ML model development, which often features long-running jobs that are difficult to interrupt.
Fortunately, you can use containerization to isolate deployment jobs from the surrounding environment to ensure consistency. Meanwhile, deployment using infrastructure as code (IaC) helps improve the build system’s reproducibility by explicitly defining the environment details and resources required to execute a task. As a result, the build is less dependent on platform-specific settings — you can reproduce and audit it easily.
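One lightweight way to check that two build environments really match is to fingerprint them. The sketch below is an assumption about how you might do this, not a CircleCI feature: it hashes the Python version together with a list of pinned dependencies, and it uses a seeded random generator to show that a fixed seed makes a randomized step repeatable. The function names and the example package pins are hypothetical.

```python
import hashlib
import platform
import random


def environment_fingerprint(pinned_requirements: list[str]) -> str:
    """Hash the interpreter version and pinned dependencies into one ID."""
    # Sorting makes the fingerprint independent of the order of the pins.
    parts = [platform.python_version()] + sorted(pinned_requirements)
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def seeded_sample(seed: int, n: int = 3) -> list[float]:
    """Draw n pseudo-random numbers from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Hypothetical pins; the same set in a different order gives the same ID.
a = environment_fingerprint(["numpy==1.26.4", "scikit-learn==1.4.2"])
b = environment_fingerprint(["scikit-learn==1.4.2", "numpy==1.26.4"])
print(a == b)                                # True
print(seeded_sample(7) == seeded_sample(7))  # True
```

Storing the fingerprint alongside each trained model gives you a quick way to audit which environment produced it.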
Challenge 3: Testing and validation
Testing is crucial in developing any software project, and especially for ML-powered programs. Because of their complexity and the way they are trained, the inner workings of ML models tend to be opaque, making it nearly impossible to determine a model’s correctness by inspection. Therefore, comprehensive testing is essential for verifying that the software behaves as intended.
The CircleCI platform excels at integrating testing into the development process. Support for automated testing makes it easy to ensure code performs as expected before it goes to production. You can customize tests on the CircleCI platform using orbs, which are reusable configuration packages that include many third-party integrations. You can then troubleshoot failing jobs via SSH debugging and track test results in the Insights dashboard.
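Because you cannot verify a model by reading it, automated ML tests usually assert on measured behavior instead. The following is a minimal sketch of such a test, assuming a held-out evaluation set and an accuracy floor agreed on by the team. The `threshold_model` function is a stand-in for a real trained model, and the data and the 0.75 floor are made up for illustration.

```python
from typing import Callable, Sequence

ACCURACY_FLOOR = 0.75  # assumed minimum acceptable accuracy


def accuracy(
    predict: Callable[[float], int],
    inputs: Sequence[float],
    labels: Sequence[int],
) -> float:
    """Fraction of inputs for which the prediction matches the label."""
    correct = sum(1 for x, y in zip(inputs, labels) if predict(x) == y)
    return correct / len(labels)


def threshold_model(x: float) -> int:
    """Stand-in for a trained model: predicts 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0


def test_model_meets_accuracy_floor() -> None:
    """Fail the build if accuracy on held-out data drops below the floor."""
    inputs = [0.1, 0.4, 0.6, 0.9, 0.45]
    labels = [0, 0, 1, 1, 1]
    assert accuracy(threshold_model, inputs, labels) >= ACCURACY_FLOOR


test_model_meets_accuracy_floor()
```

A test runner such as pytest would collect the `test_` function automatically; the direct call at the end only makes the sketch runnable as a plain script.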
Learn how you can automate testing and training your machine learning models with CI for machine learning: Build, test, train.
Challenge 4: Security and compliance
Development teams must ensure that software is secure and compliant with consumer protection laws. This is particularly relevant for ML development, which often involves processing large amounts of user data during training. A vulnerability in the data pipeline or failure to sanitize the data could allow attackers to access sensitive user information. Therefore, security is a principal consideration at each stage of ML model development and deployment.
CircleCI provides several CI/CD features to improve the security and compliance of your application. You can control access to the pipeline using a role-based credential system with OpenID Connect (OIDC) authentication tokens, enabling fine-grained management of user access to each step within the pipeline. Additionally, CircleCI logs important security events and stores them in audit logs, which you can review later to understand the system’s security better.
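Sanitizing training data is one step you can automate inside the pipeline itself. As a rough sketch, assuming free-text records that may contain email addresses, the code below replaces anything that looks like an email with a placeholder before the data reaches training. The pattern is deliberately simple and will not catch every valid address format; the function names are hypothetical.

```python
import re

# Simplified email pattern, for illustration; not a full RFC 5322 matcher.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace every email-like substring with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


def sanitize_records(records: list[str]) -> list[str]:
    """Apply redaction to each record before it enters the training set."""
    return [redact_emails(record) for record in records]


print(redact_emails("Contact jane.doe@example.com for access"))
# Contact [EMAIL] for access
```

Real pipelines typically need to handle more kinds of sensitive data, such as names, phone numbers, and account identifiers, so treat this as a starting point rather than a complete safeguard.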
Challenge 5: Deployment automation
New versions of ML models are often developed rapidly, especially during periods of heightened interest in AI. This makes it challenging to manage frequent updates to ML systems with several versions in development or production. To ensure a consistent user experience, you need an easy way to push new updates to production and determine which versions are currently in use.
Fortunately, you can deploy code to AWS, GCP, or any other targeted platform continuously and automatically via CircleCI orbs. Moreover, these deployments are configurable through IaC to ensure process clarity and reproducibility. Users can add a manual approval gate at any point in the deployment pipeline to check that it proceeds successfully.
Challenge 6: Monitoring and performance analysis
After deploying an ML model, you must set up production monitoring and performance analysis software. Due to the size and complexity of modern ML models such as LLMs, even a comprehensive test suite may fail to ensure their validity. The only way to determine that a model is performing as expected is to observe its real-world performance by collecting and aggregating metrics from the production environment.
Using the CircleCI platform, it is easy to integrate monitoring into the post-deployment process. The CircleCI orb platform offers options to incorporate monitoring and data analysis tools like Datadog, New Relic, and Splunk into the CI/CD pipeline. You can configure these integrations to capture and analyze metrics on the performance and behavior of production-phase ML models.
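To illustrate the kind of aggregation a monitoring tool performs, here is a minimal sketch of a rolling accuracy metric over the most recent production predictions. The `RollingAccuracy` class, the window size, and the alert threshold are hypothetical; a real setup would forward such metrics to the monitoring platform of your choice rather than compute alerts in process.

```python
from collections import deque


class RollingAccuracy:
    """Tracks accuracy over the most recent `window_size` predictions."""

    def __init__(self, window_size: int, alert_below: float) -> None:
        self._outcomes: deque = deque(maxlen=window_size)
        self._alert_below = alert_below

    def record(self, was_correct: bool) -> None:
        """Store one outcome; the oldest one drops off when the window is full."""
        self._outcomes.append(was_correct)

    def value(self) -> float:
        """Accuracy over the current window."""
        if not self._outcomes:
            raise ValueError("no outcomes recorded yet")
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self) -> bool:
        """Alert only once the window is full, to avoid noisy early readings."""
        window_full = len(self._outcomes) == self._outcomes.maxlen
        return window_full and self.value() < self._alert_below


monitor = RollingAccuracy(window_size=4, alert_below=0.75)
for outcome in [True, True, False, False, False]:
    monitor.record(outcome)

print(monitor.value())         # 0.25
print(monitor.should_alert())  # True
```

Using a fixed-size window means the metric reflects recent behavior, so a model that degrades after months of good performance still triggers an alert.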
Challenge 7: Continuous training
During periods of intense AI investment and expansion, new research, datasets, and improved models emerge daily. Therefore, production ML models must adapt by incorporating new features and learning from new data.
As previously highlighted, CircleCI’s support for third-party CI/CD observability platforms means you can add and monitor new features within CircleCI. And because new training data is generated continuously, you can periodically feed it to the model using scheduled pipelines. This feature enables you to schedule events that trigger further training and deployment pipelines, allowing the production ML model to grow and update continuously.
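The decision of when to retrain can be reduced to a small rule that a scheduled job evaluates. The sketch below assumes two triggers: the model has exceeded a maximum age, or its measured accuracy has fallen below a floor. The `should_retrain` function, the seven-day age limit, and the 0.9 floor are hypothetical values for illustration.

```python
from datetime import datetime, timedelta


def should_retrain(
    last_trained: datetime,
    now: datetime,
    max_age: timedelta,
    current_accuracy: float,
    accuracy_floor: float,
) -> bool:
    """Retrain when the model is too old or its accuracy has degraded."""
    too_old = now - last_trained >= max_age
    degraded = current_accuracy < accuracy_floor
    return too_old or degraded


last_trained = datetime(2024, 1, 1)
now = datetime(2024, 1, 5)

# Four days old and still accurate: no retraining needed.
print(should_retrain(last_trained, now, timedelta(days=7), 0.95, 0.9))  # False

# Same age, but accuracy has dropped below the floor.
print(should_retrain(last_trained, now, timedelta(days=7), 0.85, 0.9))  # True
```

Passing `now` as an argument instead of reading the clock inside the function keeps the rule deterministic and easy to test in CI.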
Learn how you can retrain your machine learning models on a schedule or in response to performance metrics with CD for machine learning: Deploy, monitor, retrain.
ML models present unique challenges for engineering teams throughout the development process. Development involves several complex tasks: managing compute resources, maintaining consistency in the build environment, integrating automated testing, and ensuring automation and security. Finally, after deploying a model, you must add monitoring, performance analysis, and continuous integration of new training data to ensure the model works as expected and improves over time.
Using a CI/CD pipeline helps address these challenges in each phase of the development and deployment processes to make your ML models faster, safer, and more reliable.
To get updates on CircleCI’s AI roadmap and early access to new features that accelerate your AI and ML projects, sign up for the waitlist.