Vulnerabilities are continually discovered in software packages, software libraries, operating systems, and infrastructure. Vulnerability management is the ongoing process of scanning, classifying, prioritizing, and patching software vulnerabilities. All modern technical stacks require this cyclical maintenance and updating in order to stay stable and secure.
This is one of the core facets of security compliance. In the past few years CircleCI has gone through FedRAMP certification and SOC 2 Type II compliance, both of which paired us with auditors who needed the details of our vulnerability management processes.
While vulnerability management is a foundational security practice that is absolutely essential, it is also generally tedious and repetitive. This makes it a prime candidate for automation and optimization. Building processes stringent enough to satisfy auditors without accruing massive overhead for our team and our engineering organization posed an interesting challenge.
On a more personal note, working through this challenge is when the DevSecOps benefits of Docker finally clicked for me.
In this post, I’ll walk you through the vulnerability management process we developed, built to satisfy auditors and align with our team’s use of CI/CD, Docker, and Kubernetes.
Patching with a Docker tech stack
A little background: CircleCI’s stack is based on Docker. There are already many articles discussing how this differs from traditional server-based architectures, and how that impacts DevOps and development. It also impacts security, and in particular for this post, vulnerability management.
Docker images are immutable and contain their own infrastructure packages and libraries, which means you can’t “patch” one in the traditional sense of “applying a patch to a server”. Instead, a new image with updated dependencies is created and deployed and old ones are decommissioned.
The first corollary of this is that patching will cause more deployments that image owners and SRE teams need to be aware of and be able to handle. We had the benefit of using our existing and stable pipelines for continuous deployment, which meant that additional deployments had minimal impact on our operations.
Second, Docker enables spreading out the maintenance cost of patching across more of the engineering organization. In traditional server-based environments admins or SREs maintain servers and deploy patches. At CircleCI, development teams are responsible for maintaining their Docker images, and are therefore also responsible for applying security patches to them. Some development teams may need to acclimate themselves to doing security maintenance work related to the Docker images they use.
Third, many compliance professionals and auditors aren’t deeply familiar with how this technical stack works and its ramifications for vulnerability management. It is important to set aside time to work with them on understanding how these tools affect the patching process.
Sketching a plan
Installing a vulnerability scanner doesn’t secure anything on its own. As with all security tools, in order to get any benefits, the tools need to be applied to our systems.
When we started this process, we knew our plan would look something like this:
- Get an idea of how bad things actually were and get some high priority patches in-flight.
- Start providing a tighter feedback loop for teams that own the Docker images.
- Revisit the tooling and build automation around ticket generation, and then around reporting.
- Find a way to help teams optimize and tune their patching processes.
A central principle that would inform this entire effort was delivering feedback to engineering teams through the tools and interfaces they already use every day, which is far more effective than distraction-based channels like Slack, reminders, emails, meetings, and reports.
The universal starting point: spreadsheets
The first thing we needed to do was to get a handle on the scale and scope of the project and be able to sort and filter the raw data to understand our situation.
So we started in the same place many projects start: spreadsheets.
We ran .csv exports into spreadsheets, added formulas for summaries, and wrote some simple Apps Script code to do a bit of data cleanup and reconciliation. We were looking to understand the different images and containers that were currently running in our live system, what vulnerabilities those had, and what patching cadence existed. We wanted to see just what was already there before making any tickets or talking to any development teams. Which services had official owners, and which would we have to go find owners for? What was the shape of the data from the tooling? It was important for us to look at this data manually in a spreadsheet first, so that when we automated it, we’d understand the data coming out of our scanner.
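The same first-pass questions can also be asked of a scanner export in a few lines of code. The sketch below is only an illustration in Python: the column names (`image`, `owner`, `cve`, `severity`) and the sample rows are assumptions, not the export format of any particular scanner.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical export: column names and rows are made up for illustration.
SAMPLE_EXPORT = """image,owner,cve,severity
api-service:1.4,team-api,CVE-EXAMPLE-1,critical
api-service:1.4,team-api,CVE-EXAMPLE-2,Medium
legacy-worker:0.9,,CVE-EXAMPLE-3,high
"""


def summarize(export_text):
    """Count findings per image and severity, and list images with no owner."""
    counts = defaultdict(Counter)
    unowned = set()
    for row in csv.DictReader(io.StringIO(export_text)):
        counts[row["image"]][row["severity"].lower()] += 1
        if not row["owner"].strip():
            unowned.add(row["image"])
    return counts, sorted(unowned)


counts, unowned = summarize(SAMPLE_EXPORT)
for image, by_severity in sorted(counts.items()):
    print(image, dict(by_severity))
print("images without an owner:", unowned)
```

Even a summary this small surfaces the two things the spreadsheet pass was meant to find: where the critical findings are, and which images nobody officially owns.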
This step gave us an initial view into some of the inconsistencies and complexities in the data. Another important finding was the number of images and containers that didn’t have any official team owners listed.
Then it was time to assign tickets, focusing on images with critical vulnerabilities first. We generated the first round of tickets by hand and assigned them to teams. Because some of the images didn’t have owners listed, we made a few educated guesses and then watched whether those tickets were moved to other teams.
A number of teams that received these tickets weren’t used to getting tickets from other teams, much less from the security team. We got some pushback at first: development teams hadn’t previously been tasked with Docker image maintenance, and they had concerns about how much time it might take. We worked with these teams to help get initial patches deployed, which gave us an opportunity to discuss the importance of patching and how it was going to become a more regular thing.
Another set of conversations that happened at this stage was with our auditors. After rolling out the first draft of our policy, we realized our patching timeframes were far too aggressive (it turns out the auditors had thought this but didn’t push back). Development teams couldn’t keep up. We realized it was better to have a reasonable policy and hit it than to be too aggressive and constantly make exceptions. We worked with our auditors to update the policy to a realistic level. This was an important lesson for us about the level of maturity our processes were at. We could tighten up timeframes later, once we had repeatable processes.
One thing we learned at this stage was to avoid perfection. Do not block yourself if a specific team is deep underwater, a ticket gets kicked back and forth between a few teams, or a specific patch has technical problems. Make the progress you can with the patches and teams that you are able to.
Divide and automate the workload with CI/CD
We made the initial progress that we could and moved on to the next phase: building vulnerability scanning into our CI/CD pipelines. This is the single biggest change that resulted in services getting patched regularly. This is also where, as a developer, the benefits of Docker really came into focus for me.
In traditional server-based infrastructure, the testing environment is separate from the production environment and those require independent patching and much more task-based overhead to discover whether any particular patches cause problems.
Because Docker images contain their software dependencies in a clearly defined unit of deployment, it becomes straightforward to scan those in CI/CD pipelines. With automated testing that includes the software dependencies, patches can be quickly tried and validated using existing tests.
By integrating it into the CI/CD pipelines, vulnerability patching wouldn’t have to be a big thing we did every month. It would just be baked into the way we ship software.
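To make the idea of a pipeline gate concrete, here is a minimal Python sketch of the decision such a step might make. Everything in it is an assumption for illustration: the shape of the findings, the choice to block only on critical and high findings that already have a fix, and the `exempt_ids` mechanism for temporary exceptions. It is not the output format or behavior of any specific scanner.

```python
# Severities that fail the build; an assumed policy, not a recommendation.
BLOCKING = {"critical", "high"}


def blocking_findings(findings, exempt_ids=frozenset()):
    """Return the findings that should fail the build.

    Assumption: only findings with a known fixed version block the build,
    because a team cannot patch its way back to green otherwise.
    """
    return [
        f for f in findings
        if f["severity"].lower() in BLOCKING
        and f.get("fixed_in")
        and f["id"] not in exempt_ids
    ]


# Hypothetical scan results for one image.
findings = [
    {"id": "VULN-EXAMPLE-1", "package": "libfoo", "severity": "Critical", "fixed_in": "2.4.1"},
    {"id": "VULN-EXAMPLE-2", "package": "bar", "severity": "low", "fixed_in": "1.0.3"},
    {"id": "VULN-EXAMPLE-3", "package": "baz", "severity": "high", "fixed_in": "0.9.9"},
    {"id": "VULN-EXAMPLE-4", "package": "qux", "severity": "high", "fixed_in": None},
]

failures = blocking_findings(findings, exempt_ids={"VULN-EXAMPLE-3"})
for f in failures:
    print(f"{f['id']}: upgrade {f['package']} to {f['fixed_in']}")

# In a real pipeline step this value would be passed to sys.exit().
exit_code = 1 if failures else 0
print("exit code:", exit_code)
```

Printing the package and target version alongside the failure keeps the feedback inside the build output, where the team is already looking.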
For those planning to adopt this on your team, know that this will cause pain at first. Even if your team is great at CI/CD, any time you add a step into your build pipeline and it causes builds to break that otherwise would have passed, people will be frustrated. You’ll have to work with them to get back to green. This is okay, but be prepared for it. Any systems that haven’t been patched in a while will take notably more time to clear up, so make sure to coordinate with teams and management as you do initial rounds of patching. Roll things out to groups of services, and be prepared to grant temporary exceptions or disable enforcement for specific services for short periods while those initial rounds of patching are occurring.
The central principle at this stage is that quick feedback loops embedded in the development process help teams make it a part of ongoing activity instead of requiring large chunks of separately tracked work. This makes it much easier for teams to comply.
Automate production scanning
As game-changing as CI/CD integration is, alone it is not sufficient for vulnerability management. It does a great job of catching vulnerabilities caused by new deploys, but one of the challenges with vulnerabilities is they’re discovered all the time in existing software. A vulnerability could surface today that affects the code that passed all tests and went into production yesterday. CI/CD is only one half of the puzzle; production environment scanning is the other half.
Put another way, for efficiency in interfacing with teams, integrating with CI/CD can’t be beat. But for seeing the full picture of your vulnerabilities, production scanning is required.
Back to the spreadsheets we went …
… but not for long. Spreadsheets are great for getting an overview of your data, but if you want to start generating tickets and reporting on how your vulnerability management is going, you are going to need more analysis than fits comfortably in a spreadsheet. So we built an integration pipeline that automated the largest pieces of manual work for the security team. This meant taking data out of the production vulnerability scanner and putting it into tickets. As I emphasized before, it’s a really good idea for security teams to meet engineering teams in the tools they are already using on a daily basis, so getting our program to turn data into tickets was essential.
This wasn’t just a simple connector. The process of taking data from the vulnerability scanner and turning it into tickets involved complex details that required us to use a real programming language. I chose to implement this automation in Clojure, because that is the primary development language here at CircleCI, and it’s important when making technical decisions to keep maintenance in mind. That necessarily includes the consideration of having other people at the company that can maintain the tooling.
The integration sounds super simple at first glance: Take data out of your prod vulnerability scanner, transform it, and make tickets. Famous last words.
There were a number of challenges to keep in mind with this integration:
We needed to update existing tickets so that open tickets were not duplicated. We added a vuln-mgmt-id in Jira to track the mapping between images and tickets.
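The create-or-update logic behind that identifier can be sketched as follows. This is a Python illustration rather than the Clojure tool described here, and it makes several assumptions: `InMemoryTracker` is a hypothetical stand-in for a ticket system (it is not the Jira API), and the identifier is keyed on the image name so that a new scan refreshes the same open ticket.

```python
class InMemoryTracker:
    """Hypothetical stand-in for a ticket system; a real tool would call its API."""

    def __init__(self):
        self.tickets = {}  # ticket key -> fields
        self._next = 1

    def find_open(self, vuln_mgmt_id):
        for key, ticket in self.tickets.items():
            if ticket["vuln_mgmt_id"] == vuln_mgmt_id and ticket["status"] == "open":
                return key
        return None

    def create(self, fields):
        key = f"SEC-{self._next}"
        self._next += 1
        self.tickets[key] = {**fields, "status": "open"}
        return key

    def update(self, key, fields):
        self.tickets[key].update(fields)


def upsert_ticket(tracker, image, findings):
    """Create one ticket per image, or refresh the open one if it already exists."""
    vuln_mgmt_id = f"image:{image}"  # assumed identity scheme
    fields = {
        "vuln_mgmt_id": vuln_mgmt_id,
        "summary": f"Patch {image}",
        "findings": sorted(findings),
    }
    key = tracker.find_open(vuln_mgmt_id)
    if key is None:
        return tracker.create(fields)
    tracker.update(key, fields)
    return key


tracker = InMemoryTracker()
first = upsert_ticket(tracker, "api-service", ["VULN-EXAMPLE-1"])
second = upsert_ticket(tracker, "api-service", ["VULN-EXAMPLE-2", "VULN-EXAMPLE-1"])
print(first, second, len(tracker.tickets))  # SEC-1 SEC-1 1
```

Running the tool twice against the same image touches a single ticket, which is the property that keeps weekly runs from flooding a team’s backlog.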
Both the source and destination APIs changed over time, quite significantly. To handle this, we built abstraction layers around the APIs we integrated with, both Jira and our scanning tools.
The service had to normalize the weird data for all the different distros, libraries, and tools into something consistent. This ranged from normalizing ‘moderate’ vs ‘medium’ (easy), to every vendor having a different format for “fixed in version” (harder).
To handle this, we built a data normalization layer that canonicalized severities, deduplicated CVEs (common vulnerabilities and exposures) in sub-packages, and parsed many different string formats for fix-versions. We also normalized package information to provide clear requirements to teams about which version they should patch to inside the ticket details.
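A small Python sketch shows the flavor of that normalization layer. It is illustrative only: the alias table (beyond ‘moderate’ to ‘medium’), the sample fix-version strings, and the rule of keeping the highest-severity entry per CVE are all assumptions, and the version parser is deliberately lossy (it drops epochs and distro suffixes).

```python
import re

# 'moderate' -> 'medium' comes from the text above; 'important' -> 'high' is an assumed alias.
SEVERITY_ALIASES = {"moderate": "medium", "important": "high"}
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

_VERSION = re.compile(r"\d+(?:\.\d+)+")


def normalize_severity(raw):
    """Lowercase a severity label and map vendor aliases to one vocabulary."""
    value = raw.strip().lower()
    return SEVERITY_ALIASES.get(value, value)


def parse_fix_version(raw):
    """Pull the first dotted version number out of a free-form string."""
    match = _VERSION.search(raw or "")
    return match.group(0) if match else None


def dedupe(findings):
    """Keep one finding per CVE id, preferring the highest severity seen."""
    best = {}
    for f in findings:
        severity = normalize_severity(f["severity"])
        current = best.get(f["cve"])
        if current is None or (
            SEVERITY_RANK.get(severity, -1) > SEVERITY_RANK.get(current["severity"], -1)
        ):
            best[f["cve"]] = {
                **f,
                "severity": severity,
                "fixed_in": parse_fix_version(f.get("fixed_in")),
            }
    return list(best.values())


# Hypothetical raw findings: the same CVE reported against a package and its sub-package.
raw = [
    {"cve": "CVE-EXAMPLE-1", "package": "libfoo", "severity": "Moderate", "fixed_in": "fixed in 2.4.1"},
    {"cve": "CVE-EXAMPLE-1", "package": "libfoo-dev", "severity": "medium", "fixed_in": "0:2.4.1-3.el7"},
    {"cve": "CVE-EXAMPLE-2", "package": "bar", "severity": "HIGH", "fixed_in": ">= 3.0"},
]

for finding in dedupe(raw):
    print(finding["cve"], finding["package"], finding["severity"], finding["fixed_in"])
```

The real strings are messier than these three, which is why fix-version parsing was the harder half of the problem.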
We needed to assign tickets to the correct destination even as engineering projects and teams changed in Jira. We created a flexible team-assignment configuration, including regex and partial matching. Another key lesson in team assignment was to throw an error if the configuration matched multiple teams or zero teams. This notifies the security team if any additional configuration is needed.
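The fail-loudly rule is easy to show in a few lines. In this Python sketch the rule table, team names, and image names are hypothetical; the point is only that anything other than exactly one matching team raises an error instead of guessing.

```python
import re

# Hypothetical configuration: regex pattern -> owning team.
TEAM_RULES = [
    (r"^api-", "team-api"),
    (r"worker", "team-pipelines"),
    (r"^frontend$", "team-web"),
]


class AssignmentError(Exception):
    """Raised when an image maps to zero teams or to more than one team."""


def assign_team(image, rules=TEAM_RULES):
    """Return the single team whose rules match the image, or raise."""
    teams = sorted({team for pattern, team in rules if re.search(pattern, image)})
    if len(teams) != 1:
        raise AssignmentError(f"{image!r} matched {len(teams)} teams: {teams}")
    return teams[0]


for image in ["api-gateway", "frontend", "api-worker", "unknown-image"]:
    try:
        print(image, "->", assign_team(image))
    except AssignmentError as err:
        print("needs configuration:", err)
```

Collecting teams into a set means several patterns may point at the same team without triggering the error; only a genuine conflict or a gap does.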
This took quite a bit of work, but resulted in a simple command line application that could read the latest information, generate detailed tickets, and assign them to teams in a matter of minutes.
The key principle at this stage was to invest in automation that eliminates repetitive and error-prone work from the security team. Things like the flexible team configuration save time every week, as small changes to team assignment or unknown containers happen fairly often.
Reporting on production
A key thing to remember about security work is that it isn’t enough to do the right thing; you have to be able to show that you are doing the right thing. The goal in reporting is to create a single report covering everyone’s (executives, team managers, auditors) needs. This not only saves work but helps ensure everyone has the same understanding of the state of things.
Since vulnerability management for Docker differs from traditional patching, we had to work with our auditors so they could understand how we were assigning work and make use of the reports we were generating. Traditionally, sysadmin teams would be the ones doing patching, and that’s what our auditors were used to seeing. It took some work to explain that, in our case, every engineering team across the org would be doing patching. After some back and forth, we added some additional details to our report and were able to design one report format that met the needs of our FedRAMP auditors as well as our internal stakeholders.
The key principle of this stage is to investigate and figure out a single report satisfying all interested parties, and then generate it automatically.
Work with individual teams to optimize how they patch. Make sure this is working-with and not dictating-to. The goal is to help them fit patching into existing workflows, not make new ones. Keep in mind that engineering teams are more familiar with their processes and will have some good ideas on how to optimize in ways that the security team may not have thought of.
Three techniques that were the most productive here were automated integration testing, shared base images, and soft-version specification.
- Automated integration testing, which should be part of the CI pipeline in general, allows patches to be applied quickly and with more confidence. Because the patches automatically run through the full suite of integration tests to detect issues caused by the patch, this allows for a patch-and-see approach that requires much less human effort.
- Shared base Docker images allow services with similar infrastructure and software requirements to be based on a common image. The collection of common updates is done inside that one image and then rolled out to all the specific services. This lets most dev teams simply bump to the latest shared image as the first step in any patching.
- Soft-version specification is done by simply specifying a major version or major + minor version of packages and letting the package manager automatically update to the latest versions as part of a scheduled Docker image build. That scheduled build can run weekly, and it removes much of the human intervention from ongoing patching. There are some operational risks associated with this kind of automated version bumping, so if this level of automation is started, it is important to have thorough integration tests and to try it out on less critical services first.
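The resolution rule behind soft-version specification can be illustrated with a short Python sketch. Real package managers implement this themselves, each with its own syntax; the function names and the purely numeric version strings here are assumptions made to show the idea.

```python
def parse(version):
    """Split a purely numeric dotted version into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies(soft_pin, version):
    """True if the version starts with the pinned major (or major.minor) components."""
    pin = parse(soft_pin)
    return parse(version)[:len(pin)] == pin


def latest_matching(soft_pin, available):
    """Pick the newest available version allowed by the soft pin, if any."""
    candidates = [v for v in available if satisfies(soft_pin, v)]
    return max(candidates, key=parse) if candidates else None


available = ["1.4.2", "1.4.10", "1.5.0", "2.0.1"]
print(latest_matching("1.4", available))  # 1.4.10
print(latest_matching("1", available))    # 1.5.0
print(latest_matching("3", available))    # None
```

A major + minor pin only ever picks up patch releases, while a major-only pin also pulls in new minor versions, which is where the operational risk mentioned above tends to come from.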
The core principle at this stage is to make it easier to just apply the patch. Doing in-depth analysis and prioritization of vulnerabilities is expensive and time-consuming for security teams. There is also a lot of risk around mistakenly marking high-severity vulnerabilities as not applicable in the environment. If the patch flow is simple and straightforward, teams will simply apply patches, which makes everything easier.
Tips for vulnerability management
(or, how to work with auditors, Docker, and CI/CD systems)
- Engage with engineering teams using the tools and processes they already have in place.
- Don’t be too dogmatic at first.
- Build patching into your existing CI/CD.
- Invest in automation of vulnerability management.
- Create a single report that has the information needed for all stakeholders.
- Make applying patches simple and straightforward.
When rolling out vulnerability management, it is easy to get caught up in early concerns and enter firefighting mode. Remember, this is not about one round of patching, a single critical vulnerability, or one team or image. It’s about consistent, repeatable automation that continually improves security while minimizing impact on the workflows of engineering and security teams.