How Perk cut recovery time to 31 minutes with centralized rollbacks in CircleCI
Staff Software Engineer at Perk
In this guest post, James Butherway, Staff Software Engineer at Perk, describes how his team accelerated incident recovery with CircleCI’s built-in rollbacks feature and scaled those benefits across dozens of services using the Platform Team Toolkit.
Want to see how this works for your team? Contact us for a demo or get started with CircleCI.
Perk (formerly TravelPerk) is the intelligent platform for travel and spend management. Our DevOps team, part of the Foundations tribe, enables engineering squads across the department through platform tooling, infrastructure support, and operational excellence.
We own the CI/CD foundations for the entire Builders department, maintaining the continuous integration and deployment pipelines that all squads use daily. Our partnership with CircleCI has been key to this mission, enabling us to build sophisticated deployment workflows and implement advanced features like the centralized rollback capabilities described in this post.
Our tech stack
Perk’s infrastructure is built on multiple clouds, leveraging a modern, cloud-native architecture. Many of our services run as containerized applications, for example on Amazon Elastic Container Service (ECS), which provides scalability and reliability for our platform.
Our more modern services embrace a serverless-first approach where appropriate, making extensive use of AWS Lambda functions or Google Cloud Run services for event-driven workloads and microservices. This is complemented by other cloud-native managed services, including API Gateway for API management, DynamoDB for NoSQL database needs, and various other AWS/GCP products that enable us to focus on delivering value rather than managing infrastructure.
All AWS workloads use distributed CircleCI configurations in their repos to build and deploy. Both project styles, ECS and Lambda, needed to support rollback to improve mean time to resolve (MTTR) during incidents.
The challenge: Our old rollback mechanism
Before implementing the centralized rollback solution, Perk’s rollback capabilities evolved through two generations of bespoke command line interface (CLI) tools: a legacy Lambda CLI that was no longer maintained but referenced in old documentation, and a newer ECS CLI that suffered from poor adoption and engineer awareness.
Together, these tools created a fragmented rollback experience that was difficult to rely on during incidents.
The main challenges with our legacy CLI tools were:
- Outdated documentation: The CLI tools lacked comprehensive documentation, and much of what existed referenced obsolete tools, which confused engineers.
- Low confidence in legacy tooling: Engineers rarely used these tools, which made them hesitant to initiate rollbacks and increased MTTR. Most opted for a roll-forward approach instead.
- Limited rollback functionality: The older CLI could only roll back one version easily, lacking flexibility for multi-version rollbacks.
- Limited visibility and auditability: Minimal feedback and a lack of audit logging made rollbacks difficult to track during and after incidents.
- Maintenance burden: Bespoke tools required ongoing maintenance, leading to outdated scripts that sometimes failed during incidents.
Due to this fragmentation, engineers typically resorted to two suboptimal approaches during incidents:
- Full pipeline re-run: Triggering the entire CI/CD pipeline, including all testing and build stages, which significantly extended recovery time.
- Revert and hotfix merge (most common): Creating a new code change to revert the problematic deployment and merging it as a hotfix.
Our target was to achieve an MTTR of 30 minutes or less for high-priority incidents, a goal that required a fundamental reimagining of our rollback capabilities.
We define MTTR as the time elapsed between an incident causing instability on the platform and the restoration of full stability for our users. It is important to note that incidents vary in severity; even high-priority events rarely equate to a complete platform outage. Nevertheless, improving overall platform resilience remains a key objective for Perk, and accelerating recovery times across all incident types was the primary goal of this project.
What we achieved by adopting centralized rollbacks in CircleCI
By implementing CircleCI’s rollback pipeline feature alongside the Platform Team Toolkit, we achieved significant improvements:
- Rollback speed: P95 rollback time on our core app fell from 7 min 43 s to 3 min 48 s (48.7% decrease).
- Faster MTTR: MTTR in an incident where the new rollback was used was 31 minutes.
- Configuration maintenance: All simple and complex rollback functionality in the platform is now supported by just 4 CircleCI config files, replacing bespoke CLI commands.
- Team adoption: Using the CircleCI API, we’ve automatically enabled rollbacks for 80 active projects. Of those, 8 have successfully rolled back within 2 months of the feature’s release.
Implementation walkthrough: Turning CircleCI into a central rollback control plane
Our new approach leverages two powerful CircleCI features working in tandem: the rollback pipeline capability and the Platform Team Toolkit. This combination addresses all the limitations of our previous CLI-based system while providing a superior developer experience.
Centralized orb and deployment markers
Much of our deployment logic was already centralized in a versioned CircleCI orb. Extending this to support deployment markers (a prerequisite for CircleCI’s rollback feature) required a small and straightforward migration process. We leveraged the CircleCI CLI available on all runners to ensure deployment markers were consistently added across our Lambda and ECS services in just a few working days.
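As a rough illustration of what a deployment marker looks like in a deploy job, the sketch below reuses the `circleci run release plan` and `circleci run release update` commands from the rollback configuration later in this post, minus the `--rollback` flag. It is not our orb's actual implementation: the job name, the environment values, and the `deploy.sh` script are hypothetical, and using `CIRCLE_SHA1` as the version is an assumption.

```yaml
# Sketch only: a deploy job that records a deployment marker.
# Job name, environment values, and deploy.sh are hypothetical.
version: 2.1

jobs:
  deploy-component:
    docker:
      - image: cimg/base:current
    environment:
      COMPONENT_NAME: example-service    # hypothetical
      NAMESPACE: default                 # hypothetical
      ENVIRONMENT_NAME: production       # hypothetical
    steps:
      - checkout
      # Creates a deploy with PENDING status, using the commit SHA as the version (assumption)
      - run:
          name: Plan release
          command: |
            circleci run release plan \
              --environment-name=${ENVIRONMENT_NAME} \
              --namespace=${NAMESPACE} \
              --component-name=${COMPONENT_NAME} \
              --target-version=${CIRCLE_SHA1}
      - run:
          name: Deploy
          command: ./scripts/deploy.sh   # hypothetical deploy script
      # Move the PENDING marker to its final status
      - run:
          name: Update planned release to SUCCESS
          command: circleci run release update --status=SUCCESS
          when: on_success
      - run:
          name: Update planned release to FAILED
          command: circleci run release update --status=FAILED
          when: on_fail

workflows:
  deploy:
    jobs:
      - deploy-component
```

Each successful run of a job like this leaves a versioned entry that the rollback UI can later offer as a rollback target.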
Using the CircleCI rollback wizard to create a template
Once deployment markers were in place, CircleCI’s rollback UI became available. When clicking on the “Rollback” button for projects without rollback configured, CircleCI provides a setup wizard that guides you through the process in under an hour. The wizard analyzes your existing deployment workflow, uses CircleCI’s GitHub integration to create a feature branch with the necessary configuration changes, and automatically raises a PR for review.
The automatically created configuration served as a perfect template for the eventual centralized rollback files. It was missing the core step that actually rolls back a service, but once we had written that script, it was easy to insert it into the YAML configuration.
The CircleCI rollback UI
This rollback script, attached to the project, enabled the end-to-end rollback flow for our first service. This was a transformative upgrade from our CLI tool. Instead of remembering command syntax or hunting through documentation, engineers could now:
- Click the “Rollback” button directly in the CircleCI interface
- See a visual list of all previous deployment versions
- Select the desired version to roll back to
- Trigger the rollback with a single click
This intuitive interface eliminated the confidence problem we faced with the CLI tools. Engineers who use CircleCI daily for their normal workflows now have rollback capability seamlessly integrated into their existing toolset.
Scaling adoption with the Platform Team Toolkit
While the rollback UI solved the usability problem, we still faced a challenge: how could we make this capability immediately available across all our services without requiring every team to manually update their configurations?
The answer came through CircleCI’s Platform Team Toolkit. This feature allows us to centrally manage rollback configurations and automatically distribute them to projects across the organization. This, coupled with the CircleCI API, allowed us to:
- Define a standardized rollback configuration that works for all services
- Automatically enable rollback for 80 active projects across both Lambda and ECS workloads
- Eliminate the need for individual teams to manually set up rollback for their projects
This centralized approach solved our documentation and communication problems. Rather than relying on engineers to discover and configure the CLI tool, rollback became automatically available to everyone through a UI they use each day.
Flexible configuration through overrides
We recognized that some teams would need more sophisticated rollback workflows beyond the simple “revert to previous version” use case. The Platform Team Toolkit enabled us to offer two tiers of rollback configuration:
- Standard rollback (centralized): A simple, zero-configuration rollback that works out-of-the-box for services. This configuration is maintained centrally by the DevOps team and automatically updated across all projects.
- Advanced rollback (project-specific): Teams that need custom pre-rollback or post-rollback actions (such as Slack notifications, database migrations, or cache clearing) can extend the centralized configuration in their own project while still inheriting the core rollback functionality.
This two-tier approach gives us the best of both worlds: the simplicity and consistency of centralized management for most teams, with the flexibility for advanced use cases when needed. The core rollback logic remains centralized and maintained by DevOps, while individual teams can customize the workflow to their specific needs without reimplementing the entire rollback mechanism.
Standard rollback configuration example
This is a script taken directly from our centralized configuration repository. This repo was created to hold all the core configurations that we plan to share across multiple CircleCI projects.
# This workflow performs a rollback of an ECS service's task definition to a specified previous version.
version: 2.1

orbs:
  common: travelperk/com@4.9.0

jobs:
  rollback-component:
    executor: common/aws
    environment:
      COMPONENT_NAME: << pipeline.deploy.component_name >>
      NAMESPACE: << pipeline.deploy.namespace >>
      ENVIRONMENT_NAME: << pipeline.deploy.environment_name >>
      TARGET_VERSION: << pipeline.deploy.target_version >>
    steps:
      - checkout
      - attach_workspace:
          at: .
      # This step will create a new deploy with PENDING status
      # that will show up in the deploys tab in the UI
      - run:
          name: Plan release of ECS << pipeline.deploy.component_name >> rollback
          command: |
            circleci run release plan \
              --environment-name=${ENVIRONMENT_NAME} \
              --namespace=${NAMESPACE} \
              --component-name=${COMPONENT_NAME} \
              --target-version=${TARGET_VERSION} \
              --rollback
      - common/aws_login:
          default_aws_profile: << pipeline.deploy.environment_name >>
      - run:
          name: Perform ECS rollback for << pipeline.deploy.component_name >> to version << pipeline.deploy.target_version >> in << pipeline.deploy.environment_name >>
          command: |
            update_service_task_definition_by_version.py \
              --environment=${ENVIRONMENT_NAME} \
              --cci-component-name=${COMPONENT_NAME} \
              --version=${TARGET_VERSION}
      # These last two steps update the PENDING deployment marker
      # to SUCCESS or FAILED, based on the outcome of the job.
      - run:
          name: Update planned release to SUCCESS
          command: |
            circleci run release update \
              --status=SUCCESS
          when: on_success
      - run:
          name: Update planned release to FAILED
          command: |
            circleci run release update \
              --status=FAILED
          when: on_fail

  # This job handles the cancellation of the rollback deploy marker
  # if the rollback job is canceled
  cancel-rollback:
    docker:
      - image: cimg/base:current
    steps:
      - run:
          name: Update planned release to CANCELED
          command: |
            circleci run release update \
              --status=CANCELED

workflows:
  rollback:
    jobs:
      - rollback-component:
          context: [aws, github, codeartifact]
      - cancel-rollback:
          # Run only when rollback-component ends in the canceled state
          requires:
            - rollback-component:
                - canceled
          filters:
            branches:
              only: main
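For the advanced, project-specific tier, a team's rollback job might wrap the same core script with its own hooks. The following is a minimal sketch rather than one of our real overrides: the job name, the `SLACK_WEBHOOK_URL` variable, and `clear_cache.sh` are hypothetical, the deploy marker steps are left out for brevity, and the way the file is registered as an override in the Platform Team Toolkit is not shown.

```yaml
# Sketch only: a project-specific rollback job with custom hooks.
# SLACK_WEBHOOK_URL and clear_cache.sh are hypothetical names.
# Deploy marker steps (release plan/update) are omitted for brevity.
version: 2.1

orbs:
  common: travelperk/com@4.9.0

jobs:
  rollback-component-with-hooks:
    executor: common/aws
    environment:
      COMPONENT_NAME: << pipeline.deploy.component_name >>
      ENVIRONMENT_NAME: << pipeline.deploy.environment_name >>
      TARGET_VERSION: << pipeline.deploy.target_version >>
    steps:
      - checkout
      # Pre-rollback hook: announce the rollback via a Slack incoming webhook
      - run:
          name: Pre-rollback Slack notification
          command: |
            curl -sS -X POST -H 'Content-type: application/json' \
              --data "{\"text\": \"Rolling back ${COMPONENT_NAME} to ${TARGET_VERSION} in ${ENVIRONMENT_NAME}\"}" \
              "${SLACK_WEBHOOK_URL}"
      - common/aws_login:
          default_aws_profile: << pipeline.deploy.environment_name >>
      # Core rollback: the same script the standard configuration calls
      - run:
          name: Perform ECS rollback
          command: |
            update_service_task_definition_by_version.py \
              --environment=${ENVIRONMENT_NAME} \
              --cci-component-name=${COMPONENT_NAME} \
              --version=${TARGET_VERSION}
      # Post-rollback hook: project-specific cleanup
      - run:
          name: Post-rollback cache clear
          command: ./scripts/clear_cache.sh   # hypothetical script

workflows:
  rollback:
    jobs:
      - rollback-component-with-hooks:
          context: [aws, github, codeartifact]
```

The point of the shape is that the hooks sit around the centrally maintained rollback step, so a team customizes the workflow without reimplementing the rollback itself.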
Improvements at a glance
- Simple “two-click” rollback in a common interface, increasing speed and confidence
- Multiple versions are now available to roll back to, instead of just the last deployed version
- Clear metadata about versions that can be rolled back to, with commits from GitHub
- Descriptions of rollback actions enable an audit trail for accountability and security
- The bespoke CLI can be retired, replacing code that DevOps had to maintain with config files
The following feedback reflects how engineers experienced the new rollback workflow in real incident conditions.
The new rollback was really handy in the moment! Especially in a situation like an incident. You don't want to understand first how a script is working, so having this in CircleCI in a simple UI where all you have to do is to provide the commit hash you want to rollback to, came in very handy.
– Senior Software Engineer at Perk
Conclusion: Where we are now, and where we’re going next
By replacing our fragmented rollback mechanism with CircleCI’s rollback pipeline feature, integrated with the Platform Team Toolkit, we have:
- Significantly reduced MTTR during incidents, bringing it closer to the business goal of 30 minutes for critical services
- Significantly reduced rollback time in over 80 projects
- Achieved organic adoption in 8 projects without requiring training
- Reduced code owned by the DevOps squad, replacing bespoke logic with config files
- Proved that centralized configuration can work well for CI/CD actions
The combination of CircleCI’s rollback pipeline feature and the Platform Team Toolkit has transformed how we handle reverting deployments and incidents, making our systems more reliable and our teams more productive.
Building on this success, our next steps include:
- Rollback in frontend projects: Currently, rollback is only supported for backend projects, but our frontend projects also use CircleCI. We are in discussions with our Frontend engineers to explore how this same pattern might be used to roll back our microfrontends.
- Shared configuration for deploying to lower environments: Rollbacks and deployments are quite similar actions, and both are perfect candidates for centralized config. With growing demand for easier deployments to lower environments, we hope to take a similar approach and roll out new pipelines that deploy a version to pre-production environments separately from the build and test jobs.
- Overriding configurations for core pipelines: Currently, when we create a new project, we duplicate a config template with the workflows and jobs we expect teams to use. Through the use of the Platform Team Toolkit centralization and overrides, we may be able to reduce the duplication of config required and obtain better control of the structure of project pipelines while still enabling innovation and complexity in projects that need it.