A few weeks ago we participated in DevOps Wall St., a one-day event in Manhattan with speakers from FINRA, GitHub, Waffle.io, Modus Create, and others. Rob Witoff from Coinbase shared his thoughts in a talk called “Security, Scalability, Productivity: Embracing a DevOps Mindset at Coinbase with CircleCI.” In this talk, Rob discussed correlating success with the number of deploys, killing snowflake servers, automating good behavior with UX, and whose fault it is when something goes wrong in production (hint: not the deploying engineer’s).
Here are some highlights from Rob’s talk:
- (02:00) Rob shares stats on Coinbase: 100 services, 30+ countries, 6M+ users, 4.5 deploys/engineer/week
- (04:50) Rob talks about the engineering change request (ECR) process he had to go through at previous companies and why bureaucratic software development is not empowering, particularly when velocity is core to your survival.
- (06:35) “If you empower your company to move fast, iterate, and launch new features, you can help your company survive. DevOps done poorly can also be a reason companies die. If a company is not able to move fast enough because security slows you down, or there’s too much human interaction, you can really kill your company.”
- (08:28) In the last 12 months, Coinbase has launched 149,525 servers, deployed 13,146 times, and created 45 new services
- (08:45) Rob talks about how the fast pace of iteration and productivity at Coinbase has allowed the company to become a market leader. Many competitors have disappeared because of an inability to move quickly and securely. One of the philosophies at Coinbase is to empower developers to innovate and ship code without fear.
- (09:30) As Coinbase grew its engineering team, the number of developers deploying was not growing. Three engineers accounted for the majority of deployments. Because deploying to Heroku was so easy, they didn’t want to give everyone access without the proper gates. Coinbase built an internal tool that required multiple people to approve a change before merging to production. This resulted in a huge boost to productivity as the number of deploys dramatically increased.
- (11:11) 3 Tenets at Coinbase: 1) Everyone can deploy every tested master to production, 2) Production deployments start on your first day, 3) Failures in production are failed guardrails
- (14:00) Coinbase looks at successful, new deploys to production per month as a KPI. The target is 4 per person per week.
- (15:00) Another KPI is time from PR to production. The team looks to aggressively automate every step between idea and production to make this process as fast and painless as possible. Coinbase uses GitHub Enterprise for code review and integrates security checks into CircleCI via the circle.yml file. Security issues fail the build automatically.
- (19:15) Coinbase runs CircleCI Enterprise on autoscaling groups, using a custom Terraform script. Rob wants computers waiting on humans, not vice versa. Humans are much more expensive than computers, so this is a good investment. CI is the largest cluster Coinbase runs, but not the most expensive, due to use of the AWS spot market and on-demand pricing.
- (21:43) Shout-out to Gene Kim, The Phoenix Project, and the Puppet “State of DevOps” report. Companies that practice DevOps see 200x more frequent deployments, 24x faster recovery times for failures, and 22% less time spent on unplanned work and rework.
- (24:00) A deploy velocity tool open-sourced by Rob looks at how often companies deploy. A rough correlation exists between how often companies deploy and the amount of bitcoin they move. Frequent deploys are a sign that companies are moving fast, iterating fast, and continuing to grow.
- (26:30) Rob talks about snowflake servers and the Coinbase “30 Day Fleet” age: no server lives longer than 30 days. Constantly redeploying removes fear of change, motivates constant velocity, and is good for security. The environment is constantly burned down and churned so that it gets redeployed.
- (30:15) Redeploy > reconfigure.
- (33:00) Unintended velocity. What happens when you move fast without guardrails, or without fully understanding risk? Coinbase added proactive checks to circle.yml to test for issues before pushing to production.
- (38:00) When a deploy fails, it’s not the fault of the engineer. It’s because you didn’t have sufficient guardrails. Guardrails = anything that protects you from accidentally harming the company. Guardrails should deliver feedback as fast as possible.
- (44:00) Learning from the NSA: you need to know your environment better than those who designed it and better than those who are securing it. Coinbase ran a “scorched earth” exercise to force deployment automation and a patching system. Because all services are integrated in CircleCI and run through CircleCI, Coinbase can scan all services every day to see what is out of date.
- (47:00) Guardrails as UX - automated warnings when you try to deploy on Friday after 4pm. “Deploy to production with great care.” Little forcing functions provide a lot of leverage in organizations.
- (49:00) Infra’s mission: “Provide self-service tooling with guardrails that empowers engineers to rapidly develop, monitor, and optimize services through low-risk deployment pipelines.”
- (50:00) Great ideas are serendipitous. Always look for blockers. What blocked your last great idea? Try to aggressively automate blockers. Most people will never complain. They just won’t deploy something great. If it’s too hard to deploy a great idea, most people won’t do it and won’t tell you.
- (56:00) Security at scale means you have to create an environment where people can get things done quickly. Use the nuclear security paradigm: real value is protected by consensus. Sophisticated attacks can compromise individuals. Coinbase considers administrators to be single points of failure and tries to design them out. No single individual can confirm a change to production.
- (59:00) Velocity comes from your foundation. Destroy snowflakes. Empower through self-service. Build into your foundation.
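To make a few of the guardrails from the talk concrete, here is a minimal Python sketch of three of them: the Friday-after-4pm warning, consensus approval before a production change, and the 30-day fleet age. This is an illustration only, not Coinbase's tooling. Every name in it (`DeployRequest`, `friday_warning`, `has_consensus`, `servers_to_replace`) is hypothetical, and the threshold of two approvers is an assumption, since the talk only says that multiple people must approve.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_SERVER_AGE = timedelta(days=30)  # the "30 Day Fleet" limit from the talk
MIN_APPROVERS = 2                    # assumed; the talk only says "multiple people"


@dataclass
class DeployRequest:
    """Hypothetical record of a requested production deploy."""
    author: str
    approvers: frozenset
    requested_at: datetime


def friday_warning(request):
    """Return a warning for deploys on Friday at or after 4pm, else None."""
    t = request.requested_at
    if t.weekday() == 4 and t.hour >= 16:  # Monday is 0, so Friday is 4
        return "Deploy to production with great care."
    return None


def has_consensus(request):
    """True when enough people other than the author approved the change."""
    independent = request.approvers - {request.author}
    return len(independent) >= MIN_APPROVERS


def servers_to_replace(launch_times, now):
    """Return the sorted names of servers older than the fleet age limit."""
    return sorted(
        name for name, launched in launch_times.items()
        if now - launched > MAX_SERVER_AGE
    )


if __name__ == "__main__":
    # Friday afternoon deploy with two independent approvers.
    req = DeployRequest("alice", frozenset({"bob", "carol"}),
                        datetime(2016, 7, 1, 16, 30))
    print(friday_warning(req))
    print(has_consensus(req))

    # One server is past the 30-day limit and should be redeployed.
    fleet = {"web-1": datetime(2016, 5, 1), "web-2": datetime(2016, 6, 20)}
    print(servers_to_replace(fleet, datetime(2016, 7, 1)))
```

Note that the Friday check warns rather than blocks, in keeping with the “guardrails as UX” idea of small forcing functions. The consensus check ignores the author's own approval so that no single individual can confirm a change.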
You can watch the full talk on Vimeo here.
For more on how Coinbase uses CircleCI, read the case study here.