Distributed systems learnings in 2019

With the year closing, I'm looking back at things I've learned while my teams were building and operating distributed systems.

Building a new distributed system is easier than migrating the old system over to it. In the past two years, I've been part of a few rewrites of distributed systems and observed some more. While planning, architecting, writing, and deploying these systems is an interesting challenge, it's not the hardest part. Migrating the existing systems over to the new one has been more time consuming and just as challenging as writing the new system.

Why was this? For one, we tended to underestimate how much custom monitoring we need to build to confirm the two systems work the same way. Second, the new system was built in an elegant way, without any workarounds. However, the legacy system it replaced has a lot of edge cases: edge cases we discovered during the shadowing/rollout phase. In some cases, we needed to make tough business decisions whether to support an edge case or drop it. Related to this, Will Larson has a good article on running good migrations.
A better way to improve reliability: measure, report, repeat. Start simple. For the past three years, the reliability of our systems was top of mind. However, we seemed to be threading water: for every improvement, we later had regressions, and I was never really clear on how reliable the system was over a longer period of time. This year, we took a step back: agreeing on simple SLOs (Service Level Objectives) and a "low bar" reliability number, like 99.9%. We then started to measure this on a weekly basis and report on it. Whenever we did not hit it, we dug into the root cause and fixed the systemic issue.

For the first time in three years, I feel a lot more confident in what the actual reliability of our systems is. Now that we're walking, we'll start jogging next year, with stricter SLO definitions and/or stricter reliability targets, like 99.99%.
Idempotency changes should be treated as breaking changes - even if they technically don't qualify as such. In one of the most interesting outages I've been part of this year, a system made a change, that changed a previously idempotent endpoint to behave as non-idempotent. The thing was, the idempotent behavior was not documented as part of the contract. However, clients assumed this was always the case: it always has been. The change caused clients to fail in ways that were not tested, causing an outage in parts of the system.

The team making the change argued that the change was non-breaking, and consumers should not assume idempotency. I strongly disagree. Idempotency, consistency, or durability changes for services should be treated as breaking changes: both in communication, rollout, and API versioning (in the end, we resolved the issue by agreeing to treat future changes as such).
Reliability & postmortem reviews are more impactful when going further and looking at systems issues that are hurting multiple teams. I've had far too many reliability and high-impact postmortem reviews the past year. We have a mature process around this, and teams come well-prepared, having done good analysis on the postmortems. Still, most postmortem reviews were very much about the team itself.

However, towards the end of the year, we started to focus less on the team-level learnings, but more on cross-team learnings. As we started to ask questions, "Could this issue impact other teams?" and "Is any team or tool blocking you from detecting or mitigating outages like this faster?", we started to dive into interesting topics. It turns out, many of the outages have similar pain points and root causes, that teams, individually struggled to address.
Unique and unexpected challenges when running your own data centers. Uber is one of the few places where we do run our own data centers, on top of using cloud providers. I have seen some interesting challenges around power failures, connectivity issues between data centers, service storage being split between two data centers failing over between a cloud, and a private data center. Unfortunately, I can't share exact details, but there have been a few "I had no clue this can go wrong - but it did" moments throughout the year. Suffice to say, I have a newfound empathy for teams at Google, Microsoft, and Amazon who are building data centers at scale.
Deploy on Fridays, Saturdays, and any day - but think about when code chills are sensible tradeoffs. Towards the end of the year, Twitter exploded on the "should we deploy Fridays" debate. If you are running distributed systems with distributed teams, you should deploy all the time. At Uber, I recall we do a few hundred deploys per hour - including the weekends. On the weekends, it's either because of outage mitigation or because of timezone differences (Saturday 7am in India is still Friday evening in the US).

The tooling, alerting, monitoring, and oncall should all make deploys a confident process, at any time. However, at Uber, we do have code chills for the very few critical times of the year, when our traffic surges and a large majority of the workforce heads to holidays. Specifically, around the end-of-year holidays. This is a conscious tradeoff teams make.

For example, 95% of my team is on holiday during some of the highest load days of the year. I'm perfectly happy trading off some deploy debt and a few days in January to sort merge conflicts to ensure a lower-than-average outage rate this time of the year, through a code chill. Again, your mileage might vary: if you are seeing less load during this time, a code chill or freeze is probably not necessary.
Financial / end-user impact of outages is just as important as the systems impact. A lot of outages I've seen during the year have been quite abstract in their impact. For example, "Service X was degraded for 10 minutes, and 50,000 requests were dropped". The impact of these outages was hard to determine. My team also had a few outages where we knew we were down, but the impact was hard to map to actual users - so we did an upper bound estimate, and called it a day.

For large systems, this is often not simple to make the connection between a downstream service and how its downtime hits consumers. It takes a lot of talking with consuming teams - listening, collaborating, and proactively reaching out after outages. The payoff is great, though. If an outage doesn't have a real financial impact: was it really that bad? If no user notices the outage, was it really a high severity? What is important is to be clear on the long-term user impact and measure this.

For example, an outage that slows down, paying for 10% of users by 10 seconds might sound like something that has no user impact, on the surface. But is that true? Did 100% of the users finish payment? Or did some churn, not coming back to complete their order? Did any of the users churn for good, meaning losing all the business with them? If you dig deep enough, you can put a price tag on all outages, which makes reasoning about how much time to spend on prevention all that easier.
A simple way to determine who owns a service: who owns the oncall? This year, my team transitioned several of our services to other teams. We had several debates on how to determine when a transition is complete. In the past, if you asked me how I determine service ownership, I might have mentioned things like code ownership, oncall, expertise and the lot.

Now, I believe whoever owns the oncall - the operating of the system - owns the system itself. Everything else needs to follow: code ownership, understanding of the system and so on. This approach made the transitions much more straightforward. It also reinforced my belief that shared oncalls between multiple teams can be a bandaid solution, but they are not healthy, on the long term.

This year, my team went from owning most of what is payments in Uber's Rider app - 10+ payment methods and all payments experiences across mobile, web and backend - to building and launching Uber Wallet across multiple Uber apps. We've migrated to our new distributed systems and built new, high QPS ones. Next year will be about scaling these up, operating them reliably and onboarding more teams on them. Hopefully a lot more learnings to come that I can share.

Subscribe to my weekly newsletter to get articles like this in your inbox. It's a pretty good read - and the #1 software engineering newsletter on Substack.

Menu