I've worked on many projects throughout my career as a software engineer. However, Helix - Uber's Rider app rewrite in 2016 introducing Swift and RIBs - stands as the craziest project by a margin.
This is my story and some of my journal entries from the time.
It's the middle of 2016. Uber has been dominating headlines: the company on a rocket-ship growth trajectory and is changing the way people get from A to B. Valued at $62.5B, the company does not seem to be slowing down.
And I got an offer to join Uber as a senior software engineer!
I'm be moving to Amsterdam, joining a team building Uber's payments systems. I'm beyond excited.
My interview process was messy, and although I thought I was interviewing for a mobile role, I ended up on a backend interview loop and getting a backend engineering offer. I took it.
Two weeks before my starting date, my manager emails me to ask if there's any way I can do Android. He can tell more details when I've joined. I've done Windows Phone and iOS before - Android would be the first. I say yes, though I'm a bit puzzled at the request. (Little did I know that I would not have written Building Mobile Apps at Scale if this last-minute request did not come in).
On week one, I get the scoop on the unusual and last-minute request. We're rewriting the Uber app: the whole thing. The codename for the project is Helix. Over 1M lines of code per platform, with more than 100 iOS and 100 Android engineers working on the project. But we need more hands on deck. Hence me joining the Android team over the backend one.
The mobile platform team has been working on the project for a few months, coming up with RIBs, a new architecture we're using for the project. For the program teams - the ones building features -, the work started a few weeks ago. We have less than 3 months: the rewrite needs to be done by 16 September. Every last screen of the old app will not only be ported, but re-imagined, UX and workflows changed.
I am still getting my head around all of this. Why so fast?
I hear stories of a party at Las Vegas at the end of 2015, where TK announced that a new Uber app will be on the store by the end of next year. Apparently, most people forgot about this announcement, but not TK. So rewrite it is.
Why 16 September? It's because of counting backward.
Apple freezes the store at Christmas. So we need a stable version of the app by 24 December. This means 100% rollout a month before. So public launch on 2 November. Company-wide beta on 16 Oct. Employee testing starting 16 Sep. This is why.
To make this deadline of 3 months away happen, we need Android and iOS engineers. Lots of them.
"This will be fun" - I say confidently to my new team.
"I don't think I knew what I signed up for" - I tell my wife.
Two Months of Hell
Middle of August, I board a plane to the Uber HQ in San Francisco. The plan is to work a week from the office and get to know the US teams. On the second week, I'll attend Uberversity - Uber's mandatory onboarding "bootcamp", where I'd learn about the business.
I never enrolled at Uberversity.
A few days into my stay at HQ, I am called to an emergency meeting. We were behind. "We" - meaning the Amsterdam payments team. Our features won't be ready for the 16 September launch. And without Payments working, the launch would be delayed. Which is not an option.
In an unexpected twist of events, my manager asks me to take over leading the Amsterdam part of the project, given both my background and some of the early contributions. I've seen enough to know that we are up against an impossible task, so I replied with the only sensible answer:
I tell him we cannot make the deadline. Not unless we cut the scope, push back the timeline, or add more people. Ideally, all three. He says we'll see what we can do.
In the end, I say yes: but only on the condition that we are not committing to making the deadline. We can only commit to reduce by how much we'll miss it by.
The next two months will be hell. I extend my US trip, and work seven days a week. More people from Amsterdam fly out and they do the same. We are not alone in the office on the weekends: a third of the mobile engineers and most managers show up.
The pace of work is still excruciatingly slow. Tooling is a major issue: compiling the app from scratch takes 25-30 minutes on iOS and around 10 on Android. Landing a change to master can take hours and hours, due to to the long-running test suite, the merge conflicts thanks to the large number of changes and flakey UI tests.
At one point, all UI tests are removed from the CI. This speeds things up for a few days. When they're added back in, it turns out that a dozen regressions quietly also made their way into the codebase.
We keep on pushing, heads down. Both me and all engineers from Amsterdam are putting in easily 80+ hours per week. We come to the office at 9am and don't leave until midnight or 1am. Most mobile engineers I know do the same, except they continue to work from home.
Ariana Huffington joins Uber's board around the same time and releases her book on the importance of sleep right around this time. Piles of free copies are available in the office. Yes, sleep is important. So why am I at the office past midnight? Right: the deadline that can not - or will not - be moved.
The App Release Strategy: YOLO
It's late October. Somehow, we have a rewritten app that works... for the most part.
Now the next question: how do we release it to the public on 2 November?
When you have an app that generates close to $1B/month in revenue, you tread really carefully. Make small changes, A/B test them and improve slowly over time. For a full rewrite, you a do gradual rollout.
However, Uber has a problem that starts with Swift and ends with Apple. We rewrote the app using Swift 2.2. The language is mature enough - but not the compiler. The compiled code is multiple times the size of that of Objective C. And as per Apple's restrictions, if your bundle size is over 100MB, users cannot download the app over the air: they need to be on WiFi to do so.
Getting the rewritten Swift app to fit under 100MB was a challenge is a story by itself. Bundling the old Uber app and the new Uber app together goes beyond 100MB.
This was a massive headache because a large majority of first-time Uber users download the Uber app over the air. The iOS app might be around $1B/month business, but the very high growth depends on more users signing up every day. And if we push a bundle that's more than 100MB, this growth stops overnight.
So we are left with these options:
- Ask Apple to get rid of the 100MB over-the-air update limit, and do as Android does. Oh, wait - we asked. They said no.
- Take a magic wand and shrink the bundle with the old and new apps so they fit within the 100MB limit. A team worked on this around the clock for the past two months. They came up with incredible workarounds to reduce the app's size. But we won't get here.
- Throw out the whole project and restart in Objective C. This is on the table. Seriously.
- YOLO. Replace the old, Objective-C app with the new, Swift app on the store. The app that a few thousand Uber employees and select few contractors have been testing for less than two months. What could go wrong?
None of us know for sure what the app release strategy will be until a week before launch. Then it's confirmed. We're going YOLO.
Apparently, TK personally made this call. I still can't believe what this means: we have one week to fix all known issues before it goes out to millions of users. And who knows of the issues we don't yet know about.
A Few Days Before Launch
The following sections are edited versions from my journal at the time.
We've been dogfooding the hell out Helix for weeks now. Tons of reports coming in. We can barely keep up with processing the feedback. The biggest complaints these days are layout issues here and there. Our designer actually squeezed in tweaking a font color as a high enough priority task for this last week. So yeah - we're looking good.
Also, whatever we do this week will only go out a week after launch. The launch build was cut yesterday, and people will be testing the hell out of it. If we discover anything really bad, we can still alphafix the launch build.
Anyway, I'm on call for the launch week. So if anything goes wrong, I'll be the first one to know. No pressure.
One Day to Go
My alarm rings... or so I think. It's my phone. It's 2AM. My manager is on the line.
"I'm sorry for waking you up. We have an urgent situation. Last night we discovered an issue that might be a launch blocker. We have 16 hours until we release Helix and need to investigate now. Can you look into this now?"
Oh yeah - I now remember I'm the on call. Of course I'll look into it. I turn my laptop on, get the latest build, and start to reproduce the steps. The issue is related to a flakey feature. Sometimes it can fail fraud checks that are difficult to debug. And sometimes it fails because a third party has a brief outage.
I quickly do the repro, expecting the test will pass and I get to go back to bed. It fails the first time. As I said, it's flakey. It fails the second time. That's odd. It fails the third time. Damn. Did this thing really break? I need help. So I ping the backend oncall. He's already online. Saves me from waking him up.
I spin up my dev environment to get some debug and network logs. Meanwhile, the backend engineer also does the sanity checks. Fails for him as well. I get the logs and send them over, and we start debugging.
An hour later we have some ideas on what's going on. Two hours later it's a dead end: we start over. Three hours later it seems like it's a client-side issue that we need to hotfix. Four hours later, at 6am, we are pretty certain that this bug is not a client-side regression. We can just fix it on the backend: no client changes needed. We have 12 more hours to decide what to do and I'm exhausted. We agree to regroup in 4 hours with a fresh mind, and log off.
I keep hitting snooze on my alarm. The countdown timer is at T minus 6 hours by the time I get to the office. We need to decide in an hour if we need to do a last-minute hotfix, or we launch with this known issue. The people in the office did some more debugging. The bug is a bad one, but we can launch with it. So we make the call to not hotfix - we'll sort it by the next release.
An hour to launch, we pile into a big meeting room, and get the video link with the Command Center. In HQ, they set up a special room where they'll be monitoring the launch for the first 24 hours.
We do the countdown... 3.. 2... 1... and someone presses the App Store & Google Play release buttons. We pop the champagne open. At the time of releasing the app, the press embargo is lifted. At least 20 news articles pop up instantly on Google News. The headlines are positive and cheesy. A few are on the monitor for display. Someone reads out loud some ridiculous parts of the articles. We cheer when we see a screenshot with a payments feature - the screens that we built.
It feels like we just exhaled after holding our breath for months. Someone says, "let's order pizza!" and we all agree. A few engineers check dashboards - but it's too early to see any data. After an hour of socializing, I bike home. The first part of the launch was smooth. I feel great.
By the time I get home, I already have a few DMs on my phone.
Dude, are you available? We have a problem. Ping me when you're home ASAP.
Dammit. It seems the feature my team wrote - or at least part of it - is broken. A bunch of users are reporting a weird error message. It's unclear if our team's code is causing this issue or it's something else. Right now, it doesn't really matter.
People are pouring into the incident chat room, trying to figure out what is going on. I join the Zoom call with the Helix launch Command Center. Apparently, we're not the only ones with launch issues. There are five ongoing issues right now with other teams... and of the five issues, our issue seems so far to have the biggest impact. Dammit even more. And I was feeling good about the launch half an hour ago.
Multiple people on the team are diving deeper. We know it only happens on iOS, not on Android. The backend team has an idea of why this is happening. I look into the mobile code, but I've not worked on this part, and I'm just as confused as to what's causing it.
One of the mobile engineers joins the chat and says he has a fix ready and already landed. It just barely missed the build cut. Says he knew it was a problem, but thought it would barely affect anyone, so decided not to alphafix. He says it's all tested and works fine. Ok, so maybe this is not so bad.
Another team has the numbers coming in. Someone says it's affecting 15% of users. Surely that's wrong. Ok, correction, it's a 15% elevation in the error rate for this feature from the old app. Still that's a massive increase and something we cannot ignore. We decide to fix it on the backend, as the client fix might take at least a day or more to deploy.
Our backend team starts looking into a fix. In the Command Center, some other senior backend folks show up to help resolve this quickly. After 30 minutes the backend team has a fix in the works and say it will be done in another half an hour. Phew. We keep working on the client-side alphafix as well, just in case.
Fuck It, Ship It
Update from the backend team: they can't ship the fix. Legal just got involved and are blocking the backend fix, as is. They say we need to push out the client-side change... for legal reasons. I disagree with their assessment, but it doesn't seem like the time to argue. Client-side fix it is.
The hotfix is in, and a new build is out—time to confirm the fix working in prod. Four of us engineers all do the repro steps on the old app, where the bug appears for all of us. Then we test the new build with our production accounts to confirm the fix works. For three of us, it works like a charm.
Except for one engineer. Says he gets the same bug. What? So the fix is only partial? We can't ship this yet. We start to debug, using the last engineer's account information.
Another hour passes, and we're still not entirely sure what's wrong with the last edge case. Seems like his account might be compromised. That's the problem when testing with production accounts - we have all sorts of junk accumulated, which most users would never have. We're talking with a team who seems to have forgotten to clean up some user data, that is causing our fix not to work. They say this is an edge case, and very few people will be affected. Our engineer just got unlucky.
Do we play it safe and debug some more?
Is three passes and one failure the wrong way to go about this?
Fuck it, ship it.
We give a thumbs up for the release team to take the hotfixed app and do the emergency release.
The code works better than before. We'll debug to see if there was an edge case left. In the meantime, let's ship what works.
It's Not Over Till It's Over
I have no idea how, but two and a half-hours later, the fix is live. I'm assuming the release team used all the goodwill accumulated over the years to make the fastest app update I've ever heard of happen. Either way, it's done, and done fast. Let's just hope it works.
The fix works: a user who saw the issue confirms updating the app resolved it. Exhale.
It's 3am, one critical fix down. But how many more to go? Usage numbers are already far above all projections. Why so many? It's the marketing team, who have outdone themselves for the launch. The news of the Uber app is on every online publication. People are going to the App Store, and hitting "Update" on the app to get the latest version.
Technically, I'm the oncall and another engineer the secondary. In practice, at 3am, at least 10 of us are online. This won't work.
"Let's make a schedule to have 24/7 coverage for the next 48 hours" - one of us proposes. "Some of us should get sleep now, while others cover. Then we swap. Sounds good?"
We agree and draft up the schedule. I take the shift until 9am. I doze off at 5am and fall into a dreamless sleep. When the alarm wakes me, I panic and reach to check if I've missed any PagerDuty notifications. I sigh with relief when I see that I haven't.
The next day and a half goes past slowly. We keep checking outage rooms, logs and online forums. Is anything else broken with payments?
After 48 hours, we don't see any other major issues. We launched. We didn't crash and burn. We made it! Somehow.
Helix - Uber's Rider app - rewrite was the most insane project I've worked on in my professional career. As crazy as it was, I owe lots of learnings and much of my Uber career growth to this project as well. Some of the learnings in my book, Building Mobile Apps at Scale: 39 Engineering Challenges - like adopting new languages and frameworks, also directly come from this project.
I joined Uber in July 2016, a few weeks after the project being kicked off. A month later, I found myself in the position of project lead on a team of 20 engineers building Payments, split across Amsterdam and San Francisco. We were behind schedule on a seemingly impossible timeline.
Somehow - focusing, cutting scope, thinking outside of the box, and every single person working on Helix giving 110% - we made it. Not just us, but all of Uber.
Considering the circumstances, the launch was a roaring success. Very few critical bugs were found: which was impressive, considering the YOLO release and how little time we had to test. Though it was not visible to me at the time, but other parts of the company like ops worked just as around the clock as engineering in testing the app and doing as much as they could in the ridiculously short timeframe we had.
In the first 24 hours of the launch, more than 1M people used the new iOS app in 60+ countries. We had planned for 10% of this traffic at most. On Android - where we were able to do a controlled, staged rollout - this number was well under 10,000.
The Objective C repo was deleted a week later, and teams were focused addressing the major issues of the new app, which were especially pronounced in countries outside the US. By Christmas, the app was stable, and - just as TK promised a year ago, in Las Vegas - a new Uber app was available worldwide.
Helix took a huge toll on most people. A few people quit during the project, but the majority of resignation notices came after we shipped the app. Many people did not even stick around a few months for March bonuses. Most of these resignations were in the US offices. However, one of my teammates in Amsterdam also quit because of the extreme stress of the project. During these months, working very long hours and over weekends became a norm and strongly encouraged by leadership, burning many people out.
I am doubtful I would have become an engineering manager if it was not for this project, though. More importantly, this project proved to me that all engineers can lead features/projects to completion very efficiently - and it shaped my approach of all engineers being leaders on a team, over time. It was this mentality that saved our part of the project on Payments: people around me stepping up, leading, executing, and helping each other. It was a crazy crunch - but one that at least did not feel lonely.
The project - and its learnings - had a visible impact within Uber. The next app rewrite - the Driver app rewrite - was the opposite of Helix: it was far more gradual. Deadlines were pushed back to accommodate for teams running behind. And the next language adoption - Kotlin - was significantly delayed exactly because of being burnt by moving to Swift too early, too fast. As of October 2020, Uber still did not adopt Kotlin, treading with extreme caution, still remembering the painful Swift lesson.
Other Helix Stories
Hundreds of engineers worked on this project, and everyone will have different perspectives and stories to tell. Here are quotes from other stories worth reading for this same project:
- "Alright folks, gather round and let me tell you the story of (almost) the biggest engineering disaster I’ve ever had the misfortune of being involved in." - McLaren Stanley, formerly engineer on Mobile Platform in this long Twitter thread that is a worthwile read.
- "There was an unfortunate side-effect to the problem we saw with Swift: we became very wary of new language adoption. When Android engineers wanted to adopt Kotlin, there was a lot of resistance to doing so." - Chris Brauchli, formerly on the Mobile Platform team in the excellent article Binary Size Woes.
- "I was part of the team in Amsterdam on this. Insane days. I think if we had not had some of the engineers we did at those exact times, we would have not gotten out of it." - an Uber engineer in a comment on the Hacker News thread for McLaren Stanley's thread.