The Software Development Dilemma: Move Fast without Breaking Things?

One of the most famous Facebook slogans is "Move Fast and Break Things". Here is how Pedram Keyani recalls what this meant at Facebook back in 2007:

I’ll never forget my second day on the job when Chris Cox, our V.P. of Product, assigned me the task to fix a pretty complex bug in News Feed. I'd never written PHP in my life, but a few hours later when I got my code working and ready for review, he said "Cool, ship it!" with a big smile on his face. At first, I didn't think he was serious, but once it shipped, I realized that I had just pushed out a change on my second day that 25 million people would experience instantly.

For any engineer - including myself - this is as good as it gets. Finally, I can just focus on what I do best and worry about the rest after shipping. Where do I sign up?

... Except Things Do Break

It's really tempting for any team to follow the Facebook way. Why don't we just iterate fast, ship product quickly and beat our competition by outpacing them?

In reality, more often than not, moving fast and breaking things results in shipping scrappy software. In the rush to get things out the door, time-consuming work gets skipped: user testing, automation, analytics, monitoring and manual testing, just to name a few.

The resulting bugs keep annoying the people who use and rely on the software. It might just be a little bug here and another one there, but they all add up. A good example is how food delivery apps have tons of small glitches that make their couriers' jobs pretty difficult. This is how Anna Vignet describes her experience delivering door to door in San Francisco:

I’ve been delivering for Postmates and DoorDash, but glitches in their apps have created unnecessary buckets of my sweat. (...)

The app dispatches bicyclists on short-distance orders but doesn’t account for the grade of the streets. (...)

Sometimes the app seemed to forget that I needed to bike back to the hotspot. After delivering Mexican food to a neighborhood on the west side of the city, the app immediately buzzed, asking me to pick up barbecue, and deliver it to the east side within 32 minutes — a 4.5-mile ride. That seemed impossible. (...)

Given that these companies can’t deliver food without us, couriers, it’s remarkable how unfinished their support services are.

To be honest, for startups, buggy software is probably not a deal breaker early on. But it's not just startups that follow this kind of thinking; it's almost everyone in the industry. Scott Hanselman brought this topic up a couple of years ago, and in a recent post he doesn't see things getting better:

I posted a rant back in 2012 called "Everything's broken and nobody's upset." (...) Fast forward a few years, and I think that we as an industry are perhaps still headed in the wrong way. Technology companies are outsourcing QA to the customer and we're doing it using frequent updates as an excuse.

The App Store makes it easy to update apps. Web Sites are even worse. How often have you been told "clear your cache" which is the 2015 equivalent to "did you turn it on and off again?"

It's too easy to ship crap and it's too easy to update that crap. When I started in software we were lucky to ship every 6 to 9 months. Some places ship every year or two, and others still ship once. I see folks misusing Scrum and using it as an excuse to be sloppy.

How Can We Stop Breaking Things?

A lot of the software we use every day is broken in one way or another, and most people seem to accept this. As engineers, surely we can do better.

A very simple solution could be to jump back in time and ship less frequently. But jokes aside, being slow is no longer an option. Shipping fast is the new normal, like it or not. Would you be eager to work at a place that has "Move slow and don't break things" painted on the wall? Yeah, neither would I.

Identify Things You Cannot Break

Zach Holman has a great post about how GitHub manages to "Move Fast and Break Nothing". He thinks there is one thing you should do first and foremost:

I think move fast and break things is fine for many features. But the first step is identifying what you cannot break. These are things like billing code (...), upgrades (hardware or software upgrades can always be really dicey to perform), and data migrations.

Once you've identified these areas, the real challenge is not breaking these critical parts of the software while still making small, fast changes. Like changing the engine while the car is running, there is no simple, one-size-fits-all solution.

One interesting approach GitHub uses is parallel code paths. In a nutshell, the idea is to run both the old and the new code, and to switch to the new code only if it performs at least as well as the old one.
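To make the idea concrete, here is a minimal Python sketch of a parallel code path. It is not GitHub's implementation: the `Experiment` class, its counters and the `safe_to_switch` threshold are all hypothetical names made up for illustration, and it only compares results, not speed.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass
class Experiment:
    """Hypothetical helper that runs an old and a new code path side by side."""

    name: str
    runs: int = 0
    mismatches: int = 0
    errors: int = 0

    def run(self, old: Callable[[], T], new: Callable[[], T]) -> T:
        self.runs += 1
        old_result = old()  # the old path stays the source of truth
        try:
            if new() != old_result:
                self.mismatches += 1
        except Exception:
            # A crash in the new path is recorded but never reaches the user.
            self.errors += 1
        return old_result

    def safe_to_switch(self, min_runs: int = 1000) -> bool:
        # Assumed rule: enough runs, and the new path never differed or crashed.
        return self.runs >= min_runs and self.mismatches == 0 and self.errors == 0
```

Callers always get the old result, so a buggy new path can only show up in the counters. A real setup would probably also compare timings and sample only a fraction of requests.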

Move Fast With Stability

Coming back to Facebook: it turns out they also had to go back on the whole "Break Things" part. In 2014 Mark Zuckerberg said their new motto is "Move Fast With Stable Infra":

What we realized over time is that it wasn't helping us to move faster because we had to slow down to fix these bugs and it wasn't improving our speed.

What this means is more automated tests, better monitoring and other infrastructure that helps prevent and catch bugs as early as possible. I have to agree that this is one of the best ways to enable moving fast.

I would go as far as calling the lack of this infrastructure "infrastructure debt": early on, as a startup, it's usually okay to have more of it. But once you have more users, and you don't want to keep upsetting them with bugs, perhaps it's time to shift your mentality in the same direction.

Be Aware of What You Break

Identifying what not to break and moving fast with stability are both good directions to take. But let's not forget about something basic: if you don't know it's broken - You Ain't Gonna Fix It (YAGFI*).

Whatever software you ship, make sure you can detect when it breaks. Log errors on the server, send crash reports from apps, collect timeouts in web apps. These are pretty basic, and most engineers I know do them from day zero.

There are two things many people don't do but should. First, they don't alert the team when a part of the system that cannot break does break. The whole point of identifying those critical parts of your software is that you drop everything when something goes wrong there, fix it, and learn from it.
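A rough Python sketch of what "log everything, alert on the critical parts" could look like. The component names in `CRITICAL_COMPONENTS` and the `alert` callback are assumptions for illustration; in practice `alert` would be whatever paging or chat integration your team already uses.

```python
import logging

log = logging.getLogger("app")

# Hypothetical list of the parts that "cannot break".
CRITICAL_COMPONENTS = {"billing", "data_migration"}


def report_error(component: str, error: Exception, alert=print) -> bool:
    """Log every error; additionally alert the team for critical components.

    `alert` is a placeholder for a real paging or chat integration.
    Returns True when an alert was sent.
    """
    log.error("error in %s: %r", component, error)
    if component in CRITICAL_COMPONENTS:
        alert(f"CRITICAL: {component} is failing: {error!r}")
        return True
    return False
```

The point of the split is that ordinary errors end up in the logs for later, while anything in the critical set interrupts someone right away.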

The other, equally important thing that few products do well is making it easy for users to report bugs. How easy is it for people to send you bug reports on your website, app or service? For most products it's really difficult. If your product is one of these, consider using or replicating something like Usersnap or Zendesk. And take user-reported bugs seriously: a complaint from someone who cannot make a payment should be treated the same way as your payment server returning a 500 response.

Move Fast and Be Smart About It

I remember when, many years ago, the code I wrote took 6-12 months to ship. On the projects I work on these days, it usually takes no more than a week on iOS and as little as a few hours on the web.

More and more companies are proving that it is possible to move fast without sacrificing much quality. It is time for us engineers to step up to this challenge and find ways of moving fast and not breaking things.

*YAGFI - this could very well be the opposite of YAGNI.
