CircleCI’s unnoticed holiday security breach

Originally published on 5 January 2023.

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of seven topics in today’s subscriber-only The Scoop issue. To get this newsletter every week, subscribe here.

For most engineering teams, returning from the winter holiday usually involves gradually getting back into the swing of things. But today has been anything but a soft landing for customers of CircleCI, who got the email you always dread from a vendor, suggesting CircleCI was compromised and customers' secrets could have been accessed by an attacker during the past two weeks.

Here’s the email CircleCI customers received late on Wednesday 4 January. Emphases are original to the email:

“Action request:

Out of an abundance of caution, we strongly recommend that all customers take the following actions:

Immediately rotate any and all secrets stored in CircleCI. These may be stored in project environment variables or in contexts.

We also recommend customers review internal logs for their systems for any unauthorized access starting from December 21, 2022, through today, January 4, 2023, or upon completion of your secrets rotation. (...) We apologize for any disruption to your work. We take the security of our systems and our customers’ systems extremely seriously. While we are actively investigating this incident, we are committed to sharing more details with customers in the coming days.

Thank you for your urgent attention to rotating your secrets.”

Affected customers are predictably unhappy. The first response has been frustration because of the chaos a breach like this causes: At a scaleup I talked with, infrastructure teams shut down all pipelines in order to replace secrets. This company decided to make the best of a bad situation by taking a few days to implement a secrets rotation mechanism not previously in place.

The bigger issue is that customers could have been compromised for up to two weeks, and it’s all on them to work out if malicious actors could have accessed their systems via leaked CircleCI credentials.
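For teams doing this review, a first pass over access logs could be sketched roughly as below. This is only an illustration under stated assumptions: the `AccessEvent` record shape, the set of credentials known to be stored in CircleCI and the allow-list of known IP addresses are all hypothetical, and the dates are the ones from CircleCI's email, treated here as UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Exposure window from CircleCI's email, assumed to be UTC.
# Extend WINDOW_END to the moment your own secrets rotation finished.
WINDOW_START = datetime(2022, 12, 21, tzinfo=timezone.utc)
WINDOW_END = datetime(2023, 1, 5, tzinfo=timezone.utc)  # exclusive, so all of 4 January is covered


@dataclass(frozen=True)
class AccessEvent:
    """Hypothetical shape of one exported access-log record."""
    timestamp: datetime
    credential_id: str
    source_ip: str


def suspicious_events(events, ci_credentials, known_ips):
    """Return events inside the window that used a CI-stored credential from an unknown IP."""
    return [
        e for e in events
        if WINDOW_START <= e.timestamp < WINDOW_END
        and e.credential_id in ci_credentials
        and e.source_ip not in known_ips
    ]
```

A filter like this only narrows down what to look at; anything it returns still needs a human to decide whether the access was legitimate.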

Why did CircleCI take so long to detect a security incident?

I can only speculate, but it’s fair to assume that starting on 21 December, people at CircleCI, just like at most companies, were already starting to take holidays, and no one expected anything important to happen. If an attacker were to identify the best time to breach a system and go undetected for as long as possible, then the longest annual period of downtime would be it.

It’s not far-fetched to suspect that the CircleCI team discovered the issue on either 2 or 3 January, and then sounded the alarm to customers.

The breach comes two weeks after CircleCI laid off 17% of staff. I don’t want to draw any conclusions, but it’s worth noting that CircleCI let go 17% of staff – around 100 people – on 7 December 2022, two weeks before the apparent start of the incident. I asked CircleCI if they let any of their security team go as part of layoffs but the company did not answer my question.

The CEO of the company wrote at the time of the redundancies (emphasis mine):

“Despite today’s news, we’re confident in our business. I believe we have the right strategy to succeed in the long term. We have the best CI/CD platform on the market by leaps and bounds. We see that success reflected in the continued growth of our business, adding thousands of high-performing engineering teams to our platform over the past 12 months. Our customers are some of the most innovative, engineering-centric businesses on the planet, and helping them do great work will continue to be our focus.”

Today’s news on the breach deals a major blow to CircleCI’s market position. This is because customers expect two things from a CI solution:

To do its main task, which is build efficiently
To remain secure, as source code is one of the most valuable assets any software company owns

CircleCI customers were already targeted by phishing attacks in September 2022. The company made customers aware of a phishing attack in which attackers stole GitHub credentials. However, the notified companies were given this information under NDA and not allowed to share it externally, one of them told me. This business decided to move off CircleCI because of this incident.

Weeks later, the incident became public knowledge and was reported by The Hacker News. Attackers gained access to GitHub accounts of organizations. While this attack is clearly not a breach of CircleCI, the fact customers could be tricked into giving access to what they thought was CircleCI indicated a vulnerability.

I reached out to CircleCI for comment and asked for more details about the September phishing attacks, which customers were under NDA not to discuss, but the company gave only a generic response about how they “take customer security and privacy extremely seriously” and pointed to their announcement, asking for time to provide more details in “the coming days.”

Security incidents have plagued infrastructure providers over the past year or so. Travis CI used to be a popular CI solution and a competitor to CircleCI. However, Travis CI was compromised in 2021, then again in the summer of 2022 and then yet again last December. Travis CI has been terrible at communicating with customers and at this point, in my view, it’s the least trustworthy CI provider on the market.

Heroku was compromised in April 2022, with GitHub security tokens stolen by an attacker. What was really alarming about this situation was that Heroku’s security team did not notice; it was GitHub’s security team which alerted Heroku of suspicious activity with GitHub tokens.

I talked with people at GitHub at the time who told me that, after getting more details from Heroku, the GitHub security team assessed Heroku as a major security risk due to the responses from the Heroku security team. GitHub immediately instructed staff to stop using any Heroku products within its business, and to assume a full compromise of all Heroku keys. Some smaller sites were running on Heroku at the time, and they were turned off, with the secrets and passwords not reused elsewhere.

It was a month later, in May 2022, that Heroku notified customers of the extent of the security breach, writing:

“As reported on status.heroku.com, on April 7, 2022, a threat actor obtained access to a Heroku database and downloaded stored customer GitHub integration OAuth tokens. On that same day, the threat actor downloaded data from another database that stores pipeline-level config vars for Review Apps and Heroku CI. This was identified on May 16, 2022, after further forensic investigation. We have no evidence of any unauthorized access to Heroku systems since April 14, 2022.

As a result, any secrets you set in Review Apps and Heroku CI config vars may have been compromised and should be rotated. In addition, any Heroku tokens stored in these pipeline config vars would potentially have allowed access to your Heroku account between April 7, 2022 and May 5, 2022, when your passwords were reset, invalidating all Heroku tokens as a result.”

In my view, Heroku handled this security incident extremely poorly. For example, in its public incident update, Heroku did not mention that Heroku tokens stored in pipeline vars were compromised, so it seems as if they tried to put a positive gloss on the public-facing incident.

So, how much can you trust your CI provider or your infrastructure provider to not leak secrets? This is a very good question. Over the past 12 months, Travis CI, Heroku and CircleCI have all failed at keeping customers’ secrets, and therefore their codebases, safe.

I expect that platforms which design themselves according to a security-first approach will start to gain popularity. It is becoming increasingly risky to give a CI system full access to your code repository, in the way CircleCI or Travis CI expect.

An example of a provider offering more thoughtful security is Buildkite, which does not store source code or secrets, as customers run their Buildkite agents on their own cloud infrastructure, behind their own security policies. I reached out to Buildkite’s CEO, Keith Pitt, to ask why they chose this model. He said:

“Security is a thing that keeps me up at night. If a CI/CD service gets popped badly, then it’s basically game over. You get one life with this kind of thing.

Having said that: we do have access to GitHub webhook data which includes things like commit messages, branch names, locations of changed and added files. And of course, whatever logs are streamed to us from connected agents.

Security became a big selling point for us. I’d estimate that about 50% of our big customers come to us because of security or performance issues with their previous provider.”

GitHub’s self-hosted runners follow a similar approach: build runners are hosted in the customer’s own environment, which limits how far secrets and source code are exposed outside the company running them.

Now is a great time to do a pre-mortem on what could happen if your CI provider gets hacked, and your secrets leaked. Even if you’re not a CircleCI, Travis CI or Heroku customer, don’t assume your provider won’t suffer a security breach. What sensitive data could be leaked if this happened, and how badly would it affect your business? Do you have measures in place like secret rotation to minimize the damage?
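One way to make such a pre-mortem concrete is to keep an inventory of secrets and ask which of them your CI provider holds, what each one unlocks and how stale it is. The sketch below is a rough illustration, not a recommended tool: the `Secret` fields, the `premortem_report` helper and the 90-day rotation threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Secret:
    """Hypothetical inventory entry; field names are illustrative."""
    name: str
    stored_in_ci: bool
    last_rotated: date
    grants: str  # short description of what the secret unlocks


def premortem_report(secrets, today, max_age_days=90):
    """List CI-stored secrets as (name, grants, is_stale) tuples, oldest rotation first."""
    cutoff = today - timedelta(days=max_age_days)
    exposed = [s for s in secrets if s.stored_in_ci]
    # Oldest first, so the stalest secrets are reviewed first.
    exposed.sort(key=lambda s: s.last_rotated)
    return [(s.name, s.grants, s.last_rotated < cutoff) for s in exposed]
```

Reading the resulting list top to bottom gives a starting answer to "what could leak, and how bad would it be?" for the secrets the provider can see.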

This is not just a theoretical exercise; all of Slack’s code repositories hosted on GitHub were downloaded by a malicious actor during the winter break. Slack employee tokens were stolen and then the attacker logged into GitHub and made a copy of all of Slack’s code repositories. Slack detected this breach on 29 December and communicated it on 31 December. Note that Slack did not have its primary codebase hosted on GitHub, but stored that code on an internal repository which was not compromised.

Do you want to keep relying on your vendor to keep your codebase secure, or would you rather choose a setup for storing your source code, with CI running inside your own infrastructure?

This was one out of the seven topics covered in this week’s The Scoop. A lot of what I share in The Scoop is exclusive to this publication, meaning it’s not been covered in any other media outlet before and you’re the first to read about it.

The full The Scoop edition additionally covers:

Twitter’s first major outage since Musk’s takeover. What caused Twitter’s partial outage lasting hours on 28 December? I’ve talked with people involved in the incident to find out if it may have been connected to Elon Musk personally disconnecting server racks on Christmas eve. Exclusive.
Amazon to lay off 18,000 people and rescind more offers. Things are looking grim at Amazon, with the online retailer announcing that layoffs will be larger than was communicated in December. Meanwhile, signed offers keep being rescinded. Could this be a signal of more cost cutting to come across Big Tech? Analysis.
Hiring freezes. Databricks, Asana and Spotify are slowing or freezing hiring: details. Exclusive.
Layoffs picking up with the new year. The first tech layoffs of the year are already in with Salesforce, Vimeo and Socure. Can we expect these to continue, or is the worst behind us already? My opinion. Exclusive details on Socure and analysis.
Microsoft starting to exit China? The tech giant is spinning up a new office in Vietnam. But why would the company begin to leave China? And would other US companies be wise to follow in Microsoft’s footsteps? Exclusive.
SpaceX valued at $142B? The company is raising $750M of fresh funding and I have more details on revenue numbers, projections, the likely date of an IPO and why Starlink looks like a really big deal. Exclusive.

Subscribe to my weekly newsletter to get articles like this in your inbox. It's a pretty good read - and the #1 software engineering newsletter on Substack.
