
Cloudflare comes clean on crashing a chunk of the web: How small errors and one tiny bit of code led to a huge mess


steven36


The culprit? .*(?:.*=.*)

 

https://s7d2.turboimg.net/sp/fcbcbdee468c8a9e21ebc215778ec2c7/mistake.jpg

 

Cloudflare has published a detailed and refreshingly honest report into precisely what went wrong earlier this month when its systems fell over and took a big wedge of the internet with it.

 

We already knew from a quick summary published the next day, and our interview with its CTO John Graham-Cumming, that the 30-minute global outage had been caused by an error in a single line of code in a system the company uses to push rapid software changes.

 

Even though that change had been run through a test beforehand, the blunder maxed out the CPUs on Cloudflare's servers and caused customers worldwide to get 502 errors from Cloudflare-backed websites. The full postmortem digs into precisely what went wrong and what the biz has done, and is doing, to fix it and stop any repetition.

 

The headline is that it was a cascade of small mistakes that caused one almighty cock-up. We're tempted to use the phrase-du-jour "perfect storm," but it wasn't. It was a small mistake and lots of gaps in Cloudflare's otherwise robust processes that let the mistake escalate.

 

First up, the error itself – it was in this bit of code: .*(?:.*=.*). We won't go into the full workings as to why because the post does so extensively (a Friday treat for coding nerds) but very broadly the code caused a lot of what's called "backtracking," basically repetitive looping. This backtracking got worse – disproportionately worse – the more complex the request and very, very quickly maxed out the company's CPUs.
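For the curious, here is a rough sketch of why that expression hurts. It is a toy model, not Cloudflare's engine: the function name count_attempts is made up for illustration, and it only counts how many ways a naive backtracker can split the input between the first two .* before giving up on finding an "=".

```python
import re

PATTERN = r".*(?:.*=.*)"  # the expression quoted in the article


def count_attempts(text: str) -> int:
    """Count the (i, j) splits a naive backtracker would try for PATTERN.

    Toy model only: the match is anchored at the start of `text`, the first
    `.*` ends at i, the second `.*` ends at j, and both are greedy, so the
    longest candidates are tried first.
    """
    n = len(text)
    attempts = 0
    for i in range(n, -1, -1):         # where the first .* stops
        for j in range(n, i - 1, -1):  # where the second .* stops
            attempts += 1
            if j < n and text[j] == "=":
                return attempts        # found "="; the trailing .* always fits
    return attempts                    # no "=" anywhere: every split was tried


if __name__ == "__main__":
    for length in (10, 20, 40, 80):
        print(length, count_attempts("x" * length))
```

In this simplified model, doubling the length of an input with no "=" in it roughly quadruples the work. A real engine also retries from every starting position in the request, which makes matters worse still.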

 

So the three big questions: why wasn't this noticed before it went live? How did it have such a huge impact so quickly? And why did it take Cloudflare so long to fix it?

 

The post answers each question clearly in a detailed rundown and even includes a lot of information that most organizations would be hesitant to share about internal processes and software, so kudos to Cloudflare for that. But to those questions…

I see you CPU

The impact wasn't noticed for the simple reason that the test suite didn’t measure CPU usage. It soon will – Cloudflare has an internal deadline of a week from now.
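What might a CPU check in a test suite look like? A minimal sketch follows, assuming a per-rule CPU budget. The names check_rule_cpu and budget_seconds are our own invention, not anything from Cloudflare's tooling.

```python
import re
import time


def cpu_time_of(func, *args) -> float:
    """Return the CPU seconds this process spends running func(*args)."""
    start = time.process_time()
    func(*args)
    return time.process_time() - start


def check_rule_cpu(pattern: str, samples, budget_seconds: float):
    """Run `pattern` over sample inputs and compare the worst case to a budget.

    Returns (within_budget, worst_cpu_seconds). The budget is an assumed
    threshold chosen by whoever owns the test suite.
    """
    compiled = re.compile(pattern)
    worst = 0.0
    for sample in samples:
        worst = max(worst, cpu_time_of(compiled.search, sample))
    return worst <= budget_seconds, worst
```

A test suite could fail any rule where the first value comes back False, which catches a CPU hog even when the rule's matches are all correct.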

 

The second problem was that a software protection system that would have prevented excessive CPU consumption had been removed "by mistake" just weeks earlier. That protection is now back in, although it clearly needs to be locked down.

 

The software used to run the code – the expression engine – also doesn't have the ability to check for the sort of backtracking that occurred. Cloudflare says it will shift to one that does.
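Switching engines is the real fix, but even a crude check before deployment can flag suspicious expressions. The sketch below is our own heuristic, not something described in Cloudflare's post: it simply counts unbounded wildcards such as .* and .+ in a pattern. It will misfire on escaped dots, so treat it as a prompt for human review rather than a verdict.

```python
import re

# Heuristic only: matches an unbounded wildcard, i.e. "." followed by "*" or "+".
_WILDCARD = re.compile(r"\.[*+]")


def count_unbounded_wildcards(pattern: str) -> int:
    """Count occurrences of `.*` or `.+` in the pattern text."""
    return len(_WILDCARD.findall(pattern))


def looks_risky(pattern: str, limit: int = 1) -> bool:
    """Flag patterns with more than `limit` unbounded wildcards (assumed threshold)."""
    return count_unbounded_wildcards(pattern) > limit
```

The expression from the outage contains three such wildcards, so it would be flagged; a pattern like a.*b would pass.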

 

So that's how it got through the checking process: what about the speed with which it impacted everyone?

 

Here was another significant mistake: Cloudflare seems to have got too comfortable with making changes to its Web Application Firewall (WAF). The WAF is designed to be able to quickly provide protection to Cloudflare customers – it can literally make changes globally in seconds.

 

And Cloudflare has in the past put this to good use. In the post, it points to the fast rollout of protections against a SharePoint security hole in May. Very soon after the hole was made public, the biz saw a lot of hacking efforts on its customers' systems and was able to cut them off almost instantly with an update pushed through the WAF. This kind of service is precisely what has given Cloudflare its reputation – and paying clients. It deals with the constant stream of security issues so you don't have to.

 

But it uses the system a lot: 476 change requests in the past 60 days, or the equivalent of one every three hours.

 

The code that caused the problem was designed to deal with new cross-site scripting (XSS) attacks the company had identified but – and here’s the crucial thing – it wasn't urgent that that change be made.

 

So Cloudflare could have introduced it in a slower way and noticed the problem before it became a global issue. But it didn't; it has various testing processes that have always worked and so it put the expression into the global system – as it has with many other expressions.
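The slower way might look something like the sketch below: push the change to a small slice of the fleet, look at CPU, and only then widen it. Everything here is hypothetical, including the stage fractions, the cpu_after_deploy callback and the cpu_limit threshold; it illustrates the idea and is not Cloudflare's deployment system.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    cpu_after_deploy: Callable[[float], float],
    cpu_limit: float,
) -> float:
    """Widen a deployment stage by stage, halting when CPU exceeds the limit.

    `cpu_after_deploy(fraction)` stands in for "deploy to this fraction of
    the fleet, wait, then read CPU utilisation" (a value between 0 and 1).
    Returns the largest fraction that stayed healthy.
    """
    healthy = 0.0
    for fraction in stages:
        cpu = cpu_after_deploy(fraction)
        if cpu > cpu_limit:
            return healthy  # stop here; the caller would roll this stage back
        healthy = fraction
    return healthy
```

With a rule that pins CPU at the very first stage, the function returns 0.0 and the rest of the fleet never sees the change.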

 

Cloudflare justifies this by pointing to the growing number of CVEs – Common Vulnerabilities and Exposures – that are published annually.

 

War Games redux

 

The impact, however, was that it created an instant global headache. What's more, the code itself was being run in a simulation mode – not in the full live mode – but because of the massive CPU consumption that it provoked, even within that mode it was able to knock everything offline as servers were unable to deal with the processing load.
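Why doesn't simulation mode save you? A small sketch makes the point, assuming a simulate flag that only changes what happens after a match. The Rule class and handle function are illustrative names, not Cloudflare's.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    pattern: str
    simulate: bool  # True: log what would have happened, never block


def handle(rule: Rule, request_body: str, log: list) -> str:
    """Evaluate one rule against a request body and return the action taken."""
    # The regex runs in both modes; only the action afterwards differs,
    # so an expensive pattern costs the same CPU either way.
    matched = re.search(rule.pattern, request_body) is not None
    if not matched:
        return "allow"
    if rule.simulate:
        log.append("would block")
        return "allow"
    return "block"
```

The expensive part, evaluating the expression, happens before the mode is even consulted.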

 

That's where it all went wrong. Now, why did it take Cloudflare so long to fix it? Why didn't it just do a rollback within minutes and solve the issue while it figured out what was going on?

 

The post gives some interesting details that will be familiar to anyone who has ever had to deal with a crisis: the problem was noticed through alerts and then everyone scrambled. The issue had to be escalated to pull in more engineers and especially more senior engineers who are allowed to make big decisions about what to do.

 

The mistakes here are all human: first, you have to physically get other human beings in front of screens, on phones, and in chatrooms. Then you have to coordinate quickly but effectively. What is the problem? What is causing it? How can we be sure that's right?

 

People get panicky under pressure and can easily misread or misunderstand the situation or decide the wrong thing. It takes a cool head to figure out what the truth is and figure out the best way to resolve it as quickly as possible.

 

It appears from Cloudflare's post that the web biz actually did really well in this respect – and we can have some degree of confidence in its version of events thanks to the timeline. Despite the obvious initial thought that the company was under some kind of external attack, it pinpointed the issue as being the WAF within 15 minutes of receiving the first alert. Which is actually a pretty good response time considering that no one was watching this rule change. It was a routine update that went wrong.

 

But there were several crucial delays. First, the automated emergency alerts took three minutes to arrive. Cloudflare admits this should have been faster. Second, even though a senior engineer made the decision to do a global kill on the WAF two minutes after it was pinpointed as the cause of the problem, it took another five minutes to actually process it.

 

Slow death

 

Why? Because the people authorized to issue the kill hadn't logged into the system for a while and its security controls had logged them out as a result. They had to re-verify themselves to get back in. When they did and authorized the kill, two minutes later it had kicked in globally and traffic levels went down to normal – making it clear that it was in fact the WAF that was the problem.

This is the timeline:

 

  • 13.42: Bad code posted
  • 13.45: First alert arrives (followed by lots of others)
  • 14.00: WAF identified as the problem
  • 14.02: Global kill on WAF approved
  • 14.07: Kill finally implemented (logging in)
  • 14.09: Traffic back to normal
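The gaps between those entries are easy to check. The sketch below copies the times from the timeline above, written with colons so the standard library can parse them; the helper names are ours.

```python
from datetime import datetime

# Times and labels taken from the timeline above.
TIMELINE = [
    ("13:42", "Bad code posted"),
    ("13:45", "First alert arrives"),
    ("14:00", "WAF identified as the problem"),
    ("14:02", "Global kill on WAF approved"),
    ("14:07", "Kill finally implemented"),
    ("14:09", "Traffic back to normal"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def step_durations(timeline):
    """Return (event, minutes since the previous event) for each later entry."""
    return [
        (timeline[i + 1][1], minutes_between(timeline[i][0], timeline[i + 1][0]))
        for i in range(len(timeline) - 1)
    ]


if __name__ == "__main__":
    for event, minutes in step_durations(TIMELINE):
        print(f"{minutes:>2} min  {event}")
    print("total:", minutes_between(TIMELINE[0][0], TIMELINE[-1][0]), "min")
```

The steps come to 3, 15, 2, 5 and 2 minutes, or 27 minutes from bad code to normal traffic, with the 15 minutes spent identifying the WAF as the largest single chunk.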

 

Cloudflare has changed its systems and approach in response, so in future this response time should go from 27 minutes to around 20 minutes (assuming it will always take some amount of time to figure out where the problem lies in a previously unidentified issue).

 

At this point, the problem was identified but WAF had been taken down so people were still experiencing problems. The Cloudflare team then had to figure out what in WAF had gone wrong, fix it, check it, and then restart it. That took 53 minutes.

 

This is where the impressive openness and honesty from Cloudflare up until this point gets a little more opaque. One paragraph covers this entire process:

 

"Because of the sensitivity of the situation we performed both negative tests (asking ourselves “was it really that particular change that caused the problem?”) and positive tests (verifying the rollback worked) in a single city using a subset of traffic after removing our paying customers’ traffic from that location. At 14:52 we were 100 per cent satisfied that we understood the cause and had a fix in place and the WAF was re-enabled globally."

 

There's no more information than that, although it does mention later on that "the rollback plan required running the complete WAF build twice, taking too long."

Timing off

It also mentions that the Cloudflare team "had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on" – although it's not clear if that led to delays in fixing the WAF.

 

It's hard to know without more detail whether Cloudflare did a great job here or whether its systems were found lacking – given its global reach and that its entire function as a company is around this kind of work.

 

For example: how long after the WAF was taken down did the engineers manage to pinpoint the specific code that caused the problem? Did they figure it out in five minutes and then run 47 minutes of tests? Or did it take them 47 minutes to find it and run five minutes of tests?

 

The fact that Cloudflare doesn't say in an otherwise very detailed and expansive post suggests that this was not its finest hour. You would imagine that it would simply bring up a log of all the changes made just prior to the problems, cut those changes out, rebuild, and test. Maybe it did.

 

Is 53 minutes a good timeframe to rebuild something that had just caused worldwide outages and put it live again? What do Reg readers think?

 

Anyway, that's how it went down. To its credit, Cloudflare also acknowledges that its communication during the crisis could have been better. For obvious reasons, all of its customers were clamoring for information but all the people with the answers were busy fixing it.

 

Worse, customers lost access to their Cloudflare Dashboard and API – because they pass through the Cloudflare edge which was impacted – and so they were really in the dark. The business plans to fix both these issues by adding automatic updates to its status page and by having a way to bypass the normal Dashboard and API approach in an emergency, so people can get access to information.

 

So there you have it. It's not clear how much of an impact this cock-up has had on people's confidence in Cloudflare. The post is keen to point out the company hasn't had a global outage in six years – not including Verizon-induced problems of course.

 

Its honesty, clear breakdown and list of logical improvements – including not posting non-urgent updates to its super-fast global update system – will go some way to reassure customers that Cloudflare is not going all-Evernote and building more and more services on top of sub-optimal code.

 

With luck it will be another six years until the Cloudflare-reliant internet goes down.

 

Source

 
