
How a Simple Command Typo Took Down Amazon S3 and a Big Chunk of the Internet


tao


The major internet outage across the United States earlier this week was not due to a virus, malware, or a state-sponsored cyber attack. It was the result of a simple typo.

 

Amazon on Thursday admitted that an incorrectly typed command during a routine debugging of the company's billing system caused the 5-hour-long outage of some Amazon Web Services (AWS) servers on Tuesday.

 

The issue made tens of thousands of websites and services completely unavailable, while others showed broken images and links, leaving online users around the world confused.

 

The sites and services affected by the disruption included Quora, Slack, Medium, Giphy, Trello, Splitwise, Soundcloud, and IFTTT, among many others.

 

Here's What Happened:

 

On Tuesday morning, members of the Amazon Simple Storage Service (S3) team were debugging the S3 cloud-storage billing system.

 

As part of the process, the team needed to take a few billing servers offline, but a mistyped command ended up taking down a much larger set of servers than intended.

 

    "Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended," Amazon said. "The servers that were inadvertently removed supported two other S3 subsystems." …Whoops.

 

As for why it took longer than expected to restart certain services, Amazon says that some of these subsystems had not been completely restarted in "many years."

 

Since the S3 system has experienced massive growth over the last several years, "the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."

 

The company apologized for the inconvenience faced by its customers and promised that it will be putting new safeguards in place.

 

Amazon said the company is making "several changes" as a result of this incident, including steps to prevent an incorrect input from triggering such problems in the future.
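To illustrate the kind of safeguard described above, here is a minimal Python sketch of an input check that refuses to remove too many servers at once. It is not Amazon's actual tooling: the names `Subsystem`, `check_removal`, `min_required`, and `max_fraction`, along with the example numbers, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass(frozen=True)
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor needed to keep serving traffic


def check_removal(subsystem: Subsystem, requested: int,
                  max_fraction: float = 0.05) -> int:
    """Validate a removal request and return the servers left afterwards.

    Both limits are assumptions made for this sketch:
    - never drop below the subsystem's minimum required capacity
    - never remove more than max_fraction of the fleet in one command
    """
    if requested <= 0:
        raise UnsafeRemovalError("requested count must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise UnsafeRemovalError(
            f"{subsystem.name}: removing {requested} leaves {remaining}, "
            f"below the minimum of {subsystem.min_required}"
        )

    # Allow at least one server so that small fleets can still be serviced.
    per_command_limit = max(1, int(subsystem.active_servers * max_fraction))
    if requested > per_command_limit:
        raise UnsafeRemovalError(
            f"{subsystem.name}: {requested} exceeds the per-command "
            f"limit of {per_command_limit}"
        )

    return remaining


if __name__ == "__main__":
    billing = Subsystem(name="billing", active_servers=1000, min_required=800)
    print(check_removal(billing, 20))  # small, intended removal is accepted
    try:
        check_removal(billing, 200)  # a mistyped, much larger count
    except UnsafeRemovalError as err:
        print(f"rejected: {err}")
```

With a check like this in front of the command, a typo that inflates the server count is rejected before anything is taken offline, rather than being carried out as typed.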

 

The typo that caused the internet outage this week also knocked out the AWS Service Health Dashboard, so the company had to use its Twitter account to keep customers updated on the incident.

 

As a result, Amazon is also changing the administration console for the AWS Service Health Dashboard so that it can run across multiple regions.

 
