Instructive Lesson From The Amazon S3 Outage

March 10, 2017

On Tuesday, February 28, 2017, an Amazon employee made a typo during a routine repair of the billing system. Rather than taking only a few servers offline, the mistake took down a significant (but still unspecified) number of the nearly 150,000 unique websites and apps hosted or supported by Amazon's S3 cloud storage service.

Affected sites and apps across the web

Around noon EST, users noticed that websites including Netflix, Spotify, Pinterest, BuzzFeed, Slack, GitHub, Trello, Venmo, Quora, Sailthru, Business Insider, and Giphy, among others, were unavailable or responding inconsistently. The downtime didn’t end there: because Amazon supports many “Internet of Things” home devices, thermostats controlled via Nest and other connected appliances were similarly unavailable. The issue was resolved after about four hours; however, because Amazon’s own monitoring website was hosted on an affected server, accurate information about the outage did not spread uniformly.

A harbinger of issues to come?

Amazon Web Services (AWS) is the leading cloud service provider, with 40% of the worldwide market for cloud services currently in use, serving clients as small as individuals and as large as multinational corporations. AWS accounts for 8% of Amazon’s total yearly revenue. Any time that much data is organized and administered through one hub, it is vulnerable to disruption. The cause wasn’t released immediately at the time of the outage, but it turned out to be simple human error.

Bugs in code, hardware failures, and even weather-related issues can be offset by adding redundant servers and services, but redundancy won’t alleviate the risks present when people get involved. Barring the impending perfection of artificial intelligence, there is no way to remove people from the technology world; the most we can do is minimize the potential for disaster. In its statement on the incident, Amazon announced that it is “making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards….”
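To make the idea of "removing capacity more slowly" with safeguards concrete, here is a minimal sketch in Python. It is not Amazon's tool: the names Subsystem, remove_capacity, min_required, and max_per_step are hypothetical, and the specific safeguard shown (a minimum capacity floor) is our assumption about what such a check might look like.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would violate a safeguard."""


@dataclass
class Subsystem:
    # All fields are hypothetical, for illustration only.
    name: str
    active_servers: int
    min_required: int  # assumed floor below which the subsystem cannot run


def remove_capacity(subsystem, requested, max_per_step=2):
    """Remove servers in small steps and refuse to go below the floor.

    Returns the list of step sizes that were applied.
    """
    if requested <= 0:
        raise CapacityError("requested removal must be positive")

    # Safeguard: reject the whole request before touching anything,
    # so a mistyped number cannot take the subsystem below its floor.
    if subsystem.active_servers - requested < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    steps = []
    remaining = requested
    while remaining > 0:
        # Rate limit: never remove more than max_per_step at once.
        # A real tool would presumably pause and check health between steps.
        step = min(max_per_step, remaining)
        subsystem.active_servers -= step
        steps.append(step)
        remaining -= step
    return steps
```

With a subsystem of 10 servers and a floor of 6, a request to remove 3 is applied in steps of 2 and 1, while a mistyped request to remove 50 is rejected outright and nothing goes offline.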

Cautions and positivity

This unfortunate and costly outage did have some encouraging results. Amazon was able to use another, unaffected channel, namely Twitter, to communicate the extent of the problem and the progress toward resolving it. The incident also highlights the real risks a company takes on when it relies on only one cloud-based file system: even when the data is physically distributed, shared infrastructure can still be a hazard. The solution might be to spread files across several cloud storage providers entirely, from AWS to Google and others. For most companies the cost of such a move can be prohibitive, but so too is the revenue lost during downtime like that just experienced.
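As a rough illustration of spreading files across more than one provider, the sketch below writes each object to every provider and reads from the first one that responds. Everything in it is hypothetical: InMemoryProvider stands in for a real cloud storage client, and MultiProviderStore, ProviderDown, and ObjectUnavailable are names invented for this example rather than part of any vendor's API.

```python
class ProviderDown(Exception):
    """Raised by a provider that is currently unreachable."""


class ObjectUnavailable(Exception):
    """Raised when no provider can serve the requested object."""


class InMemoryProvider:
    """Hypothetical stand-in for a real cloud storage client."""

    def __init__(self, name):
        self.name = name
        self.available = True  # flip to False to simulate an outage
        self._objects = {}

    def put(self, key, data):
        if not self.available:
            raise ProviderDown(self.name)
        self._objects[key] = data

    def get(self, key):
        if not self.available:
            raise ProviderDown(self.name)
        return self._objects[key]


class MultiProviderStore:
    """Replicates writes to all providers and falls back on reads."""

    def __init__(self, providers):
        self.providers = list(providers)

    def put(self, key, data):
        # Try every provider; report which ones accepted the write.
        written = []
        for provider in self.providers:
            try:
                provider.put(key, data)
                written.append(provider.name)
            except ProviderDown:
                continue
        if not written:
            raise ObjectUnavailable(f"no provider accepted {key!r}")
        return written

    def get(self, key):
        # Return the first copy found, skipping providers that are
        # down or that never received the object.
        for provider in self.providers:
            try:
                return provider.get(key)
            except (ProviderDown, KeyError):
                continue
        raise ObjectUnavailable(f"no provider could serve {key!r}")
```

In this sketch, if the first provider goes down after a file has been written, a read is still served from the second. The trade-off is the one described above: every file is stored, and paid for, more than once.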

Want to See How Volico Data Center Can Help You?

If you are concerned about how an incident like this could impact your business, the time to plan for such a disruption is now. Contact Volico at 888-865-4261. A member of our team is standing by to assist you in determining what type of solutions would benefit your IT operations.
