Massive East Coast Internet Outage Pinned On Amazon Cloud Failure

by Tyler Durden

Update 2: According to the latest AWS status update, at 11:35 AM PST: "We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue."


A check of Amazon's AWS status page shows a surge in "red" recent events.

* * *

Update: According to BGR, if it seems like your internet browsing is hitting more walls than usual today, you’ll be happy to know that it’s not your computer. A massive Amazon Web Services (AWS) outage is striking down lots and lots of web pages, leading to huge hiccups on a number of domains. Amazon is reporting the issue on its AWS dashboard, citing “Increased Error Rates,” which is a fancy way of saying that something is seriously broken.

Amazon Web Services is the cloud services arm of Amazon, and its Amazon Simple Storage Service (S3) is used by everyone from Netflix to Reddit. When it goes down — or experiences any type of increased latency or errors — it causes major issues downstream, preventing content from loading on web pages and causing requests to fail.

 

These instances are always a great reminder of how much of the internet relies on just a handful of huge companies to keep it up and running. An issue with Amazon’s S3 service creates a problem for countless websites that rely on its storage product to be up and running every second of the day. Unfortunately, there are always ghosts in the machine, and downtime is inevitable. Let’s all pray that Amazon gets everything sorted out in short order.

Also, moments ago, Amazon provided the following S3 status update:

Update at 10:33 AM PST: We're continuing to work to remediate the availability issues for Amazon S3 in US-EAST-1. AWS services and customer applications depending on S3 will continue to experience high error rates as we are actively working to remediate the errors in Amazon S3.

* * *

Earlier:

A disturbance that caused several prominent websites, including Imgur and Medium, to go offline, lose images, or run slowly has been traced to storage buckets hosted on Amazon's AWS. While not reporting any explicit failure, Amazon has posted a notice on its service health dashboard identifying "Increased Error Rates," adding: "We've identified the issue as high error rates with S3 in US-EAST-1, which is also impacting applications and services dependent on S3. We are actively working on remediating the issue."

The abnormal state reportedly kicked off around 0944 Pacific Time (1744 UTC) today.

The services impacted include various YouTube-linked apps, as well as the SEC's own website, which is suffering an ongoing outage: as of this moment it is impossible to conduct public filing searches there.

Many net participants have complained about the outage, which Amazon still refuses to fully acknowledge:

According to The Register, one chief technology officer reported that "we are experiencing a complete S3 outage and have confirmed with several other companies as well that their buckets are also unavailable. At last check S3 status pages were showing green on AWS, but it isn't even available through the AWS console."

Notably, Amazon had a similar "Increased Error Rate" event several years ago, which led to a hard reboot and an outage that lasted for several hours. It is unclear if Vladimir Putin was blamed for that particular incident.

Amazon S3 Availability Event: July 20, 2008

 

We wanted to provide some additional detail about the problem we experienced on Sunday, July 20th.

At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health by taking several actions to reduce system load. We reduced system load in several stages, but it had no impact on restoring system health.

 

At 9:41am PDT, we determined that servers within Amazon S3 were having problems communicating with each other. As background information, Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer's request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer requests.
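
To make the pattern concrete, here is a minimal sketch of the "gossip first, then forward the request" exchange described above. It is purely illustrative: the class and method names are invented for this example, and Amazon's actual implementation is not public.

```python
class StorageNode:
    """Toy model of a storage server that gossips peer state before serving requests.
    Hypothetical sketch only; not Amazon's implementation."""

    def __init__(self, name):
        self.name = name
        # what this node currently believes about every server it knows of
        self.peer_state = {name: "healthy"}

    def gossip_with(self, peer):
        """Merge state maps so both nodes converge on the same view of the system."""
        merged = {**self.peer_state, **peer.peer_state}
        self.peer_state = dict(merged)
        peer.peer_state = dict(merged)

    def forward_request(self, peer, request):
        """As in the post-mortem: gossip completes first, only then is the request sent."""
        self.gossip_with(peer)
        if self.peer_state.get(peer.name) != "healthy":
            return "route around unreachable server"
        return f"{peer.name} handled {request}"


a, b = StorageNode("node-a"), StorageNode("node-b")
print(a.forward_request(b, "GET /bucket/key"))  # node-b handled GET /bucket/key
```

The failure mode Amazon describes is visible even in this toy: if every call spends nearly all of its time inside gossip_with, or dies there, forward_request never gets around to the customer's request.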

 

At 10:32am PDT, after exploring several options, we determined that we needed to shut down all communication between Amazon S3 servers, shut down all components used for request processing, clear the system's state, and then reactivate the request processing components. By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system's state cleared. By 2:20pm PDT, we'd restored internal communication between all Amazon S3 servers and began reactivating request processing components concurrently in both the US and EU.

 

At 2:57pm PDT, Amazon S3's EU location began successfully completing customer requests. The EU location came back online before the US because there are fewer servers in the EU. By 3:10pm PDT, request rates and error rates in the EU had returned to normal. At 4:02pm PDT, Amazon S3's US location began successfully completing customer requests, and request rates and error rates had returned to normal by 4:58pm PDT.

 

We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
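
To illustrate that single-bit failure mode, here is a minimal sketch of checksumming an internal state message with MD5. The wire format (a 16-byte digest prepended to the payload) is invented for this example and is not Amazon's internal format.

```python
import hashlib

def frame_state_message(payload: bytes) -> bytes:
    """Prepend an MD5 digest so the receiver can verify integrity.
    The framing is hypothetical; Amazon's internal format is not public."""
    return hashlib.md5(payload).digest() + payload

def parse_state_message(message: bytes) -> bytes:
    """Reject any message whose payload no longer matches its digest."""
    digest, payload = message[:16], message[16:]
    if hashlib.md5(payload).digest() != digest:
        raise ValueError("corrupt state message rejected")
    return payload

# Flip a single bit in transit, as happened on July 20, 2008.
msg = bytearray(frame_state_message(b'{"node-42": "failed"}'))
msg[20] ^= 0x01   # one bit inside the payload; the text is still "intelligible"
try:
    parse_state_message(bytes(msg))
except ValueError as err:
    print(err)    # without the digest, the bad state would have been gossiped onward
```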

 

During our post-mortem analysis we've spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we're taking: (a) we've deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we've deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we've added additional monitoring and alarming of gossip rates and failures; and, (d) we're adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.
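
Action (c) above, alarming on gossip rates and failures, can be as simple as a sliding-window counter. The sketch below uses made-up thresholds and a made-up class name; the post-mortem does not describe Amazon's actual monitoring stack or its limits.

```python
import time
from collections import deque

class GossipFailureAlarm:
    """Sliding-window alarm on gossip failures, illustrating action (c).
    Window size and threshold are invented numbers, not Amazon's."""

    def __init__(self, window_seconds=60, max_failures=50):
        self.window_seconds = window_seconds
        self.max_failures = max_failures
        self.failure_times = deque()

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        self.failure_times.append(now)
        # drop anything that has aged out of the window
        while self.failure_times and self.failure_times[0] < now - self.window_seconds:
            self.failure_times.popleft()
        if len(self.failure_times) > self.max_failures:
            print(f"ALARM: {len(self.failure_times)} gossip failures "
                  f"in the last {self.window_seconds}s")


alarm = GossipFailureAlarm(window_seconds=60, max_failures=5)
for i in range(7):                        # simulate a burst of failed gossip attempts
    alarm.record_failure(now=1000.0 + i)  # fires once the count exceeds the threshold
```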

 

Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we're proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect.

 

Sincerely,

 

The Amazon S3 Team

Comments
ParkAveFlasher

All your data are belong to us.

Looney

 

Amazon must’ve caught the "Increased Error Rates" from the Washington Post.

Jeff Bezos infected both of those with a Sexually Transmitted Error.  ;-)

Looney

Bastiat

"we determined to shut down all communication between our S3 servers"

I'm sorry Dave, I can't do that.

Unreliable Narrator

Somebody jammed pizza into the server.

Shemp 4 Victory

In Pindostan they've already thrown a tantrum. Putin is to blame.

beemasters

Power/internet failures should be affecting cryptocurrencies too.

CClarity

Cyber attack by White House on Bezos? What if this is payback to WaPo?

Cyber attack by Bezos on Trump?    What if his address can't be televised live tonight?

HowdyDoody

The NSA would like to apologize for any loss of service during their 'upgrade' to the Amazon infrastructure. Also, please would you resend all emails you tried to send during the incident. We will make sure we catch them this time.

knukles

 The Russians Hacked it You Dumb Asses
Hi!  I'm from Amazon and here to help ya'
         Don't try this at home
          We're Professionals
         Whatever that means  

E.F. Mutton

Getting parts for a Commodore VIC-20 on short notice ain't easy you know

Joe Davola

Oh, and Dave - I'm not self identifying as a server today.

The Saint (Feb 28, 2017 2:47 PM)

I've been trying to download a file off of S3.Amazonaws.com for the last hour.  Keep getting a "connection timed out" message.  Frustrating.

 

wildbad

is this on the $600 Million "INVESTMENT" cloud that the CIA special ordered with Amazon Prime from Bezos as an entree to the Washington Post purchase?

THAT Cloud?

Erek

I was just surfin' some porn and----------------------------------------------------------------------------------*

Saucy-Jack

This is why we need to go cashless.

Money should always depend on the internet and electricity.

Just microchip us all in the name of safety.

What's the problem?

Raffie

Could not bring up Amazon orders today.

Went to my Credit Union online and they had an error message about connection issues.

Paper Boy (Feb 28, 2017 2:38 PM)

If you can't fdisk it, you don't own it!

Looney

 

Speaking of “fdisk”…

Very few people know that it has the “/mbr” switch. It restores the Master Boot Record. Enjoy!  ;-)

Looney

Paper Boy (Feb 28, 2017 2:47 PM)

Yes, that is a good one to know.

wildbad

/mbr means your data is gone...for ever

Looney

 

There are quite a few viruses designed to delete or damage the Master Boot Record. THEN, all your data is gone. FDISK /MBR restores the MBR and all partitions and files “magically” re-appear.

FDISK /? or FDISK /HELP don’t show the switch, but it’s always there, like a bad case of herpes.  ;-)

Looney

Bastiat

Good to know.  But restores it as of when? How often does it update?

Curiously_Crazy

No. It doesn't.

The MBR is just the boot sector at the very start of the disk. All it does is point to where to load the OS - if you've ever done something as simple as install Linux you'll get an option to install GRUB (GNU GRand Unified Bootloader) to the MBR. It's not rocket science and you're all way overthinking this. How do you think Windows sits alongside a Linux install?

It's trivial to restore your MBR and any teenage kid with a hint of any real computer skills knows it.

rejected

Hell Looney,,, most today have no clue what fdisk is.

Bryan

Now I really feel old.

Winston Churchill

My punch card reader baffles my visitors.

ejmoosa

I used to call IBM for support for my punch card reader.  It was so old the model number was only two digits.

They used to argue with me that it had to be more...

I believe it was a Model 19...

peddling-fiction

Many people here would be surprised by what your punch cards have done.

JohnG

They look at my acoustic coupler 300 baud modem and rotary dial telephone set as if they are relics.....

Erek

What the DOS are you talkin' about? /s

buzzsaw99

tax amazon repeatedly

peddling-fiction

Tax servers because they are, um, robots.

Start with Microsoft servers as a test case scenario.

Thank Bill Gates for his great ideas.

risk.averse

Tax...yeah right. How come Amazon was able to declare losses year after year for much of its existence? I know that accountants can be pretty creative but they must have bent/broken the rules somewhere/sometime??

Memo to Bezos: when you go ahead and automate your operation, firing thousands of employees, you will lose the one moral lever in your armory: jobs for the masses. Make truckloads of money and employ people, and folks will give you a pass. Don't "share the fruits" and sooner or later it'll bite you in the butt. Unless, of course, you can steer the political system so the masses aren't heard. You'll need to get into the mass media to do that...oh wait, Bezos just did that :( ....d'oh!

E.F. Mutton

"And that's how those pictures ended up in my account!" - J. Pedosta

Handful of Dust

In very simple terms, what does the SEC have to do with Amazon Cloud? Are they storing their midget tranny photos on there?

Steaming_Wookie_Doo

If the SEC is down, does anyone actually notice?

carbonmutant

I don't think this is confined to Amazon; Stocktwits has been down all morning...

Dr. Engali

Fucking Putin at it again!

konadog

Yea - the MSM is slipping. Why doesn't the CIA teleprompter already have a narrative claiming this was Trump and the Russians?

esum

AMZN runs the CIA cloud..... sweet crony $600 million deal

 

buzzsaw99

making it easier to hack so it's actually a win win

Nobodys Home

So ZH must be hosted on S1 or S2 huh?

peddling-fiction

Good point. Too lazy today to check myself where ZH is in the Amazon jungle.

earlyberd

Cloud services are particularly vulnerable to increasing error rates. That's a "feature" of building your clusters on top of commodity hardware; no more error checks performed in hardware, so you try to make up for it with redundancy.

Seems like Amazon didn't allocate enough redundancy. Race to the bottom wins again!

ParkAveFlasher

They'll make it up in volume.

rejected

Love the Bullshit used to explain it all.

So all is good,,, go ahead, store your data in 'the cloud'.  Safe as a bug in a rug.

Mr. Pain

And 7 minutes later the price of gold began its deep dive.