Amazon outage started small, snowballed into 12-hour event

By Brandon Butler, Network World

Amazon Web Services has almost fully recovered from a more than 12-hour event that appears to have started by impacting only a small number of customers but quickly snowballed into a larger issue that took down major sites including Reddit, Imgur and others yesterday.

AWS has not yet said what caused the failure, but the company posted frequent updates throughout the day. It noted several times that customers who had architected their systems according to AWS's best practice of spreading workloads across multiple availability zones were less likely to have experienced issues.

AWS GOES DOWN: Amazon EBS failure brings down Reddit, Imgur, others

WE'VE BEEN HERE BEFORE: Amazon outage one year later: Are we safer?

AWS first reported an issue shortly before 11 a.m. PT on Monday when it said a "small number" of Elastic Block Storage (EBS) volumes in a single availability zone in the US-East-1 region were experiencing degraded performance. EBS is a block storage service used in conjunction with Elastic Compute Cloud (EC2).

About an hour later, AWS dropped the language saying that only a "small" number of customers were being impacted. By 2:20 p.m. PT, AWS said it had restored about half of the impacted volumes and reiterated its long-standing advice that customers who spread workloads across multiple availability zones should not have been affected.
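For readers wondering what spreading a workload across availability zones looks like in practice, the following is a minimal, hypothetical sketch using the boto3 Python SDK; the AMI ID, instance type and zone names are placeholders chosen for illustration, not values taken from AWS's guidance or from this incident.

import boto3

# Hypothetical sketch: run one copy of a workload in each of two
# availability zones in US-East-1, so the loss of a single zone does
# not take down the whole application. AMI ID and zones are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

for zone in ["us-east-1a", "us-east-1b"]:
    ec2.run_instances(
        ImageId="ami-12345678",                # placeholder AMI
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},  # pin each copy to a different AZ
    )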

While AWS continued to restore impacted EBS volumes throughout the afternoon, a subsequent issue appears to have arisen around 6:30 p.m., when AWS reported elevated error rates for associating IP addresses with Elastic Load Balancers (ELBs); that problem was resolved about an hour later. ELBs distribute incoming traffic across instances, either within a single availability zone or across multiple AZs.
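To make the ELB piece concrete, here is another hypothetical boto3 sketch that creates a Classic Load Balancer spanning two availability zones and registers one backend instance from each; the load balancer name, zones and instance IDs are placeholders, not resources involved in the outage.

import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Hypothetical example: a Classic Load Balancer that spans two AZs and
# forwards HTTP traffic to backend instances in either zone.
elb.create_load_balancer(
    LoadBalancerName="example-web-elb",
    Listeners=[{
        "Protocol": "HTTP",
        "LoadBalancerPort": 80,
        "InstanceProtocol": "HTTP",
        "InstancePort": 80,
    }],
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)

# Register one backend instance per zone so either zone can keep serving
# traffic if the other degrades. Instance IDs are placeholders.
elb.register_instances_with_load_balancer(
    LoadBalancerName="example-web-elb",
    Instances=[{"InstanceId": "i-0aaa11111111aaaaa"},
               {"InstanceId": "i-0bbb22222222bbbbb"}],
)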

As of early today, AWS's status updates say the company has reached out via email to certain customers who are still impacted by the event and may have to take action. Other customers may experience increased volume input/output (I/O) latency as the EBS volumes continue a re-mirroring process throughout the day.

EBS volumes weren't the only service impacted during yesterday's outage, though. The Relational Database Service (Amazon RDS) also went down for a "small number" of customers shortly after 11 a.m. PT on Monday; that service was mostly recovered about two hours later. As of 4 a.m. PT on Tuesday, AWS reported that it was still working to restore full functionality to RDS.

As with the EBS issue, AWS reminded customers that if they had enabled the Point-in-Time Restore option, they could launch a new database instance in another availability zone using a backup of the impacted database.
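As a rough illustration of that recovery path, the boto3 sketch below restores a database to its latest restorable point in time and places the copy in a different availability zone; the instance identifiers, instance class and zone are hypothetical, not values from AWS's notice.

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Hypothetical example: recover a database whose home AZ is impaired by
# restoring it to the latest restorable point in time in another AZ.
# All identifiers below are placeholders.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="prod-db",
    TargetDBInstanceIdentifier="prod-db-recovered",
    UseLatestRestorableTime=True,      # or pass RestoreTime=<datetime>
    DBInstanceClass="db.m5.large",
    AvailabilityZone="us-east-1b",     # place the copy in a healthy AZ
)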


Originally published on Network World.