AWS Just Cannot Catch a Break

AWS has suffered an outage in one of its data centers for the third time this month. A power outage in the US-EAST-1 region this morning affected services including Slack, Asana, and Epic Games. The problems began at 7:30 a.m. ET, and the fallout is still wreaking havoc as of 1 p.m. ET, as AWS continues to report problems with a variety of services in the region, including its EC2 compute service and related networking functions. The single sign-on service in this region has also recently seen an increase in error rates.

“A single data center inside a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region has lost power,” the company said in an update at 8 a.m. ET. “This has an impact on the availability and connectivity of EC2 instances in the concerned data center and Availability Zone.” 

“For launches within the impacted Availability Zone, we are also seeing higher RunInstance API error rates. This problem has no effect on connectivity or power to other data centers within the impacted Availability Zone, or other Availability Zones within the US-EAST-1 Region, however, if you can, we suggest failing away from the affected Availability Zone (USE1-AZ4).”

It would have been unremarkable if this had been the only AWS outage in recent weeks. Given the complexity of today’s hyperscale clouds, disruptions are unavoidable. However, AWS is now experiencing disruptions on a weekly basis. The same US-EAST-1 region was down for hours on December 7 owing to a networking fault.

Then, on December 17, an outage affecting connectivity between two of Amazon’s West Coast regions knocked Netflix, Slack, and Amazon’s own Ring, among other services, offline. To make matters worse, all of these disruptions came right after AWS touted its cloud’s resiliency at its re:Invent conference earlier this month.

Of course, in an ideal world, none of these outages would occur. AWS users can protect themselves by architecting their systems to fail over to a geographically separate region, but this comes at a significant cost, so some users decide that the added expense isn’t worth the reduction in downtime. At the end of the day, AWS is responsible for maintaining a reliable platform.

While it’s difficult to determine whether the company is simply experiencing a run of bad luck or whether underlying issues have contributed to these troubles, if I were hosting a service in the US-EAST-1 region right now, I’d at least consider moving it elsewhere.