You're so worried about AWS reliability, the cloud giant now lets you simulate major outages

Fake it 'til you break it, for a whole availability zone or WAN FAIL

re:Invent By The Register's count, Amazon Web Services has made at least 192 product announcements in the past four days at its re:Invent conference.

But only one has made your correspondent worry: news that Amazon customers have been asking it to offer simulations of what happens when an entire availability zone – a group of datacenters designed for extreme resilience – goes offline due to a power outage.

AWS already offers failure simulations with a tool called the Fault Injection Service (FIS). As the name implies, the service allows users to stage fake faults so they can test their ability to recover from the unexpected.

That sort of thing is routine for big hyperscalers: Netflix even created a tool called Chaos Monkey to randomly break its systems so its engineers could get better at effecting fixes.
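
For readers who have never seen the idea in action, here is a toy sketch of random fault injection in the Chaos Monkey spirit. It is not Netflix's code and touches no real infrastructure: the instance names and the `pick_victim` and `run_chaos_round` helpers are invented for illustration.

```python
import random


def pick_victim(instances, rng):
    """Return one instance name chosen at random, or None if the fleet is empty."""
    if not instances:
        return None
    # Sort first so the choice depends only on the RNG, not on set ordering.
    return rng.choice(sorted(instances))


def run_chaos_round(instances, rng):
    """Remove one random instance and return (victim, survivors)."""
    victim = pick_victim(instances, rng)
    survivors = set(instances)
    if victim is not None:
        survivors.discard(victim)
    return victim, survivors


# Hypothetical fleet; in a real drill these would be live servers.
fleet = {"web-1", "web-2", "web-3"}
victim, survivors = run_chaos_round(fleet, random.Random(42))
print(f"terminated {victim}; {len(survivors)} instances still serving")
```

The point of the exercise is what happens next: if the survivors cannot carry the load, engineers have found a weakness before a real failure did.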

FIS is cut from the same cloth. AWS suggests the service "makes it easier for teams to discover an application's weaknesses at scale in order to improve performance, observability, and resilience." FIS offers the chance to simulate an EC2 instance suffering stress on its disks, CPU, or memory. Users can also explore what happens when whole Kubernetes pods are deleted.

This week's announcement adds the ability to simulate power failure at an Amazonian availability zone (AZ).

The tests deliver simulated "loss of zonal compute (Amazon EC2, EKS, and ECS), no re-scaling of compute in the AZ, subnet connectivity loss, RDS failover, ElastiCache failover, and unresponsive EBS volumes. By default, actions for which no targets are found will be skipped." Even the symptoms a cloudy VM may experience when rebooting from a power outage are included.

Another novel failure test – the "Cross-Region: Connectivity" scenario – lets users imagine what happens if links between AWS regions go down. Cloud regions, readers will recall, are clusters of datacenters. Hyperscale cloud operators typically advise customers to use multiple availability zones and regions to ensure resilience.
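
The multi-zone advice can be checked on paper before any drill. The sketch below is a back-of-envelope model, not anything AWS ships: the fleet layout, the zone labels, and the `surviving_capacity` and `survives_any_single_az_loss` helpers are all assumptions made for illustration.

```python
def surviving_capacity(placement, failed_az):
    """Count instances left after every instance in failed_az is lost."""
    return sum(1 for az in placement.values() if az != failed_az)


def survives_any_single_az_loss(placement, required):
    """True if losing any one zone still leaves at least `required` instances."""
    zones = set(placement.values())
    return all(surviving_capacity(placement, az) >= required for az in zones)


# Hypothetical fleet: instance name -> availability zone.
placement = {
    "app-1": "az-a", "app-2": "az-a",
    "app-3": "az-b", "app-4": "az-b",
    "app-5": "az-c", "app-6": "az-c",
}

# Six instances spread evenly over three zones: any single zone loss leaves four.
print(survives_any_single_az_loss(placement, required=4))
```

If the application needs five instances to cope with peak load, the same check fails, which is the kind of gap a simulated zone outage is meant to expose.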

Which is where things get a little scary – AWS has described these FIS scenarios as "highly requested."

That presumably indicates a decent number of AWS customers feel the need to drill for major cloud outages.

Fair enough: AWS outages aren't frequent, but when they occur the blast radius can be enormous. A 2017 outage took out websites worldwide. In December 2021, three outages struck the US-EAST-1 Region, again causing disruptions.

By responding to customer requests with more comprehensive failure sims, AWS appears to be acknowledging concerns about the impact of big outages. Not to mention their seeming inevitability – despite Amazon's enormous efforts to ensure resilience. ®
