Amazon’s cloud-computing outage on Wednesday was triggered by effort to boost system’s capacity

Within a few hours, the malfunctions began hitting customers of Amazon Web Services, the company’s cloud-computing unit. Customers of the Amazon-owned Ring security camera service couldn’t log in or watch video. Users struggled to operate their iRobot vacuum cleaners because the outage affected the iRobot Home app. And media companies, including The Washington Post (owned by Amazon founder and chief executive Jeff Bezos), experienced publishing system outages.

Amazon acknowledged that the system failure was exacerbated by the interdependencies among its various services. The company had been trying to add capacity to its Amazon Kinesis service, which customers use to process real-time data including video, audio and application logs. To resolve the issue, Amazon needed to restart a piece of its system it described as “many thousands of servers,” a lengthy process that had to be done gradually. But because other Amazon cloud services, including its Cognito authentication offering, rely on Kinesis, they failed as well.

And because Amazon uses Cognito itself to let customers know about the status of its cloud operations through its Service Health Dashboard website, it couldn’t immediately update that site. The company has a backup method to update the site, but said “it is a more manual and less familiar tool for our support operators.”
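
The chain of failures described above can be sketched as a small dependency graph. This is only an illustration built from the three services the company named. The `DEPENDS_ON` table and the `affected_by` function are hypothetical names introduced here, not anything from Amazon's systems, and the real architecture is certainly more complex.

```python
from collections import deque

# Dependencies as described in the article: each service maps to the
# services it relies on. A simplified assumption, not AWS's real layout.
DEPENDS_ON = {
    "Kinesis": [],
    "Cognito": ["Kinesis"],
    "Service Health Dashboard": ["Cognito"],
}


def affected_by(failed, depends_on):
    """Return the services that directly or indirectly rely on `failed`."""
    # Invert the graph so each service lists the services that depend on it.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(affected_by("Kinesis", DEPENDS_ON)))
# prints ['Cognito', 'Service Health Dashboard']
```

In this sketch, a failure in Kinesis reaches the dashboard even though the dashboard never touches Kinesis directly, which is the kind of indirect exposure the company described.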

An Amazon spokeswoman didn’t respond Saturday to a request for comment about the outage. In the blog post, the company pledged to do “everything we can to learn from this event.”

The failure underscores the danger of having only a handful of vendors manage global cloud computing. Amazon held 45 percent of the global market in 2019, according to the market research firm Gartner. In addition to Ring and iRobot, Amazon’s customers include Netflix, BP and Capital One, all of which run significant pieces of their computing operations on AWS.

Source: WP