Key Takeaways
- A race condition between two automated systems caused a massive AWS outage
- The DNS failure created a cascading effect that took down major services
- Companies including Netflix, Starbucks and United Airlines were affected
- Amazon has implemented fixes to prevent similar incidents
A massive AWS outage that disrupted popular apps and services worldwide originated from a simple bug where two automated systems tried to update the same data simultaneously. What began as a minor glitch escalated into a serious incident that required emergency intervention from Amazon engineers.
The Domino Effect Explained
The outage prevented users from ordering food, accessing hospital networks, using mobile banking, and connecting to security systems. Major global companies including Netflix, Starbucks and United Airlines temporarily lost their online services.
Amazon acknowledged the significant impact in a statement: “We apologize for the impact this event caused our customers. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”
Technical Breakdown: The DNS Phonebook Failure
The core issue was a race condition: two programs tried to write the same DNS entry, essentially a record in the internet’s phonebook, at the same time. The conflicting writes left the entry empty, which disrupted multiple AWS services.
Angelique Medina, head of Cisco’s ThousandEyes Internet Intelligence, explained: “The analogy of a telephone book is pretty apt in that the folks on the other line are there, but if you don’t know how to reach them, then you have a problem. And that telephone book effectively went poof.”
Classroom Analogy Makes Complex Issue Understandable
Indranil Gupta, a University of Illinois professor, used a classroom example to illustrate the technical problem. He described two students – one fast, one slow – collaborating on a shared notebook.
The slower student “pays attention in brief bursts, but their work may conflict or contradict the work of the faster student,” while the quicker student constantly tries to “fix” things quickly and deletes the slower student’s outdated work.
“The result… an empty page (or crossed out page) in the lab notebook, when the teacher comes and inspects it,” he wrote.
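In code, the failure described here is a classic last-writer-wins race. The Python sketch below is purely illustrative (the names and data structures are hypothetical, not AWS’s actual systems): a slow updater writes back a stale snapshot of a DNS-style record, and a faster cleaner then deletes what it sees as outdated work, leaving the “empty page”:

```python
import threading
import time

# Shared "phonebook" entry: maps a service name to its DNS records.
# Stands in for the DynamoDB endpoint record; names are illustrative.
phonebook = {"dynamodb.example": ["10.0.0.1", "10.0.0.2"]}

def slow_updater():
    """Takes a snapshot, falls behind, then writes the now-stale data back."""
    stale_plan = list(phonebook["dynamodb.example"])  # snapshot taken early
    time.sleep(0.2)                                   # the "slow student"
    phonebook["dynamodb.example"] = stale_plan        # overwrites with old data

def fast_cleaner():
    """Treats the slow updater's output as outdated and deletes it."""
    time.sleep(0.3)                                   # runs after the stale write
    phonebook["dynamodb.example"] = []                # "cleanup" leaves an empty entry

t1 = threading.Thread(target=slow_updater)
t2 = threading.Thread(target=fast_cleaner)
t1.start(); t2.start()
t1.join(); t2.join()

print(phonebook)  # {'dynamodb.example': []} -- the empty page
```

The sleeps only force one particular interleaving for the demonstration; in a real system the ordering is unpredictable, which is what makes this class of bug hard to catch.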
Cascading Failures and Recovery Challenges
This “empty page” scenario brought down AWS’ DynamoDB database, creating a chain reaction that affected EC2 virtual servers and Network Load Balancer services. When DynamoDB recovered, EC2 struggled to bring all servers online simultaneously.
Amazon’s Response and System Improvements
Amazon has implemented several changes, including fixing the “race condition scenario” that allowed the two systems to overwrite each other’s work, and expanding testing for EC2 services.
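One common way to close this kind of race, shown here as a general illustration rather than Amazon’s actual fix, is to attach a version to each record and reject any write that carries an older plan, so a stale update can no longer wipe out newer data:

```python
import threading

class DnsRecord:
    """A record guarded by a version check (optimistic concurrency).
    A hypothetical sketch of the general technique, not AWS's implementation."""
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.addresses = ["10.0.0.1", "10.0.0.2"]

    def apply_plan(self, plan_version, addresses):
        # Conditional write: ignore any plan older than what is already applied.
        with self._lock:
            if plan_version <= self.version:
                return False          # stale plan rejected instead of overwriting
            self.version = plan_version
            self.addresses = addresses
            return True

record = DnsRecord()
record.apply_plan(2, ["10.0.0.3"])   # newer plan: applied
applied = record.apply_plan(1, [])   # stale "cleanup": rejected
print(applied, record.addresses)     # False ['10.0.0.3']
```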
Professor Gupta noted that while such large-scale outages are rare, they’re inevitable in complex systems. “Large scale outages like this, they just happen. There’s nothing you can do to avoid it, just like how people get ill,” he told CNN. “But I think how the company reacts to the outages and keeps customers informed is really, really key.”