Netflix/Amazon Outage December 24th

Note: I know I haven’t been consistent with the blogging this last year or so; it’s been crazy busy for me professionally and personally. Lots of exciting things going on, plus I relocated to the San Francisco Bay Area for a fantastic job opportunity, and I launched a small consulting business last year to help out a few friends and my former employer after I left. Overall, though, I’m hoping to blog at least weekly on some topic, either personal, tech related, or just of general interest.

Netflix Outage on Christmas Eve

I’m shocked at how much press attention the Netflix outage on Christmas Eve has gathered. It’s not like this is the first time that either Amazon or Netflix has had an outage, nor, sadly, will it be the last. I think a large part of the scorn is that it hit Netflix at an unfortunate time, when a lot of their users actually wanted to use the service for Christmas specials, holiday traditions, etc.

Most interesting is the blame that Amazon is taking for this outage. While they did cause the initial issue, it is Netflix’s job to resolve it and prevent future outages. (Amazon’s blog post on the outage: http://aws.amazon.com/message/680587/) The cloud is fantastic: it lowered the barrier to entry for a ton of startups and provided Amazon-scale infrastructure to companies who wanted to focus on the software, not the infrastructure. That power, though, comes with greater complexity and business challenges that must be addressed. (Spider-Man’s “With great power comes great responsibility” comes to mind.)

When you rely on equipment, services, and providers that you ultimately have little control over, you must plan for failure. You must assume that any component you rely on can be gone at any moment, or perform at a suboptimal level, and that there is nothing you can do in those scenarios. At my last employer we hit several of these “Amazon oops” moments. First, we had our application deployed in only one region and Availability Zone. This is the same as running all of your servers and systems in one datacenter, supported by a single provider/telco/utility/etc. It’s a huge Single Point of Failure (SPOF).

Next we moved to a single region with multiple Availability Zones. While this is a nice improvement, it still bit us when an Amazon technician made a human error that killed EBS and storage in the East region. Suddenly we realized that while Amazon advertises that each Availability Zone is isolated from the others at the hardware level, the control tier and shared services back end can and does get shared across multiple AZs. (In reality, some of the control tier is global in nature, based on some of the recent RCAs.)

It of course took code changes and infrastructure enhancements for us to tackle multiple AZs in a single region. Some of the things you may need to do in your application are (a rough sketch follows the list):

  1. A read/write master/slave replication strategy (NoSQL and relational databases all have varying ways to accomplish this, depending on your software)
  2. Traffic load balancing between each AZ for incoming user traffic (if active/active)
  3. Application awareness of databases and database states. Some of this is handled by the drivers for the database (e.g., MongoDB), although implementations are a mixed bag and must be heavily tested
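
As a rough illustration of item 3, here is a minimal sketch of what AZ-aware database access can look like with the MongoDB driver. The hostnames and replica-set name are made up; the point is that reads can be served by a secondary in another AZ while writes must be acknowledged by a majority of members:

```python
# Minimal sketch (hypothetical hostnames): a replica set with one member per AZ.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://db-1a.example.com,db-1b.example.com,db-1c.example.com",
    replicaSet="rs0",                     # one member per Availability Zone
    readPreference="secondaryPreferred",  # reads may be served by a secondary
    w="majority",                         # writes must reach a majority of members
)

orders = client.shop.orders
orders.insert_one({"user": "alice", "item": "dvd"})  # routed to the primary
recent = orders.find_one({"user": "alice"})          # may hit a nearby secondary
```

The tradeoff is exactly the testing burden mentioned above: secondary reads can be stale, so every read path has to be checked against what the application can actually tolerate.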

The above is an additional cost in complexity, testing, load testing, network design, and thought about how the system is developed and deployed. Once you realize that this still isn’t good enough, you start talking about multiple regions and/or being cloud resilient (Amazon & Rackspace, etc.). This adds new complexities that you now must factor in:

  1. Global load balancing now becomes a must, or at least an enhanced round-robin DNS service
  2. Increased latency between sites; synchronous commits rapidly become damaging to performance
  3. Application complexity increases roughly 3x with two regions/clouds, but expect a 4-5x increase in complexity if you add more
  4. Issues must be actively detected by monitoring, and the affected nodes/systems isolated, as the number of users impacted could be small or large, or worse, impossible to detect (a bare-bones health-check sketch follows this list)
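
As a bare-bones illustration of item 4, here is a sketch of a cross-region health check. The endpoints and the /healthz path are hypothetical; in practice the “pull out of rotation” step would be an API call to your global load balancer or DNS provider:

```python
# Probe each region's public endpoint and report the ones that look down.
import urllib.request

REGION_ENDPOINTS = {
    "us-east-1": "http://app-us-east-1.example.com/healthz",
    "us-west-2": "http://app-us-west-2.example.com/healthz",
}

def healthy(url, timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

for region, url in REGION_ENDPOINTS.items():
    if not healthy(url):
        # Here you would call your DNS / global load balancer API
        # to stop routing users to the failed region.
        print(f"pull {region} out of rotation")
```

Even a check this simple surfaces the hard part: deciding how many failed probes, from how many vantage points, justify failing over an entire region.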


I give lots of kudos to Netflix for Chaos Monkey; not a lot of people have the stomach to have a “rogue” agent in their systems breaking stuff on purpose and testing their resiliency. But as more and more companies move to the cloud, the practice must become more common, at least in lab environments. (http://techblog.netflix.com/2011/07/netflix-simian-army.html) A stripped-down sketch of the idea follows.
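This is not Netflix’s actual tool, just the core idea in a few lines, aimed at a lab environment. The “chaos=optin” tag is a made-up convention; the point is that only instances explicitly opted in are eligible targets:

```python
# Pick one random opted-in, running instance and terminate it (lab use only).
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos", "Values": ["optin"]},            # hypothetical opt-in tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
if instances:
    victim = random.choice(instances)
    print(f"terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```

The value isn’t the termination itself; it’s forcing the team to watch what the rest of the system does when a node disappears without warning.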

Global scale, once the domain of tech giants (Yahoo, Google, Microsoft, Amazon), is now available to the masses. Of course, finding the tech talent who has dealt with scale at this level is difficult, and/or they’re pretty happy at the companies they already work for. The devops community is a huge help in this area, with folks sharing their infrastructure, war stories, and solutions for scale, and of course a relentless pursuit of metrics and automation that allows the complexity of this scale to become manageable.

Check back for future posts on devops culture, hiring, global scale, etc.! Plus you guys can keep me honest on posting at least weekly.

Interested in the Netflix/Amazon outages? Check out these blogs:

Amazon Post Mortem

Adrian Cockcroft’s (Netflix) analysis of the issue
