Amazon Web Services Deployment The Right Way
In general, when considering all things “cloud” it’s healthy to retain a skeptical mindset and avoid succumbing to hype. But the fallout from the recent Amazon Web Services (AWS) outage is actually a very positive sign for Cloud Computing. Sure, some sites were taken down completely, including a favorite of many, Quora. However, another popular site managed to survive the incident with comparatively minor hiccups: Netflix. This is the bright spot the cloud community should examine. As with many other leading websites, Keynote monitors the performance of certain transactions at Netflix.com. According to Keynote measurements taken on the east coast starting at 12am on April 21st, response times for successful Netflix transactions held steady at a couple of seconds, and the site was available 96% of the time. Granted, this isn’t flawless execution, and note that the 27 failed data points are all timeouts resulting in just a red screen. However, compared to what happened to many sites, this is outstanding. (Y-axis details obscured.)
It’s not dumb luck that got Netflix off this easy. It’s the product of hard work and engineering time invested in building their Amazon Web Services deployment the right way. As Netflix has been touting at various cloud conferences this year, their tremendous growth has forced them to fully embrace AWS. Basically, they only run credit card transactions in their private network. To ensure they always have enough capacity (and, incidentally, are highly available) they have turned provisioning decisions over to their operational systems. Whenever an Amazon instance is performing poorly, they terminate it and get a new one. Likewise, if an availability zone is acting up (as happened on April 21), they automatically switch over to another.
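The replace-and-switch behavior described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Netflix’s actual system and not the AWS API: the `Instance` record, the `launch` stand-in, the `reconcile` function, the 50% zone failure threshold, and the zone names are all assumptions made up for the example, and the health check is passed in as a plain function.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Instance:
    instance_id: str
    zone: str


_ids = count(1)


def launch(zone: str) -> Instance:
    """Hypothetical stand-in for a real provisioning call; it only builds a record."""
    return Instance(f"i-new{next(_ids)}", zone)


def reconcile(
    fleet: Sequence[Instance],
    zones: Sequence[str],
    is_healthy: Callable[[Instance], bool],
    zone_failure_threshold: float = 0.5,  # assumed value, tune for a real fleet
) -> Tuple[List[Instance], List[str]]:
    """Return the repaired fleet and the ids of the instances that were replaced."""
    # Step 1: treat a zone as failing when too many of its instances are unhealthy.
    bad_zones = set()
    for zone in zones:
        members = [i for i in fleet if i.zone == zone]
        failed = [i for i in members if not is_healthy(i)]
        if members and len(failed) / len(members) >= zone_failure_threshold:
            bad_zones.add(zone)

    good_zones = [z for z in zones if z not in bad_zones]
    if not good_zones:
        raise RuntimeError("no healthy zone left to fail over to")

    # Step 2: replace unhealthy instances in place, and move everything
    # out of a failing zone into the remaining zones (round-robin).
    new_fleet: List[Instance] = []
    terminated: List[str] = []
    for inst in fleet:
        if inst.zone in bad_zones:
            terminated.append(inst.instance_id)
            target = good_zones[len(terminated) % len(good_zones)]
            new_fleet.append(launch(target))
        elif not is_healthy(inst):
            terminated.append(inst.instance_id)
            new_fleet.append(launch(inst.zone))
        else:
            new_fleet.append(inst)
    return new_fleet, terminated


# Example run: zone "us-east-1a" is failing outright, "us-east-1b" has one bad instance.
fleet = [
    Instance("a1", "us-east-1a"),
    Instance("a2", "us-east-1a"),
    Instance("b1", "us-east-1b"),
    Instance("b2", "us-east-1b"),
    Instance("b3", "us-east-1b"),
]
unhealthy = {"a1", "a2", "b2"}
new_fleet, terminated = reconcile(
    fleet,
    zones=["us-east-1a", "us-east-1b"],
    is_healthy=lambda inst: inst.instance_id not in unhealthy,
)
```

In this run the two instances in the failing zone and the one sick instance in the healthy zone are replaced, and the fleet ends up at its original size entirely in the surviving zone. A real deployment would run a loop like this continuously and would need real health signals behind `is_healthy`, which is where performance monitoring comes in.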
This is how real high availability has always been done in networking: ensure that you can automatically fail over to logically, physically, and geographically separate resources. Any real engineer will tell you that problems and failures will happen. Your availability track record is based not on how frequently failures occur but on how gracefully you recover from them.
Herein lies the promise of Cloud Computing: namely, the favorable relationship between cost and failover capabilities. In a private network world, you would have to build and pay for a lot of infrastructure yourself: multiple data centers, double the hardware, internet access connections on opposite sides of the building, etc. Very quickly the cost of high availability gets prohibitive, locking out all but the deepest of pockets. Netflix explicitly stated at Cloud Connect that, despite their growth, they just weren’t big enough to justify building a network of redundant data centers.
Enter Cloud Computing. Now having access to redundant data centers is just a matter of purchasing the right performance monitoring tools and investing the engineering time to program your applications and operational systems to take full advantage of on-demand resources. In the end, you only pay for the infrastructure you use, not for what you might need, as is the case when doing it yourself. That’s both the real shame and the real promise highlighted by this outage: young companies like Quora and Foursquare could easily have done just what Netflix has done. The barrier to entry here isn’t a huge budget but the knowledge and priorities to do the work. The next step after fully leveraging Amazon is, of course, to be able to fail over to different cloud providers, and Netflix is probably working on exactly this right now.
In a way, this drives home a point we’ve known all along. Cloud Computing is not outsourcing, because outsourcing implies a transfer of risk and responsibility. Your business, not Amazon or Microsoft or Google, is responsible for the performance of your applications, whether they are in the cloud or not. Cloud Computing is a powerful tool for increasing performance and availability manyfold while reducing costs, if it’s used correctly. If you don’t use the tool properly, then an outage isn’t Amazon’s fault; it’s yours. Amazon seems to agree: according to Gartner analyst Lydia Leong, this isn’t an outage that generates service credits. (Quote at very end of article.)
By Ian Withrow
Senior Product Manager / Keynote Systems, Inc.