AWS Outage – Ground-Hog Day Meets Murphy’s Law; You Guys Should Get A Room!


So, here we go again – I’ve said it once, so I’ll say it again. It gives me no pleasure to write another blog post about AWS suffering another outage in its Northern Virginia region. There are a couple of reasons for that. First, the publicity: industry analysts and commentators are divided into two camps, some taking the view that AWS is not quite fit for purpose (Barb Darrow from GigaOM: “Cloud outage raises more questions about Amazon Cloud”) and others taking the more pragmatic view that Instagram, Netflix, etc. (AWS customers) could have been more proactive in protecting themselves against their host going offline (Ingrid Lunden from TechCrunch: “Could Instagram and other sites avoid going down with Amazons Ship”).

The Twitter feed kicked in on Saturday morning after the outage on Friday night, and when I saw the first few tweets coming in, I thought it was just people catching up and re-posting about the previous AWS issue only two weeks before. But no; it was Groundhog Day all over again – storm hits; power is cut; generators don’t work; elastic cloud falls over; sites go down!

Checking the hashtags #Netflix, #Instagram, #AWS and #AWSoutage, I saw all the expected reactions – AWS customers posting stuff like this:

Nearly 28,000 retweets, and similar for Netflix, Pinterest, and Heroku.

The publicity for all of these companies is clearly not good – consumers don’t care or even know what a “host” is. Unless you work in IT, why would you care or want to know? To a consumer, the service they either pay for (Netflix) or use on an hourly basis (Instagram) just doesn’t work, and that type of damage is difficult to undo.

The second reason is that it just gives more fuel to the “I told you so” cloud naysayers. You can just hear old-school CIOs whispering to fearful CEOs all over the world, “the cloud is not ready for us, and we’re certainly not ready for it!”

But I feel I’m repeating myself a bit from my last post, so let’s move on and take an alternate view, one which I subscribe to, and one that infrastructure teams at Netflix et al. would do well to explore.

Putting all your eggs in one basket is a strategy that is both good and bad. Good, because you get to be a big customer of a provider: you get economies of scale, better pricing, and someone who should pick up the phone when you call. Bad, because you give away some control. When AWS went down, it became clear that many customer infrastructure teams, who may have engineered their applications to be redundant inside their host, hadn’t taken into account the unthinkable – what happens if the host itself goes down?

As Michael Lee from ZDNet pointed out in a post on 2 July, quoting Intelligent Business Research Services advisor Jorn Bettin, the blame for the outage may lie with the affected service providers (AWS’s customers) failing to use cloud services as they should.

He said that the real issue wasn’t that a cloud-services giant such as Amazon had stumbled over a storm, but that the affected customers – Instagram, Pinterest, Pocket and Netflix, which all suffered from Amazon’s outage at the weekend – hadn’t used the ability of the cloud to create geographically redundant links.

“They could operate at a higher level of redundancy, so that these sort of outages would only have a minimal impact on them. It’s a matter of cost,” Bettin said.

This is the most sensible article I’ve read about the AWS outage issue thus far. Having one provider manage your entire infrastructure without a DR/Back-up strategy with another cloud provider is just commercial madness.

Now, I understand there is a cost element here – the cost of replicating some or all of your infrastructure to spin up when a disaster happens is expensive, isn’t it?

Well, yes and no.

Yes, it’s going to add some level of cost, but what you gain from that is control. You, the System Admin from Pintflixogram, get control to the extent that if your primary host goes down, you get to fire up another, secondary host and maintain your service. Let’s remember AWS is not the only hosting company on the planet, although many may perceive it as such. In fact, there are plenty of regional outfits in the market that are not as cheap as AWS. But guess what – they don’t go down.

On the other hand, if you balance the reputational risk, the customer support calls you have to field, the tickets raised, the PR damage limitation exercise and, finally, the churn as your customer base leaves for your competitor, then no, it’s not expensive.

Companies seem to forget that the quality of the hosting service you use shapes the public perception of your company. You can have the coolest website, the best marketing machine, an awesome product or service, but it all counts for nothing when your customer sees this:

I can only imagine the frustration and sense of helplessness that the PR folks and the system admins felt, as there was literally nothing they could do to get their service back online until their host told them it was back up.

But if they had explored a strategy whereby the client had the control instead of the host, then it could have been service as normal.
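To make the idea of client-side control a little more concrete, here is a minimal Python sketch of the decision at the heart of any failover plan: check your hosts in order of preference and route to the first one that answers. Everything named here is an assumption for illustration only. The URLs, the `http_health_check` probe and the `pick_active_host` helper are hypothetical, and a real setup would sit behind DNS or a load balancer and would also have to deal with data replication, which this sketch ignores.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error or malformed URL:
        # treat all of them as "this host is down".
        return False


def pick_active_host(
    hosts: Sequence[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first healthy host in preference order, or None if all are down."""
    for host in hosts:
        if is_healthy(host):
            return host
    return None


# Hypothetical endpoints: primary with one provider, secondary with another.
HOSTS = [
    "https://primary.example.com/health",
    "https://secondary.example.net/health",
]

# Simulate the primary provider being offline, without touching the network.
down = {"https://primary.example.com/health"}
active = pick_active_host(HOSTS, is_healthy=lambda h: h not in down)
print(active)  # prints https://secondary.example.net/health
```

The health check is passed in as a function so that the selection logic can be exercised without a live network. The point is simply that the decision to switch sits with you, not with your host.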

AWS are getting hammered, which is understandable from a certain perspective – clients frustrated that their site has gone down; everyone in the space commenting that this shouldn’t happen – but really, the larger clients of AWS who could and should have explored “Redundancy across Regions” (RaR) strategies only have themselves to blame. There is not an industry on the planet that does not have some kind of back-up plan to maintain their core business in the event of a natural disaster, be it as simple as work from home, or a complete replication of their business environment somewhere else.

It’s clear here that some companies had no such plan and just blamed their host, when in fact, if you look at the big picture, it was their own fault. Murphy’s Law exists for a reason, and there are lessons to be learned here.

It’s very simple: you get what you pay for. You pay for greater control and security so that in the event of something bad happening, you don’t have a Groundhog Day, and you also beat Murphy at his own game.

By Jason Currill

Jason Currill is a seasoned Executive with over 20 years’ international sales and sales leadership experience in investment banking and information technology. In 2011 he founded Ospero, a global Infrastructure as a Service (IaaS) company. Prior to founding Ospero, he held leadership positions at Cisco Systems, Business Objects (an SAP company) and NetSuite, running both EMEA and NA theaters. In addition, Jason spent 10 years as a leading Futures Trader on the London International Financial Futures Exchange (LIFFE) for SG Warburg, Nomura and ING Bank.
