Is Your Company Ready For A Cloud Service Outage?

If you use one of the major cloud service providers (CSPs), you may already be used to major service outages. Amazon EC2, Rackspace, Google Apps and Microsoft Azure have all had their share of outages in the past 18 months. Some have been huge failures that brought down sites such as Reddit (Amazon’s April 2011 outage lasted 47 hours for some customers). Companies such as Netflix, however, have survived the outages well. In this post, I cover how you can manage a service outage effectively by borrowing best practices from Netflix and other companies that have weathered the storm.

Create a disaster management fire drill

Fire drills should be an essential part of cloud management. Once a month, run a drill that simulates failures in different parts of the system to see how it holds up. The drill should also cover the non-technical side: preparing your PR and customer service personnel, instituting quality control processes and rehearsing an executive-level contingency plan so that panic does not grip the company.

Have all your data backed up securely

Periodically back up your data and store it away from your primary CSP. For example, you could use an Amazon S3 bucket to back up your Rackspace cloud installation. This guards against a single point of failure.
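
A backup is only useful if the copy is intact, so it helps to verify it at the time it is made. The sketch below is a minimal illustration, not a production tool: the names sha256_of and backup_and_verify are hypothetical, and a local directory stands in for storage at the second provider.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and confirm the copy is intact.

    Assumption: `backup_dir` stands in for storage at a second provider.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    # Compare checksums so a truncated or corrupted copy is caught now,
    # not during an outage when the backup is needed.
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} does not match the original")
    return target
```

In a real setup the copy step would be replaced by an upload to the second provider, with the same checksum comparison performed against the stored object.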

Keep another service provider ready

Have another CSP ready to run an instance of your server at short notice. Even if it doesn’t provide the site’s full feature set, this plan B should provide a minimum working subset. (In an email application, for instance, the service must let you send and receive email, even if contacts and archive access are not restored.)
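
Switching to the standby can be reduced to a simple rule: try providers in order of preference and use the first one that passes a health check. The following is a rough sketch under that assumption; pick_provider and the provider names are hypothetical, and each health check is an ordinary callable that you would supply.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider is a (name, health_check) pair; the check returns True if usable.
Provider = Tuple[str, Callable[[], bool]]


def pick_provider(providers: Sequence[Provider]) -> Optional[str]:
    """Return the name of the first healthy provider, in preference order."""
    for name, is_healthy in providers:
        try:
            if is_healthy():
                return name
        except Exception:
            # A health check that raises counts as a failed provider.
            continue
    return None  # nothing is healthy: time for the communication plan
```

For example, with a primary whose check fails and a standby whose check passes, pick_provider returns the standby's name.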

Create stateless systems

One of the lessons Netflix offered was to build stateless systems where possible. That means a new request from a client can be served by any of the available servers, even if the server that handled the client’s earlier requests is down. This requires very careful planning during development.
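
One common way to get there is to keep per-client state out of the application servers and in a shared store that every server can reach. The sketch below illustrates the idea only; it is not Netflix's implementation. SharedSessionStore and AppServer are hypothetical names, and the in-memory dictionary stands in for an external database or cache.

```python
class SharedSessionStore:
    """Stand-in for an external store (database or cache) shared by all servers."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        return dict(self._data.get(session_id, {}))

    def save(self, session_id, state):
        self._data[session_id] = dict(state)


class AppServer:
    """A server that keeps no client state between requests."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        # Everything needed to serve the request comes from the shared store.
        state = self.store.load(session_id)
        cart = state.get("cart", []) + [item]
        self.store.save(session_id, {"cart": cart})
        return cart
```

Because neither server remembers the client, a request that starts on one server can continue on another:

```python
store = SharedSessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

server_a.add_to_cart("session-1", "book")
# server_a goes down; server_b picks up the same session.
print(server_b.add_to_cart("session-1", "pen"))
```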

Work on graceful degradation

A system degrades gracefully when a given percentage of failures causes only an equivalent percentage drop in performance instead of bringing down the whole system. To enable graceful degradation, you must detect failures quickly (set short timeouts so that the system recognizes a failure promptly) and shut down all non-essential features (to save precious resources for critical features).
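
Both ideas, short timeouts and dropping non-essential features, can be sketched in a few lines. This is an illustrative example under stated assumptions: call_with_timeout and render_product_page are hypothetical names, the 50 ms timeout is arbitrary, and the product details are treated as the essential feature while recommendations are optional.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for calls that must not block the request for long.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_seconds, fallback):
    """Run `func`, but give up and return `fallback` if it is slow or fails."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running continues
        return fallback
    except Exception:
        return fallback


def render_product_page(load_product, load_recommendations):
    """Serve the essential part of the page even when extras are unavailable."""
    product = load_product()  # essential: failures here should surface
    recommendations = call_with_timeout(load_recommendations, 0.05, fallback=[])
    return {"product": product, "recommendations": recommendations}
```

If the recommendation service hangs, the page still renders with an empty recommendations list instead of timing out as a whole.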

Create a communication plan to keep customers in the loop

If there is an outage at your CSP, your customers will be affected too. Imagine you are running an online store and the outage has prevented you from shipping. Even if none of the previous disaster management steps worked, you could avoid some of the bad press by keeping customers and other stakeholders regularly informed of the status. Identify the proper communication channels and create a plan to keep all of them in the loop. This requires having a backup of customer contact data, writing FAQs and preparing your employees to handle questions appropriately.

Having a proper service outage plan is essential to the survival of your business in the long term. It could save a lot of headaches, not to mention your brand value, when failures do happen.

By Balaji Viswanathan

About Balaji

Balaji Viswanathan is the founder of Agni Innovation Labs, which helps startups and small businesses with their marketing and tech strategy. He has a master’s in Computer Science from the University of Maryland and has been blogging for the past 7 years on technology and business-related topics.
