Is Your Company Ready For A Cloud Service Outage?

If you use one of the major cloud service providers (CSPs), you may already be used to major service outages. Amazon EC2, Rackspace, Google Apps and Microsoft Azure have all had their share of outages in the past 18 months, and some have been huge failures that brought down sites such as Reddit (Amazon’s April 2011 outage lasted 47 hours for some customers). Companies such as Netflix, however, have survived the outages well. In this post, I cover how you can manage a service outage effectively by drawing on best practices from Netflix and other companies that have weathered the storm.

Create a disaster management fire drill

Fire drills should be an essential part of cloud management. Once a month, run a fire drill that simulates failures in different parts of the system to see how well it holds up. The drill should also cover preparing your PR and customer service personnel, instituting quality control processes and rehearsing an executive-level contingency plan to prevent panic from gripping the company.
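The technical half of such a drill can be rehearsed in code. The following is a minimal sketch, not a production tool: the Component class, the component names and the rule that the service is available while every critical component is up are all assumptions made for illustration.

```python
import random


class Component:
    """Hypothetical stand-in for one part of the system (database, cache, ...)."""

    def __init__(self, name, critical):
        self.name = name
        self.critical = critical
        self.up = True


def service_available(components):
    """Assumption: the service is available while every critical component is up."""
    return all(c.up for c in components if c.critical)


def run_drill(components, rng):
    """Take one randomly chosen component down, record the outcome, restore it."""
    victim = rng.choice(components)
    victim.up = False
    try:
        return victim.name, service_available(components)
    finally:
        victim.up = True  # always restore, even if the check raises


def run_drills(components, rounds, seed=0):
    """Run several drills and return a list of (failed component, survived?) pairs."""
    rng = random.Random(seed)
    return [run_drill(components, rng) for _ in range(rounds)]


if __name__ == "__main__":
    system = [Component("database", True), Component("recommendations", False)]
    for name, survived in run_drills(system, rounds=5):
        print(f"killed {name}: service {'survived' if survived else 'went down'}")
```

Every drill that reports the service going down points at a component that needs redundancy or a fallback before a real outage finds it.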

Have all your data backed up securely

Periodically back up your data and store it away from your primary CSP. For example, you could use an Amazon S3 bucket to back up your Rackspace cloud installation. This guards against a single point of failure.
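A backup only helps if the copy is intact. The sketch below copies a file to a second location and compares SHA-256 checksums before trusting it. It assumes the secondary storage is reachable as a directory path; with a real object store you would replace the copy step with that provider's upload call. The function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dir):
    """Copy `source` into `backup_dir` and confirm the copy matches the original."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} does not match the original")
    return target
```

Running the same checksum comparison on a schedule, not only at copy time, also catches backups that degrade later.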

Keep another service provider ready

Have another CSP ready to run an instance of your server at short notice, if needed. Even if it doesn’t provide the site's full feature set, this plan B should provide a minimum working subset. (In an email application, for instance, the service must allow you to send and receive email, even if contacts and archive access are not restored.)
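The selection logic behind such a plan B can be sketched in a few lines. This is an illustration under stated assumptions: the Provider class, the provider names, the feature names and the is_healthy callback are all hypothetical, and a real health check would probe the provider over the network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    features: frozenset


# Hypothetical setup for the email example: the standby offers only the minimum subset.
PRIMARY = Provider("primary", frozenset({"send", "receive", "archive"}))
STANDBY = Provider("standby", frozenset({"send", "receive"}))
REQUIRED = frozenset({"send", "receive"})


def pick_provider(providers, is_healthy, required=REQUIRED):
    """Return the first healthy provider, in priority order, that covers `required`."""
    for provider in providers:
        if is_healthy(provider) and required <= provider.features:
            return provider
    raise RuntimeError("no healthy provider offers the required features")


if __name__ == "__main__":
    # Simulate an outage at the primary provider.
    chosen = pick_provider([PRIMARY, STANDBY], lambda p: p.name != "primary")
    print(chosen.name)
```

Listing the required features explicitly forces the decision about what the minimum working subset is before the outage, not during it.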

Create stateless systems

One of the lessons Netflix offered was to build stateless systems where possible. In a stateless system, any available server can serve a new request from a client, even if the server that handled the client's earlier requests is down. This requires very careful planning during development.
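One common way to get there is to keep per-client state out of the servers and in a store they all share. The sketch below illustrates the idea with an in-memory dictionary standing in for an external database or cache; the class names and the shopping-cart scenario are assumptions for illustration, not a description of how Netflix implements it.

```python
class SharedStore:
    """Stand-in for an external store (database or cache) that all servers share."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class Server:
    """Keeps no per-client state of its own; everything lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        cart = list(self.store.get(session_id, []))  # copy, then write back
        cart.append(item)
        self.store.put(session_id, cart)
        return cart


if __name__ == "__main__":
    store = SharedStore()
    server_a, server_b = Server("a", store), Server("b", store)
    server_a.add_to_cart("session-1", "book")
    # Server "a" can now fail; server "b" continues the same session.
    print(server_b.add_to_cart("session-1", "pen"))
```

The shared store then becomes the component that must be replicated, which is usually easier than making every application server recoverable.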

Work on graceful degradation

Your system degrades gracefully when a given percentage of failures causes only an equivalent percentage drop in performance instead of bringing down the whole system. To enable graceful degradation, you must detect failures quickly (set short timeouts so that the system recognizes a failure early) and shut down all non-essential features of the system (to save precious resources for critical features).
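Both ideas, short timeouts and dropping non-essential features, can be shown in one small sketch. It assumes a page made of a critical part (the catalog) and a non-essential part (recommendations); the function names and the 0.2-second default are hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool; a timed-out call keeps running in the background.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback):
    """Run `func` in a worker thread; return `fallback` if it is slow or fails."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # detected quickly: give up instead of blocking the page
    except Exception:
        return fallback  # the feature itself failed


def render_page(load_catalog, load_recommendations, timeout_s=0.2):
    """Build a page whose non-essential part may be empty during an outage."""
    page = {"catalog": load_catalog()}  # critical: errors here are not hidden
    page["recommendations"] = call_with_timeout(
        load_recommendations, timeout_s, fallback=[]
    )
    return page
```

Note that the timeout only stops the caller from waiting; the slow call still occupies a worker thread until it finishes, so a real system would also cap or cancel the underlying request.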

Create a communication plan to keep customers in the loop

If there is an outage at your CSP, your customers will also be affected. Imagine you are running an online store and the outage has prevented you from shipping. Even if none of the previous disaster management steps worked, you could avoid some of the bad press by keeping customers and other stakeholders regularly informed of the status. Identify the proper communication channels and create a plan to keep all of them in the loop. This requires having a backup of customer contact data, writing FAQs and preparing your employees to handle questions appropriately.

Having a proper service outage plan is essential to the survival of your business in the long term. It could save a lot of headaches, not to mention your brand value, when failures do happen.

By Balaji Viswanathan

About Balaji

Balaji Viswanathan is the founder of Agni Innovation Labs, which helps startups and small businesses with their marketing and tech strategy. He has a Master's in Computer Science from the University of Maryland and has been blogging for the past 7 years on technology- and business-related topics.
