Is Your Company Ready For A Cloud Service Outage?

Is Your Company Ready For A Cloud Service Outage?

If you are using one of the major CSPs (cloud service providers) you may already be used to major service outages. Amazon EC2, Rackspace, Google Apps and Microsoft Azure have all had their fair share of outages in the past 18 months, and some of them have been huge failures (Amazon’s April 2011 outage lasted 47 hours for some customers), which have brought down sites such as Reddit. However, companies such as Netflix have been able to survive the outages well. In this post, I will cover details on how you can effectively manage a service outage by taking note of best practices from Netflix and other companies that have successfully weathered the storm.

Create a disaster management fire-drill

Fire drills should be an essential part of cloud management. Once a month, have a fire drill to simulate failures in different parts of the system to see how your systems hold up in case of failures. This includes preparing your PR and customer service personnel, instituting quality control processes and executing an executive-level contingency plan to prevent panic from gripping the company.

Have all your data backed up securely

Periodically back up data and store it away from your primary CSP. For example, you could have an Amazon S3 instance to back up your Rackspace cloud installation. This will mitigate against a single point of failure.

Keep another service provider ready

Have another CSP ready to run an instance of your server at short notice, if needed. Even if it doesn’t provide full features for the site, this plan B should provide a minimum working subset. (In an email application, for instance, the service must allow you to send and receive email, even if contracts and archive access is not restored.)

Create stateless systems

One of the lessons Netflix offered was to build stateless systems where possible. That means a new request from the client can be served by any of the available servers, even if the original server to which the client made the request is down. This requires very careful planning during development.

Work on graceful degradation

Your system has a graceful degradation when a certain percentage of failure causes only an equivalent percentage drop in performance instead of bringing down the whole system. To enable graceful degradation, you must detect failures quickly (set  quick timeouts when the system recognizes a failure) and shut down all non-essential features of the system (to save precious resources for critical features).

Create a communication plan to keep customers in the loop

If there is an outage at your CSP, your customers will also be affected. Imagine you are running an online store and the outage has prevented you from shipping. Even if none of the previous disaster management steps worked, you could save some of the bad press by keeping the customers and other stakeholders regularly informed of the status. Identify proper communication channels and create a plan to keep all of them in the loop. This requires having a backup of customer contact data, writing FAQs and preparing your employees to handle questions appropriately.

Having a proper service outage plan is essential to the survival of your business in the long-term. It could save a lot of headaches, not to mention your brand value, when failures do happen.

By Balaji Viswanathan

About Balaji

Balaji Viswanathan is the founder of Agni Innovation Labs that helps startups and small businesses with their marketing and tech strategy. He has a Masters in Computer Science from the University of Maryland and has been blogging for the past 7 years on technology and business related topics.

View All Articles

Sorry, comments are closed for this post.

The Evolution Of The Connected Cloud

The Evolution Of The Connected Cloud

The Connected Cloud Cloud computing is interesting first, but not only, because of the prevalence of cloud projects. There are many of them launched every day. Some have lofty expectations for business benefits (cost saving of 20 percent or more) and others carry even more intriguing goals. In 2005 “the cloud” was new. Shared computing…

How To Use Big Data And Analytics To Help Consumers

How To Use Big Data And Analytics To Help Consumers

Big Data Analytics Businesses are under increasing pressure to develop data-driven solutions. The competitive advantage gained by a successful strategy can be immense. It can create new opportunities and help businesses to react to different scenarios or sudden changes in the market. But innovation and resilience are not easily achieved, and organizations always face difficult…

Five Cloud Questions Every CIO Needs To Know How To Answer

Five Cloud Questions Every CIO Needs To Know How To Answer

The Hot Seat Five cloud questions every CIO needs to know how to answer The cloud is a powerful thing, but here in the CloudTweaks community, we already know that. The challenge we have is validating the value it brings to today’s enterprise. Below, let’s review five questions we need to be ready to address…

Cost of the Cloud: Is It Really Worth It?

Cost of the Cloud: Is It Really Worth It?

Cost of the Cloud Cloud computing is more than just another storage tier. Imagine if you’re able to scale up 10x just to handle seasonal volumes or rely on a true disaster-recovery solution without upfront capital. Although the pay-as-you-go pricing model of cloud computing makes it a noticeable expense, it’s the only solution for many…

The Modular Drone Concept In Action

The Modular Drone Concept In Action

The Modular Drone Concept As the Internet of Things (IoT) world explodes around us, it is interesting to think about new ways of solving old problems. For example, drones allow for a potential solutions to a number of long-standing problems. Aerial drones that can carry modules are appearing. These new modular drones have a number…

The Global Impact of the USA Freedom Act on Cloud Adoption

The Global Impact of the USA Freedom Act on Cloud Adoption

The Global Impact On Cloud Adoption It’s hard to believe it has been a little over 18 months since Edward Snowden first revealed information about the NSA’s secret surveillance programs to the world. Since that day in June 2013 data privacy and security, specifically in the cloud, have taken on a whole new level of…

Finally, The Time For Security Information Event Management (SIEM)

Finally, The Time For Security Information Event Management (SIEM)

The Time For SIEM Security Information Event Management (SIEM) tools have been around for a long time. My first encounter with a SIEM vendor was about twenty years ago while being courted to resell their product. To this day, I still recall two vivid memories from that meeting; the product was very complex and quite…

CloudTweaks is recognized as one of the leading influencers in cloud computing, infosec, big data and the internet of things (IoT) information. Our goal is to continue to build our growing information portal by providing the best in-depth articles, interviews, event listings, whitepapers, infographics and much more.

Sponsor