Business Continuity: Let Us Plan For A Cloud Failure
Cloud proponents keep saying that the cloud is the only safe place left for our businesses and other IT systems, while the “nay-sayers” argue the total opposite, citing privacy, security, and even data integrity concerns. But no matter which side you are on, nothing is foolproof, and everything will eventually fail in one way or another. Even with business continuity plans and geographically scattered backup systems in place, there is always the possibility of one big problem that takes all of them down simultaneously.
But the scenario above is quite unlikely unless we are talking about a disaster on a global scale. Yes, cloud computing may fail, but not all of it, and not all at once. This is the beauty of cloud computing: its disjointedness, its seemingly random choice of server locations, and of course the sheer number of servers when you combine those of different service providers. Hence, the first step toward a real business continuity plan is to plan for cloud failures, not just local failures.
The key is to design your IT infrastructure around the idea that one of the servers hosting your applications WILL go down in the future, and then find a solution for that. A simple solution is to scatter those servers around the globe, using either different providers or a single provider that gives you control over which servers your services run on. Make sure that codependent systems and subsystems can act independently to some extent. For example, if a certain function is down, make sure another function can take its place and act as a backup with reduced functionality, rather than having the whole system go down. We can now launch servers anywhere in the world from a laptop or even a smartphone, and have them run for a few cents an hour. There are so many options out there that business continuity is now affordable at a level that was unimaginable just a few years ago.
One very good way to test the overall resilience of a system is to randomly shut off part of it and see whether the whole still works. This also gives developers a way to test interconnectivity and the integrity of the business continuity solution. Netflix, an avid user of the Amazon Web Services (AWS) cloud, calls its version of this “random kill” system the “Chaos Monkey” because of the unpredictability of which service will be shut off next. This is a very good test case for a high-load system that is in high demand due to the nature of the service: streaming movies and videos.
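The "random kill" idea can be illustrated with a toy simulation in Python. This is not how Netflix's Chaos Monkey is implemented; it is only a sketch that assumes a service stays available as long as at least one of its instances is alive. The `Service` class and the function names are hypothetical.

```python
import random
from typing import List, Optional, Tuple


class Service:
    """A toy service made of several interchangeable instances."""

    def __init__(self, name: str, replicas: int) -> None:
        self.name = name
        # Map each instance name to whether it is currently alive.
        self.alive = {f"{name}-{i}": True for i in range(replicas)}

    def is_available(self) -> bool:
        # Assumption: one surviving instance is enough to serve traffic.
        return any(self.alive.values())


def kill_random_instance(
    services: List[Service], rng: random.Random
) -> Optional[str]:
    """Shut off one randomly chosen live instance; return its name, if any."""
    candidates = [
        (service, instance)
        for service in services
        for instance, up in service.alive.items()
        if up
    ]
    if not candidates:
        return None
    service, instance = rng.choice(candidates)
    service.alive[instance] = False
    return instance


def run_chaos_round(
    services: List[Service], rng: random.Random
) -> Tuple[Optional[str], List[str]]:
    """Kill one instance, then report which services are no longer available."""
    victim = kill_random_instance(services, rng)
    unavailable = [s.name for s in services if not s.is_available()]
    return victim, unavailable


if __name__ == "__main__":
    rng = random.Random()
    services = [Service("web", 2), Service("db", 2)]
    victim, unavailable = run_chaos_round(services, rng)
    print(f"killed {victim}; unavailable services: {unavailable}")
```

With two instances per service, a single kill never takes a whole service down in this model, which is exactly the property a real test of this kind is meant to verify.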
By Abdul Salam