Disaster Recovery
Ok, ok – I can hear most of you saying that disaster recovery (DR) is still a critical aspect of running any kind of operation. After all, we need to secure our future operations in case of disaster. Sure, that is still the case – but things are changing, fast.
There are really two things forcing us to look at disaster recovery differently across the board. On one hand, the sheer volume of data is rapidly becoming unmanageable. On the other, there are very few customer-facing services that are not considered mission critical and do not require 100% uptime. As a leading IaaS provider, we know that the person running their first e-commerce offering with zero income feels they are losing just as badly as the large company that might be losing millions per hour when down. The feeling and the result are the same no matter what stage your business is in. We all feel it is always critical to be up and running.
The first time we realized that DR in its traditional sense would not work was when we set up OpenStack Swift over 5 nodes in 3 geographically spread data centers, intended for volumes in the petabyte range. It really comes down to one aspect: recovery time. Sure, we have had large volumes for many years, but we are fast approaching the point where the time to get things back becomes too long for traditional DR to be a viable solution. The downtime in a full disaster would be too great. Even within the same data center, should you truly need to copy a few petabytes from one set of hardware to another, it will take some time. Too long? Most likely. While there are ways to divide and conquer large data, it is time to think differently.
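To put recovery time in perspective, here is a back-of-envelope sketch. The data sizes and link speeds below are illustrative assumptions of mine, not measurements from our data centers, and they assume a fully saturated, sustained transfer – real restores are usually slower once disk, protocol overhead and verification are factored in.

```python
# Rough estimate of how long a bulk restore takes at a given sustained rate.
# The data sizes and link speeds below are illustrative assumptions only.

def transfer_time_hours(data_bytes: float, link_gbps: float) -> float:
    """Time to move data_bytes over a link sustaining link_gbps (gigabits/s)."""
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600

scenarios = [
    ("100 TB", 100e12),   # a modest backup set
    ("2 PB",   2e15),     # a few petabytes, as discussed above
]

for label, size in scenarios:
    for gbps in (10, 100):
        print(f"{label} over a {gbps} Gbit/s link: "
              f"~{transfer_time_hours(size, gbps):.0f} hours")
```

Even with a dedicated 100 Gbit/s link running flat out, a couple of petabytes take roughly two days to move – and that is before any verification or application-level recovery on top of the raw copy.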
Data Points
As the volume of data goes in one direction, the acceptable downtime generally goes the other direction: down. The solution? Multiple data centers that allow for live-live service, with contained restore points kept locally in each data center.
Logical errors, you say? From time to time human error will force us to restore from an older version, so restore points are a must regardless of how you build your service. They can often be taken for more contained parts of each solution, and they can be kept locally in the same DC, where local networks allow for greater speed. With a live-live solution running over two or more data centers, you can do maintenance with far less risk of having to take the service down – regardless of the situation.
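As a rough illustration of what live-live looks like from the consuming side, here is a minimal sketch: a client that treats two data centers as equal peers and simply uses whichever one answers. The endpoint URLs are hypothetical placeholders, and in a real deployment this logic would typically live in the DNS or load-balancing layer rather than in application code.

```python
# Minimal sketch of client-side live-live behaviour: try each data center
# in turn and use the first one that responds. Endpoint URLs are placeholders.
import urllib.request
import urllib.error

DC_ENDPOINTS = [
    "https://dc1.example.com/api/health",  # hypothetical DC 1 endpoint
    "https://dc2.example.com/api/health",  # hypothetical DC 2 endpoint
]

def fetch_from_any_dc(urls, timeout=2):
    """Return the response body from the first data center that answers."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as err:
            last_error = err  # this DC is down or unreachable; try the next one
    raise RuntimeError(f"no data center reachable: {last_error}")

if __name__ == "__main__":
    print(fetch_from_any_dc(DC_ENDPOINTS))
```

The important point is that both data centers serve traffic all the time: maintenance or a local failure in one of them simply shifts load to the other, while restore points stay local to each site.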
Ask yourself – all that data you are shipping to a different data center for DR: when did you last do a full-scale test to see how long it would take to restore 100 TB or more? Is it not time to go live-live over multiple data centers for all of your critical services? If you are running your services in the cloud, schedule full tests and make sure your contingency plans are up to date, because when disaster strikes they are what stands between you and a true business disaster.
By Johan Christenson