Cloud Based Service Level Agreements
A common theme in the chatter about cloud computing is the need for SLAs around performance and availability. These SLAs either don’t exist or are specified in a way that makes them look like they’ll never be violated. For example, Amazon’s availability SLA covers network connectivity for an entire availability zone. So if you have a single instance that can access the internet (whether or not your app will run on it), there is no violation. This is hardly a new state of affairs, and there doesn’t seem to be much change coming from the major providers. Some businesses and analysts view this as a show stopper/game changer while others couldn’t care less. Last week at Cloud Connect I had a moment of crystallization where the ideological divide that is fueling this and other cloud computing controversies became clear to me.
There are two dominant ways of looking at and valuing the cloud, and they lead to very different philosophies about what it is and what SLAs should be: the cloud as a business model versus the cloud as a cost savings/efficiency device. The former seems to be held by established companies and the latter by start-ups, although there are certainly exceptions on both sides.
The business model camp, exemplified by Netflix at Cloud Connect, views the cloud as a new way of operating. You design applications explicitly for the cloud, using principles such as “design for failure” and automated provisioning of instances. At the extreme, you terminate and create instances not just when you need more capacity but whenever a particular instance looks like it may not be 100% healthy, and all of this happens automatically. In this case you need SLAs on the APIs, not the system.
Now Amazon’s system-level availability SLA also makes sense. As a result, providers can use junky, unreliable hardware, and as long as the provider has enough capacity (like Amazon) it just doesn’t matter. In this way you get performance as good as you’d get with dedicated infrastructure in the best case, but you avoid many bad cases where performance is degraded by a discrete set of failing servers. So overall average performance increases and becomes more consistent to boot. The money savings come in when your application scales to the capacity it needs at any given time. A happy side benefit is that you no longer need to forecast capacity, an exercise that always produces an output that is wrong in some way. Even if your load is steady and you don’t save money this way, the primary drivers are agility and consistent performance.
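To make the “design for failure” idea concrete, here is a minimal sketch of the kind of automated loop described above. It is an illustration only, not how Netflix or Amazon actually do it: `FakeCloud`, `Instance`, the `health` score, and `reconcile` are all hypothetical names, and the provisioning API is an in-memory stand-in rather than any real provider’s interface.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    health: float  # hypothetical score: 0.0 (dead) to 1.0 (fully healthy)


@dataclass
class FakeCloud:
    """In-memory stand-in for a provider's provisioning API (hypothetical)."""
    instances: dict = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def launch(self) -> Instance:
        inst = Instance(f"i-{next(self._ids):04d}", health=1.0)
        self.instances[inst.instance_id] = inst
        return inst

    def terminate(self, instance_id: str) -> None:
        del self.instances[instance_id]


def reconcile(cloud: FakeCloud, desired_count: int, min_health: float = 1.0):
    """Drop anything below min_health, then scale the fleet to desired_count."""
    terminated, launched = [], []

    # Design for failure: do not repair a suspect instance, just replace it.
    for inst in list(cloud.instances.values()):
        if inst.health < min_health:
            cloud.terminate(inst.instance_id)
            terminated.append(inst.instance_id)

    # Scale up to the capacity needed right now.
    while len(cloud.instances) < desired_count:
        launched.append(cloud.launch().instance_id)

    # Scale down if the load has dropped (oldest instances go first).
    while len(cloud.instances) > desired_count:
        victim = next(iter(cloud.instances))
        cloud.terminate(victim)
        terminated.append(victim)

    return terminated, launched


if __name__ == "__main__":
    cloud = FakeCloud()
    reconcile(cloud, desired_count=3)
    cloud.instances["i-0002"].health = 0.7  # one instance starts to look shaky
    print(reconcile(cloud, desired_count=3))
```

Notice that the only things this loop depends on are the launch and terminate calls and the availability of spare capacity. That is why, in this camp, an SLA on the APIs matters more than an SLA on any individual instance.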
By contrast, the cost savings/efficiency camp views cloud computing as a logical extension of what you can do with virtualization in-house: save more money in terms of both capital and operating expenses. The savings seem substantial even if you don’t have the interest or capabilities to re-architect, although those who do re-architect get these savings as well. Established businesses have the most to gain here since they already have a ton of existing applications and headcount. To get the savings, you migrate your applications directly to the public cloud. However, as you peel the onion you see a number of problems. For example, what happens if you get provisioned onto a server with a greedy neighbor on it? Worse, those who architected their applications for the public cloud will detect this and move, leaving you more likely to get stuck in these situations over time. So you ask the provider to assume the risk in a way you are familiar with: SLAs similar to what you get from your ISPs, expressed as performance guarantees.
So setting aside normative questions and religion, what does this all mean? Well, one key point is that if you embrace the business model, you’ll save more money. You’ll also of course get all the benefits of the new model, like elasticity and agility, a point that the cost savings camp largely conceded at Cloud Connect. However, I’m a pragmatist. It is simply not realistic to expect a global enterprise to retool all its applications en masse for the cloud. Being morally right has nothing to do with it.
This brings us to the interesting question: why would Amazon and company break down and offer the SLAs that people want? I’m not so sure they will. Amazon is growing like gangbusters, and catering to the second group may hurt its ability to pursue the first, for marketing or technical reasons. The other major players who aspire to Amazon’s market are unlikely to break ranks either. Guess who may step in though? The carriers. After all, these are the guys who already offer the SLAs that the enterprise is accustomed to. So my prediction is that if this happens, you will quickly hear allegations that these “traditional SLA”-backed services aren’t in fact “real” cloud computing offerings, and the schism will deepen.
As a final note, I think that in the long term, however many years it takes, more and more companies will embrace the business model approach to cloud computing. As I said, the experts on both sides seem to agree that designing for the cloud will save you more money and bring many other compelling, hard-to-ignore benefits to the business. If businesses subscribe to this line of thinking, it may well seem to Amazon and others that the right long-term play is to ignore the customer.
So if you are waiting for better SLAs to be offered by the major cloud computing providers, you may be waiting a long time…
By Ian Withrow / Senior Product Manager, Keynote Systems, Inc.