Microsoft Azure Outage Blamed on Leap Year

All of us will remember the fears associated with the Y2K problem, where computers’ inability to distinguish between the years 1900 and 2000 was supposed to create a slew of problems. Although problems were never as severe as some doomsday prophets predicted, often due to precautionary measures, there were some incidents that did affect normal life. Now, it seems that another date problem is to blame for Microsoft’s cloud outage late last month.

On 28 February, Azure customers around the globe ran into problems. According to Mary Jo Foley of ZDNet, “Azure problems began with an outage in the Windows Azure Management Service technology, which then spread to the Windows Azure Compute and Access Control parts of the platform. Affected areas included North Europe, North Central US and South Central US regions.” Microsoft did respond quickly, in contrast to Amazon's handling of its outage last year (See: Reactions to the Amazon Cloud Outage and the Company’s Explanation), but the underlying reason is a major embarrassment to the company.

“February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions. The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year. Once we discovered the issue, we immediately took steps to protect customer services that were already up and running and began creating a fix for the issue. The fix was successfully deployed to most of the Windows Azure sub-regions and we restored Windows Azure service availability to the majority of our customers and services by 2:57 a.m. PST,” Bill Laing, corporate vice president of Microsoft’s Server and Cloud Division, wrote on the company blog.

In other words, Microsoft’s failure to account for the leap year, something that comes around roughly every four years, is to blame. For a company so heavily invested in cloud computing (See: Is Microsoft Taking A Risk By Putting All Its Eggs In The Cloud Computing Basket?), such an oversight is a major embarrassment. However, considering cloud computing’s relative infancy, consumers may be willing to forgive such lapses. At the same time, providers must acknowledge the gaps in their knowledge and learn from such incidents. As Charles Babcock of InformationWeek remarked, “This incident is a reminder that the best practices of cloud computing operations are still a work in progress, not an established science. And while prevention is better than cure, infrastructure-as-a-service operators may not know everything they need to about these large-scale environments.”
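To see how a “time calculation that was incorrect for the leap year” can creep into code, consider the rough Python sketch below. It is only an illustration of one common form of the bug: Microsoft had not published a final root cause at the time of writing, and the function names here (naive_one_year_later, safe_one_year_later) are hypothetical, not anything from Azure. The assumption is a routine that computes "the same date next year" by simply adding one to the year.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical buggy helper: bump the year and keep month and day as-is.
    # Works on every day of the calendar except 29 February.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Hypothetical fix: if the target year has no 29 February,
    # fall back to 28 February instead of failing.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        # 29 February 2013 does not exist, so the naive version raises.
        print("naive calculation failed:", err)
    print("safe calculation:", safe_one_year_later(leap_day))
```

Code like the naive version can pass every test for nearly four years and then fail on a single day, which is part of why date bugs of this sort slip through.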

By Sourya Biswas

Sourya Biswas is a former risk analyst who has worked with several financial organizations of international repute, besides being a freelance journalist with several articles published online. After 6 years of work, he has decided to pursue further studies at the University of Notre Dame, where he has completed his MBA. He holds a Bachelors in Engineering from the Indian Institute of Information Technology. He is also a member of high-IQ organizations Mensa and Triple Nine Society and has been a prolific writer to CloudTweaks over the years... http://www.cloudtweaks.com/author/sourya/
