Paul Mercina

Mitigating the Downtime Risks of Virtualization

Nearly every IT professional dreads unplanned downtime. Depending on which systems are hit, it can mean angry messages from employees and the C-suite, and often a Twitterstorm of customer ire. See the recent Samsung SmartThings dustup for an example of how much trust can be lost in just one day.

Gartner pegs the financial cost of downtime at $5,600 per minute, or more than $300,000 per hour. And a survey by IHS found that enterprises experience an average of five downtime events each year, with losses ranging from $1 million for a midsize company to $60 million or more for a large corporation. In addition, the time spent recovering can leave businesses with an “innovation gap,” an inability to redirect resources from maintenance tasks to strategic projects.
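
To put those figures in one place, here is a quick back-of-the-envelope sketch in Python. The per-minute cost and events-per-year numbers are the ones quoted above; the one-hour mean outage duration is an assumption added purely for illustration.

    # Back-of-the-envelope downtime cost math (illustrative).
    # The $5,600/minute and five-events-per-year figures come from the text above;
    # the one-hour mean outage duration is an assumption, not a survey figure.
    COST_PER_MINUTE = 5_600                      # USD per minute (Gartner estimate)
    cost_per_hour = COST_PER_MINUTE * 60         # 336,000 USD, i.e. "over $300,000"

    EVENTS_PER_YEAR = 5                          # average downtime events per year
    ASSUMED_MINUTES_PER_EVENT = 60               # assumption for illustration only
    annual_exposure = EVENTS_PER_YEAR * ASSUMED_MINUTES_PER_EVENT * COST_PER_MINUTE

    print(f"Per hour:        ${cost_per_hour:,}")        # $336,000
    print(f"Annual exposure: ${annual_exposure:,}")      # $1,680,000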

The quest for downtime-minimizing technologies remains hot, especially as demand for high-availability IT has grown. Where “four nines” (99.99%) uptime might once have sufficed, five nines or six nines is now expected.

Enter server virtualization, the technology that lets administrators partition servers, increase utilization rates, and spread workloads across multiple physical devices. It’s powerful and increasingly popular, but it can be a mixed blessing when it comes to downtime.

Virtualization Minimizes Some Causes, Exacerbates Some Impacts of Downtime

Virtualization is no panacea, but that’s not a call to reconsider industry enthusiasm for it. Doing so would be unproductive anyway. The data center virtualization market, already worth $3.75 billion in 2017, is expected to grow to $8.06 billion by 2022. For good reason. Virtualization has many advantages, some of them downtime-related. For example, it’s easier to employ continuous server mirroring for more seamless backup and recovery.

These benefits are well documented by virtualization technology vendors like VMware and in the IT literature generally. Less frequently discussed are the compromises enterprises make with virtualization, which often boil down to an “all eggs in one basket” problem.

Workloads that used to run as discrete jobs on multiple, separate physical servers can, in a virtualized environment, be consolidated onto a single server. The combination of server and hypervisor then becomes a single point of failure, which can have an outsize impact on operations for many reasons.

Increased utilization

First of all, today’s virtualized servers are doing more work. According to a McKinsey & Company report, utilization rates in non-virtualized equipment were mired at 6% to 12%, and Gartner research had similar findings. Virtualization can drive that figure up to 30% or 50% and sometimes higher. Even back-of-the-napkin math shows any server outage has several times the impact of yesteryear, simply because there is more compute happening within any given box.
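
As a minimal sketch of that napkin math, the snippet below assumes a host running at the low end of the pre-virtualization range and the high end of the virtualized range quoted above; the specific utilization values are illustrative assumptions, not figures from the reports cited.

    # Illustrative only: why one host outage now hurts more.
    # Utilization values are assumptions drawn from the ranges quoted above.
    PRE_VIRTUALIZATION_UTIL = 0.10   # within the 6-12% range for dedicated servers
    VIRTUALIZED_UTIL = 0.50          # upper end of the 30-50% range

    impact_multiplier = VIRTUALIZED_UTIL / PRE_VIRTUALIZATION_UTIL
    print(f"A single host outage now removes roughly {impact_multiplier:.0f}x "
          f"as much working compute as it did before consolidation.")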

Diverse customer consequences

Prior to virtualization, co-location customers, among others, demanded dedicated servers to handle their workloads. Although some still do, the cloud has increased comfort with sharing physical resources by using virtual machines (VMs). Now a single server with virtual partitions could be a resource for dozens of clients, vastly expanding the business impact of downtime. Instead of talking to one irate individual demanding a refund, customer service representatives could be getting emails, tweets, and calls from every corner.

This holds true for on-premises equipment as well. The loss of a single server could as easily affect the accounting systems the finance department relies on, the CRM system the sales team needs, and the resources that various customer-facing applications demand, all at the same time. It’s a recipe for help desk meltdown.

Added complexity

According to CIO Magazine, many virtualization projects “have shifted rather than eliminated complexity in the data center.” In fact, of the 16 outages per year their survey respondents reported, 11 were caused by system failures stemming from complexity. And the more complex the environment, the more difficult the troubleshooting process can be, which can lead to longer, more harmful downtime experiences.

Thin client

Although not a direct result of virtualization, the industry’s latest swing of the centralization-versus-decentralization pendulum compounds the problem. After years of powerful PCs loaded with local applications, we have entered an age of mobile, browser-based, and other very thin client solutions. In many cases, the client does little but collect bits of data for processing elsewhere. Not much can happen at the device level if the cloud-based or other computing resources are unavailable. The slightest problem can result in mounting user frustration as apps crash and error messages pile up.

In summary, the data center of 2018 houses servers that are doing more, for more internal and external customers. At the same time, added complexity raises downtime risk and makes problems harder to diagnose, which can lead to extended outages. Although effective failover, backup, and recovery processes can help mitigate the combined effects, these tactics alone are not enough.

Additional Solutions for Minimizing Server Downtime

It may sound old school, but data center managers need to stay focused on IT equipment. Equipment failures account for 40% of all reported downtime. Compare that figure with the 25% caused by human error, whether by internal staff or service providers, and the 10% caused by cyberattacks. To have the greatest positive effect on uptime, hardware should obviously be the first target.

There are several recommendations data center managers should implement, if they haven’t already done so:

  • Perform routine maintenance regularly. It should go without saying but often doesn’t. Install recommended patches, check for physical issues like airflow blockages, and heed all alerts and warnings. Maintenance is fundamental work, but it is no less essential for being routine. That means training employees, scheduling tasks, and tracking completion. If maintenance can’t happen on time, all the time, seek outside assistance to get it done so available internal resources can focus on strategic projects and those unavoidable fire drills without leaving systems in jeopardy.
  • Monitor your resources. The first you hear of an outage should never be from a customer. Full-time, 24/7 systems monitoring is a must for any enterprise (a minimal health-check sketch follows this list). Fortunately, there are new, AI-driven technologies combining monitoring with advanced predictive maintenance capabilities for immediate fault detection and integrated, quick-turnaround response. Access is less expensive than you might think.
  • Upgrade your break/fix plan. A disorganized parts closet or an eBay strategy won’t work. Rapid access to spares is vital in getting systems back online without delay. Especially for mission critical systems, station repair kits on site or work with a vendor who can do so and/or deliver spares within hours.
  • Invest in expertise. Parts are only part of the equation. There is significant skill involved in troubleshooting systems in these increasingly complex data center environments. The current IT skills gap may necessitate looking outside the enterprise to complement existing engineering capabilities with those of a third-party provider.
  • Test everything. Data centers evolve, but conducting proof-of-principle testing on each workload before any changes are made will cut down on virtualization problems before they happen. By the same token, systems recovery and DR scenarios are unknowns unless they are real-world verified. Try pulling a power cord and see what happens. Does that idea give you pause? It might be time for some enhancements.
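
To make the monitoring recommendation above concrete, here is a minimal, vendor-neutral health-check sketch in Python. The endpoint URL, check interval, and alert hook are placeholder assumptions; a production deployment would rely on a dedicated monitoring platform with predictive capabilities rather than a standalone script like this.

    # Minimal periodic health-check sketch (illustrative, not production monitoring).
    # The URL, interval, and alerting hook below are placeholder assumptions.
    import time
    import urllib.error
    import urllib.request

    HEALTH_URL = "http://example.internal/healthz"   # hypothetical endpoint
    CHECK_INTERVAL_SECONDS = 30

    def check_once(url: str, timeout: float = 5.0) -> bool:
        """Return True if the endpoint answers with HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def alert(message: str) -> None:
        # Placeholder: wire this up to email, chat, or a paging system.
        print(f"ALERT: {message}")

    if __name__ == "__main__":
        while True:
            if not check_once(HEALTH_URL):
                alert(f"{HEALTH_URL} failed its health check")
            time.sleep(CHECK_INTERVAL_SECONDS)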

There is good news for IT organizations already overwhelmed by demands to maintain more complex environments, execute the digital transformation, and achieve it all with fewer resources and less money, in a tight labor market to boot. Alternatives exist.

Third-party maintenance providers can take on a substantial portion of the equipment-related upkeep, troubleshooting, and support tasks in any data center. With a premium provider on board, it’s possible to radically reduce downtime and reach the availability and reliability goals you’d hoped to achieve when you took the virtualization path in the first place.

By Paul Mercina
