Paul Mercina

Mitigating the Downtime Risks of Virtualization

Nearly every IT professional dreads unplanned downtime. Depending on which systems are hit, it can mean angry communications from employees and the C-suite, and often a Twitterstorm of customer ire. See the recent Samsung SmartThings dustup for an example of how much trust can be lost in just one day.

Gartner pegs the financial cost of downtime at $5,600 per minute, or over $300,000 per hour. And a survey by IHS found enterprises experience an average of five downtime events each year, resulting in losses of $1 million for a midsize company to $60 million or more for a large corporation. In addition, the time spent recovering can leave businesses with an “innovation gap,” an inability to redirect resources from maintenance tasks to strategic projects.

The quest for downtime-minimizing technologies remains hot, especially as demand for high-availability IT has grown. Where “four nines” (99.99%) uptime might once have sufficed, five nines or six nines is now expected.
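
For a sense of what each additional nine buys, the quick sketch below (in Python, purely illustrative) converts an availability percentage into an annual downtime budget:

    # Convert an availability target into an annual downtime budget.
    MINUTES_PER_YEAR = 365 * 24 * 60

    def downtime_minutes_per_year(availability_pct):
        """Minutes of allowable downtime per year at a given availability."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    for label, pct in [("four nines", 99.99), ("five nines", 99.999), ("six nines", 99.9999)]:
        print(f"{label} ({pct}%): about {downtime_minutes_per_year(pct):.1f} minutes of downtime per year")

    # four nines (99.99%): about 52.6 minutes of downtime per year
    # five nines (99.999%): about 5.3 minutes of downtime per year
    # six nines (99.9999%): about 0.5 minutes of downtime per year

Each extra nine shrinks the allowable outage window by a factor of ten, which is why hardware failures measured in hours simply do not fit the budget.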

Enter server virtualization, which enables administrators to partition servers, increase utilization rates, and spread workloads across multiple physical devices. It’s a powerful and increasingly popular technology, but it can be a mixed blessing when it comes to downtime.

Virtualization Minimizes Some Causes, Exacerbates Some Impacts of Downtime

Virtualization is no panacea, but that’s not a call to reconsider industry enthusiasm for it. Doing so would be unproductive anyway. The data center virtualization market, already worth $3.75 billion in 2017, is expected to grow to $8.06 billion by 2022, and for good reason: virtualization has many advantages, some of them downtime-related. For example, it’s easier to employ continuous server mirroring for more seamless backup and recovery.

These benefits are well documented by virtualization technology vendors like VMware and in the IT literature generally. Less frequently discussed are the compromises enterprises make with virtualization, which often boil down to an “all eggs in one basket” problem.

What used to be discrete workloads running on multiple, separate physical servers can in a virtualized environment be consolidated to a single server. The combination of server and hypervisor then becomes a single point of failure, which can have an outsized impact on operations for many reasons.

Increased utilization

First of all, today’s virtualized servers are doing more work. According to a McKinsey & Company report, utilization rates in non-virtualized equipment were mired at 6% to 12%, and Gartner research had similar findings. Virtualization can drive that figure up to 30% or 50% and sometimes higher. Even back-of-the-napkin math shows any server outage has several times the impact of yesteryear, simply because there is more compute happening within any given box.
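
To make that napkin math concrete, the short sketch below compares how much work an hour of downtime takes offline at the utilization levels cited above; the 10% and 40% figures are illustrative assumptions drawn from those ranges, not measurements:

    # Illustrative only: compare the compute idled by one hour of downtime
    # at pre- and post-virtualization utilization levels.
    legacy_utilization = 0.10        # within the 6%-12% range for non-virtualized servers
    virtualized_utilization = 0.40   # within the 30%-50% range after consolidation

    impact_multiplier = virtualized_utilization / legacy_utilization
    print(f"An hour of downtime idles ~{impact_multiplier:.0f}x more work on the consolidated server.")
    # Prints: An hour of downtime idles ~4x more work on the consolidated server.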

Diverse customer consequences

Prior to virtualization, co-location customers, among others, demanded dedicated servers to handle their workloads. Although some still do, the cloud has increased comfort with sharing physical resources by using virtual machines (VMs). Now a single server with virtual partitions could be a resource for dozens of clients, vastly expanding the business impact of downtime. Instead of talking to one irate individual demanding a refund, customer service representatives could be getting emails, tweets, and calls from every corner.

This holds true for on-premises equipment as well. The loss of a single server could just as easily affect the accounting systems the finance department relies on, the CRM system the sales team needs, and the resources various customer-facing applications demand, all at the same time. It’s a recipe for help desk meltdown.

Added complexity

According to CIO Magazine, many virtualization projects “have shifted rather than eliminated complexity in the data center.” In fact, of the 16 outages per year their survey respondents reported, 11 were caused by system failure resulting from complexity. And the more complex the environment, the more difficult the troubleshooting process can be, which can lead to longer, more damaging downtime.

Thin clients

Although not a direct result of virtualization, the thin client trend represents yet another swing of the centralization versus decentralization pendulum. After years of powerful PCs loaded with local applications, we have entered an age of mobile, browser-based, and other very thin client solutions. In many cases, the client does little but collect bits of data for processing elsewhere. Not much can happen at the device level if the cloud-based or other computing resources are unavailable. The slightest problem can result in mounting user frustration as apps crash and error messages are returned.

In summary, the data center of 2018 houses servers that are doing more, for more internal and external customers. At the same time, added complexity is raising downtime risk, with problems that can be more difficult to solve and can therefore lead to extended outages. Although effective failover, backup, and recovery processes can help mitigate the combined effects, these tactics alone are not enough.

Additional Solutions for Minimizing Server Downtime

It may sound old school, but data center managers need to stay focused on IT equipment. Equipment failures account for 40% of all reported downtime. Compare that figure with the 25% caused by human error, whether by internal staff or service providers, and the 10% caused by cyberattacks. To have the greatest positive effect on uptime, hardware should be the first target.

There are several recommendations data center managers should implement, if they haven’t already done so:

  • Perform routine maintenance regularly. It should go without saying but often doesn’t. Install recommended patches, check for physical issues like airflow blockages, and heed all alerts and warnings. Maintenance may be basic, but it is no less essential. That means training employees, scheduling tasks, and tracking completion. If maintenance can’t happen on time, all the time, seek outside assistance to get it done so available internal resources can focus on strategic projects and those unavoidable fire drills without leaving systems in jeopardy.
  • Monitor your resources. You should never first hear of an outage from a customer. Full-time, 24/7 systems monitoring is a must for any enterprise. Fortunately, there are new, AI-driven technologies combining monitoring with advanced predictive maintenance capabilities for immediate fault detection and integrated, quick-turnaround response. Access is less expensive than you might think. (A bare-bones illustration of the monitoring idea appears after this list.)
  • Upgrade your break/fix plan. A disorganized parts closet or an eBay strategy won’t work. Rapid access to spares is vital to getting systems back online without delay. Especially for mission-critical systems, station repair kits on site or work with a vendor who can do so, or who can deliver spares within hours.
  • Invest in expertise. Parts are only part of the equation. There is significant skill involved in troubleshooting systems in these increasingly complex data center environments. The current IT skills gap may necessitate looking outside the enterprise to complement existing engineering capabilities with those of a third-party provider.
  • Test everything. Data centers evolve, but conducting proof-of-principle testing on each workload before any changes are made will cut down on virtualization problems before they happen. By the same token, systems recovery and DR scenarios are unknowns unless they are real-world verified. Try pulling a power cord and see what happens. Does that idea give you pause? It might be time for some enhancements.
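
As an illustration of the monitoring recommendation above, here is a bare-bones sketch of a periodic health check. The endpoint URL, check interval, and alert hook are hypothetical placeholders, and a production environment would rely on a dedicated monitoring platform rather than a script like this:

    # Minimal periodic health check; illustrative only.
    import time
    import urllib.error
    import urllib.request

    HEALTH_URL = "https://example.internal/health"   # hypothetical endpoint
    CHECK_INTERVAL_SECONDS = 60

    def notify_on_call(message):
        # Placeholder: wire this to email, SMS, or a paging service.
        print(f"ALERT: {message}")

    def check_once():
        # Alert on anything other than a clean HTTP 200 response.
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
                if response.status != 200:
                    notify_on_call(f"Health check returned HTTP {response.status}")
        except (urllib.error.URLError, OSError) as exc:
            notify_on_call(f"Health check failed: {exc}")

    if __name__ == "__main__":
        while True:
            check_once()
            time.sleep(CHECK_INTERVAL_SECONDS)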

There is good news for IT organizations already overwhelmed by demands to maintain more complex environments, execute the digital transformation, and achieve it all with fewer resources and less money, in a tight labor market to boot. Alternatives exist.

Third-party maintenance providers can take on a substantial portion of the equipment-related upkeep, troubleshooting, and support tasks in any data center. With a premium provider on board, it’s possible to radically reduce downtime and reach the availability and reliability goals you’d hoped to achieve when you took the virtualization path in the first place.

Paul Mercina

Director of Product Management

Paul Mercina brings over 20 years of experience in IT center project management to Park Place Technologies, where he has been a catalyst for shaping the evolutionary strategies of Park Place’s offering, tapping key industry insights and identifying customer pain points to help deliver the best possible service. A true visionary, Paul is currently implementing systems that will allow Park Place to grow and diversify their product offering based on customer needs for years to come.

His work is informed by more than a decade at Diebold Nixdorf, where he worked closely with software development teams to introduce new service design, supporting implementation of direct operations in a number of countries across the Americas, Asia and Europe that led to millions of dollars in cost savings for the company.

Mercina shares his technology and business expertise as an adjunct professor at Walsh University’s DeVille School of Business, where he instructs courses on business negotiations, business and project management, and marketing.
