Bill Schmarzo

Great Data Scientists Don’t Just Think Outside the Box, They Redefine the Box

Special thanks to Michael Shepherd, AI Research Strategist, Dell EMC Services, for his co-authorship. Learn more about Michael at the bottom of this post.

Imagine you wanted to determine how much solar energy could be generated from adding solar cells to a particular house. This is what Google’s Project Sunroof does with Deep Learning. Enter an address and Google uses a Deep Learning framework to estimate how much money you could save in energy costs with solar cells over 20 years (see Figure 1).

Figure 1: Google Project Sunroof

It’s a very cool application of Deep Learning. But let’s assume there “might” be an even better way to estimate solar energy savings. For example, suppose you want to use Deep Learning to estimate how much solar energy you could generate with solar panels on the Golden Gate Bridge (that probably wouldn’t be a very popular decision in San Francisco). The obvious approach would be to analyze several photos of the Golden Gate Bridge and estimate how clear the skies are based upon cloud coverage.

However, instead of estimating the potential solar energy generation based upon “cloud coverage,” what if we wanted to use “sunlight reflection” to generate the solar energy estimate (see Figure 2)?

Figure 2: Determining Best Predictive Variables for the Golden Gate Bridge

Or maybe you want to test another metric based upon the “sharpness of the shadows” generated by the bridge? Or another metric based upon how many people in the photo are wearing sunglasses? Or yet another metric based upon…

How do you know which of these variables – clouds or reflection or shadows or sunglasses or anything else – is the better predictor of solar energy generation? You try them all!
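Here is a minimal sketch of what “try them all” can look like in practice: score each candidate variable against the outcome and rank them. Everything in it is an assumption made for illustration. The variable names are hypothetical, the numbers are synthetic (so the resulting order is an artifact of how the fake data is generated, not a finding about the bridge), and absolute Pearson correlation is just one simple way to score a predictor.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_predictors(candidates, target):
    """Return (name, |r|) pairs sorted from strongest to weakest."""
    scores = [(name, abs(pearson(values, target)))
              for name, values in candidates.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Synthetic, made-up data: NOT real measurements of anything.
rng = random.Random(0)
energy = [rng.uniform(0, 10) for _ in range(200)]
candidates = {
    "cloud_coverage": [10 - e + rng.gauss(0, 3) for e in energy],
    "shadow_sharpness": [e + rng.gauss(0, 1) for e in energy],
    "sunglasses_count": [rng.uniform(0, 10) for _ in energy],
}

ranking = rank_predictors(candidates, energy)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

On a real project you would more likely compare candidates by prediction error on held-out data, but the loop is the same: propose a variable, score it, keep what works.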

This thought process highlights an important behavioral trait of the best data scientists: they have the imagination not just to “think outside the box” but to redefine the box as they search for variables and metrics that might be better predictors of performance.

The word “might” is a powerful enabler. “Might” is used to say or indicate that something is possible. It’s a data scientist’s most important concept, because “might” gives the data scientist the license to explore, be wrong, learn and try again.

“It Can’t Be Done” Is Not a Data Scientist Term

Andrew Ng, artificial intelligence visionary and fearless leader for many of us, wrote a recent article titled, “What Artificial Intelligence Can and Can’t Do Right Now.” In the article, Andrew states the following:

“Surprisingly, despite AI’s breadth of impact, the types of it being deployed are still extremely limited. Almost all of AI’s recent progress is through one type, in which some input data (A) is used to quickly generate some simple response (B). For example:”

Figure 3: What Machine Learning Can Do

While the use cases are limited today, the creativity with which data scientists are leveraging Big Data and existing Machine Learning and Deep Learning technologies is staggering. Let me give you one example of how data scientists from one of our Services teams at Dell EMC are thinking outside the box to uncover new ways to help our customers avoid issues in their IT environment and create a more effortless support experience.

Predicting Hard Drive Failures

Let’s say that you are capturing more than 260 different pieces of telemetry data several times a minute for the life of a device. Most of these variables have incomplete or sparse data, the collection timing doesn’t always line up neatly, and getting time continuity across the devices is a major challenge.
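To make the timing problem concrete, here is a small sketch of one common way to line up readings that arrive at irregular times: drop them onto a regular time grid and carry the last seen value forward. The function name, the variable names and the sample readings are all hypothetical, and this is not a description of how the Dell EMC team handled alignment.

```python
def align_to_grid(readings, start, stop, step):
    """Place (timestamp, value) readings on a regular grid.

    Each grid slot takes the most recent reading at or before that time.
    Slots that come before the first reading stay None (still missing).
    """
    ordered = sorted(readings)
    grid = []
    last_value = None
    i = 0
    t = start
    while t < stop:
        # Consume every reading that arrived at or before this grid time.
        while i < len(ordered) and ordered[i][0] <= t:
            last_value = ordered[i][1]
            i += 1
        grid.append(last_value)
        t += step
    return grid


# Hypothetical readings as (seconds, value); the two sensors never report together.
temp_c = [(3, 40.0), (17, 42.5), (31, 41.0)]
fan_rpm = [(0, 7), (25, 9)]

aligned = {
    "temp_c": align_to_grid(temp_c, start=0, stop=40, step=10),
    "fan_rpm": align_to_grid(fan_rpm, start=0, stop=40, step=10),
}
print(aligned)
# {'temp_c': [None, 40.0, 42.5, 42.5], 'fan_rpm': [7, 7, 7, 9]}
```

Note that the first temperature slot is still empty after alignment. With hundreds of sparse variables, gaps like this are the norm, which is part of what makes the problem hard.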

If you were using a traditional Machine Learning algorithm, the data science team would have to spend an overwhelming amount of time 1) feature engineering new variables based on domain knowledge, and 2) using trial-and-error to determine which combinations of variables should even be included in the Machine Learning model.

Instead, our Dell EMC Services data scientists used a patent-pending approach to Deep Learning to “pixelate” the data. They turned the more than 260 variables into device performance “images.” Once they had created these “images,” the team leveraged a recurrent neural network to find “shapes” and repeatable patterns in seemingly random pixels (see Figure 4).

Figure 4: Pixelating Telemetry Data
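The team’s actual method is patent pending and isn’t described here, so the following is only a guess at the general idea: treat each variable as a row, each time step as a column, and rescale every value to a pixel intensity. The `pixelate` function, the choice to reserve 0 for missing readings, and the sample data are all assumptions made for illustration.

```python
def pixelate(series_by_variable, missing_pixel=0):
    """Turn aligned telemetry into a 2D grid of pixel intensities.

    One row per variable (sorted by name), one column per time step.
    Observed values are min-max scaled per variable to 1..255, and
    missing readings (None) get their own reserved intensity.
    """
    image = []
    for name in sorted(series_by_variable):
        values = series_by_variable[name]
        present = [v for v in values if v is not None]
        low = min(present) if present else 0.0
        high = max(present) if present else 0.0
        span = high - low
        row = []
        for v in values:
            if v is None:
                row.append(missing_pixel)
            elif span == 0:
                row.append(128)  # constant series: use a mid-gray
            else:
                row.append(round(1 + 254 * (v - low) / span))
        image.append(row)
    return image


# Hypothetical, already time-aligned telemetry for one device.
telemetry = {
    "temp_c": [None, 40.0, 42.5, 42.5],
    "fan_rpm": [7, 7, 7, 9],
}

image = pixelate(telemetry)
for row in image:
    print(row)
# [1, 1, 1, 255]      <- fan_rpm
# [0, 1, 255, 255]    <- temp_c
```

One thing this toy version shows is that sparse data stops being a blocker: a missing reading is just another pixel value, and it is left to the network to decide whether the gaps themselves carry a pattern.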

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. RNNs can use their internal memory to process arbitrary sequences of inputs, which makes them well suited to tasks such as handwriting or speech recognition. In this case, however, instead of trying to decipher handwriting into words, the data science team used the RNN to decipher the seemingly random pixels into a prediction of the state of the device (see Figure 5).

Figure 5: Using RNNs to Identify Shapes and Patterns Buried in the Telemetry Data
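If the “internal memory” idea feels abstract, here is a bare-bones sketch of the forward pass of a simple (Elman-style) RNN: it reads a pixel “image” one time-step column at a time and squashes the final hidden state into a score between 0 and 1. The weights are random and untrained, so the score means nothing. The network size, the sample image and the function name are all assumptions, and the team’s real architecture is not described in this post.

```python
import math
import random


def rnn_forward(sequence, w_in, w_rec, bias, w_out, b_out):
    """Run a simple RNN over a sequence of input vectors.

    At each step: hidden = tanh(W_in * x + W_rec * previous_hidden + bias).
    The final hidden state is mapped through a sigmoid to a 0..1 score.
    """
    hidden = [0.0] * len(bias)
    for x in sequence:
        # The comprehension reads the previous hidden state for every unit
        # before the name is rebound to the new state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical 2-variable, 4-time-step pixel image (rows are variables).
image = [[1, 1, 1, 255], [0, 1, 255, 255]]

# Feed the image one time-step column at a time, scaled to 0..1.
sequence = [[row[t] / 255 for row in image] for t in range(len(image[0]))]

# Random, untrained weights: a real model would learn these from labeled devices.
rng = random.Random(42)
hidden_size = 3
n_vars = len(image)
w_in = [[rng.uniform(-1, 1) for _ in range(n_vars)] for _ in range(hidden_size)]
w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
bias = [0.0] * hidden_size
w_out = [rng.uniform(-1, 1) for _ in range(hidden_size)]

score = rnn_forward(sequence, w_in, w_rec, bias, w_out, b_out=0.0)
print(f"untrained score: {score:.3f}")
```

The recurrent term is what makes order matter: the same columns fed in a different sequence generally leave the network in a different final state, which is what lets an RNN pick up on patterns that unfold over time.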

I love this example because the team didn’t feel constrained to try to fit the square peg into the round “Machine Learning” hole. Instead, they used Deep Learning in a different context to decipher seemingly random pixels into a prediction of the health of a device. The data scientists didn’t wait until someone developed a better Machine Learning algorithm. Instead, they looked at the wide variety of Machine Learning and Deep Learning tools and algorithms available to them, and applied them to a different, but related use case. If we can predict the health of a device and the potential problems that could occur with that device, then we can also help customers prevent those problems, significantly enhancing their support experience and positively impacting their environment.


One of a data scientist’s most important characteristics is that they refuse to take “it can’t be done” as an answer. They are willing to try different variables and metrics, and different types of advanced analytic algorithms, to see if there is another way to predict performance.

By the way, I included this image just because I thought it was cool. This graphic measures the activity between different IT systems. Just like with data science, this image shows there’s no lack of variables to consider when building your Machine Learning and Deep Learning models!

Want more information on how Dell EMC Services uses data science?

Check out the “Decoding Customer DNA with Data Science” blog by Doug Schmitt, President, Dell EMC Global Services, and watch for the upcoming podcasts “A Conversation with Two Data Geeks” to hear directly from the data scientists behind our transformative technologies.

I would like to thank my co-author Michael Shepherd, AI Research Strategist, Dell EMC Services. Michael holds U.S. patents in both hardware and software and is a Technical Evangelist who provides vision through transformational AI data science. With experience in supply chain, manufacturing and services, he enjoys demonstrating real scenarios with the SupportAssist Intelligence Engine showing how predictive and proactive AI platforms running at the “speed of thought” are feasible in every industry.

By Bill Schmarzo

CTO, IoT and Analytics at Hitachi Vantara (aka “Dean of Big Data”)

Bill Schmarzo is the author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill also just completed a research paper on “Determining The Economic Value of Data”. Onalytica recently ranked Bill as #4 Big Data Influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.
