When To Use Supervised And Unsupervised Data Mining

Supervised And Unsupervised Data Mining

Data mining techniques come in two main forms: supervised (also known as predictive or directed) and unsupervised (also known as descriptive or undirected). Both categories encompass functions capable of finding different hidden patterns in large data sets.

Although data analytics tools are placing more emphasis on self-service, it’s still useful to know which data mining technique fits your needs before you begin.

(Infographic Source: New Jersey Institute of Technology)

Supervised Data Mining

Supervised data mining techniques are appropriate when you have a specific target value you’d like to predict about your data. The targets can have two or more possible outcomes, or even be a continuous numeric value (more on that later).

To use these methods, you ideally have a subset of data points for which this target value is already known. You use that data to build a model of what a typical data point looks like when it has one of the various target values. You then apply that model to data for which that target value is currently unknown. The algorithm identifies the “new” data points that match the model of each target value.

Now let’s clarify that with some specific demonstrations:

Classification

As a supervised data mining method, classification begins with the method described above.

Imagine you’re a credit card company and you want to know which customers are likely to default on their payments in the next few years.

You use data on past customers, both those who defaulted and those who did not over an extended period, as build data (or training data) to generate a classification model. You then run that model on the customers you’re curious about. The algorithm looks for customers whose attributes match the attribute patterns of previous defaulters or non-defaulters, and categorizes each one according to the group it most closely matches. You can then use these groupings as indicators of which customers are most likely to default.

A classification model can also have more than two possible values in the target attribute. The values could be anything from the shirt colors customers are most likely to buy to the promotional methods they’ll respond to (mail, email, phone). A two-valued target works the same way, such as whether or not they’ll use a coupon.
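As a rough illustration of the credit card example, here is a minimal sketch using a k-nearest-neighbors vote, which is only one of many classification algorithms. The attributes (credit utilization and late payments in the past year), the customer records, and the function name classify are all invented for this example.

```python
from collections import Counter
from math import dist

# Hypothetical training data: (credit utilization, late payments last year) -> known outcome
training = [
    ((0.90, 4), "default"),
    ((0.85, 3), "default"),
    ((0.95, 5), "default"),
    ((0.20, 0), "no default"),
    ((0.35, 1), "no default"),
    ((0.10, 0), "no default"),
]

def classify(point, training, k=3):
    """Label a point by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda row: dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Customers whose outcome is not yet known (also hypothetical)
new_customers = {"A": (0.88, 3), "B": (0.15, 0)}
predictions = {name: classify(attrs, training) for name, attrs in new_customers.items()}
print(predictions)
```

In practice the attributes would usually be rescaled first so that no single attribute dominates the distance calculation.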

Regression

Regression is similar to classification except that the targeted attribute’s values are numeric, rather than categorical. The order or magnitude of the value is significant in some way.

To reuse the credit card example, if you wanted to know how much debt new customers are likely to accumulate on their credit cards, you would use a regression model.

Simply supply data from current and past customers with their maximum previous debt level as the target value, and a regression model will be built on that training data. Once run on the new customers, the regression model will match attribute values with predicted maximum debt levels and assign the predictions to each customer accordingly.

Regression could also be used to predict the age of customers from demographic and purchasing data, or to predict the frequency of insurance claims.
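The debt example can be sketched with an ordinary least-squares line fitted to a single attribute. This is a simplified illustration: the choice of annual income as the predictor, the figures, and the names fit_line and predict_debt are assumptions made for the example, and a real model would normally use many attributes.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data from current and past customers (in thousands of dollars)
incomes = [30, 45, 60, 75, 90]
max_debt = [2.1, 3.4, 5.2, 6.3, 8.0]  # target: maximum previous debt level

slope, intercept = fit_line(incomes, max_debt)

def predict_debt(income):
    """Predicted maximum debt for a new customer with the given income."""
    return slope * income + intercept

print(round(predict_debt(50), 2))
```

Unlike the classification case, the output is a number on a continuous scale, so the size of the prediction carries meaning.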

Anomaly Detection

Anomaly detection identifies data points atypical of a given distribution. In other words, it finds the outliers. Simpler data analysis techniques can identify obvious outliers, but data mining anomaly detection techniques identify much more subtle attribute patterns and the data points that fail to conform to those patterns. Many taxonomies classify anomaly detection as unsupervised, since known examples of anomalies are often too rare to train on.

Most applications of anomaly detection involve fraud detection, such as for insurance or credit card companies.
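The following sketch shows the simpler end of the spectrum mentioned above: a single-attribute outlier check based on the median absolute deviation. It is a stand-in for illustration, not a full anomaly detection model; the transaction amounts, the 3.5 threshold, and the function name flag_anomalies are assumptions.

```python
from statistics import median

def flag_anomalies(values, threshold=3.5):
    """Return values whose modified z-score (based on the median) exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    if mad == 0:
        return []  # all values (nearly) identical; nothing to flag in this sketch
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical credit card transaction amounts for one customer
amounts = [42.0, 38.5, 45.2, 40.1, 39.9, 41.3, 950.0, 43.7]
flagged = flag_anomalies(amounts)
print(flagged)
```

The median is used instead of the mean because a single extreme value inflates the mean and standard deviation enough to hide itself in a small sample.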

Unsupervised Data Mining

Unsupervised data mining does not focus on predetermined attributes, nor does it predict a target value. Rather, unsupervised data mining finds hidden structure and relationships in the data.

Clustering

Clustering, the most open-ended data mining technique, finds and groups data points with natural similarities.

Clustering is used when there are no obvious natural groupings, in which case the data may be difficult to explore. Clustering the data can reveal groups and categories you were previously unaware of. These new groups may be fit for further data mining operations from which you may discover new correlations.
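A minimal sketch of one common clustering algorithm, k-means, is shown below. The customer attributes (store visits per month and average basket size), the data points, and the naive choice of the first k points as starting centroids are all assumptions made to keep the example short and repeatable.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each to its nearest centroid."""
    centroids = list(points[:k])  # naive initialization: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (store visits per month, average basket size)
customers = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, clusters = k_means(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No target value is supplied anywhere; the groups emerge only from how close the points are to one another. The number of clusters k still has to be chosen by the analyst.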

Association

Frequently used for market basket analysis, association models identify common co-occurrences among a list of possible events. Market basket analysis examines all items available in a particular medium, such as the products on store shelves or in a catalogue, and finds the products that are commonly sold together.

This operation produces association rules. Such a rule could be a statement declaring “80 percent of people who buy charcoal, hamburger meat, and buns also buy sliced cheese,” or, in a less “market basket” style example, “90 percent of Detroit citizens who root for the Tigers, the Lions, and the Pistons also favor the Red Wings over other hockey teams.”

Such rules can be used to personalize the customer experience to promote certain events or actions. This can be accomplished by organizing store shelves with associated items nearby, or by tracking customer movements through a website in real time to present them with relevant product links.
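The percentage in a rule like the charcoal example is usually called the rule's confidence: among baskets containing the "if" items, the share that also contain the "then" items. The sketch below computes it, along with support (the share of all baskets containing every item in the rule), for a handful of made-up baskets. The basket contents and the function name rule_stats are assumptions for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical shopping baskets
baskets = [
    {"charcoal", "hamburger meat", "buns", "sliced cheese"},
    {"charcoal", "hamburger meat", "buns", "sliced cheese", "ketchup"},
    {"charcoal", "hamburger meat", "buns"},
    {"buns", "sliced cheese"},
    {"charcoal", "hamburger meat", "buns", "sliced cheese"},
    {"milk", "eggs"},
    {"charcoal", "hamburger meat", "buns", "sliced cheese", "soda"},
]

support, confidence = rule_stats(
    baskets, {"charcoal", "hamburger meat", "buns"}, {"sliced cheese"}
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A full association model would search over many candidate item combinations and keep only the rules whose support and confidence clear chosen minimums.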

Feature Extraction

Feature extraction creates new features based on attributes of your data. These new features describe a combination of significant attribute value patterns in your data.

If violence, heroism, and fast cars were attributes of a movie, then the feature may be “action,” akin to a genre or a theme. This concept can be used to extract the themes of a document based on the frequencies of certain key words.

Representing data points by their features can help compress the data (trading dozens of attributes for one feature), make predictions (data with this feature often has these attributes as well), and recognize patterns. Additionally, features can be used as new attributes, which can improve the efficiency and accuracy of supervised learning techniques (classification, regression, anomaly detection, etc.).
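One way to picture the movie example is to compute the first principal component of the attributes, a single weighted combination that captures most of their shared variation. The sketch below does this with power iteration in plain Python. Principal component analysis is only one of several feature extraction methods, and the movie scores and the function name first_component are invented for the example.

```python
def first_component(rows, iterations=100):
    """Return (weights, scores): the first principal component and each row's value on it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the attributes
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):  # power iteration converges to the dominant eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical movies scored 0-10 on (violence, heroism, fast cars)
movies = {
    "Movie A": (9, 8, 9),
    "Movie B": (8, 9, 7),
    "Movie C": (1, 2, 1),
    "Movie D": (2, 1, 2),
    "Movie E": (5, 5, 4),
}

weights, values = first_component(list(movies.values()))
action_score = dict(zip(movies, values))  # "action" is a label a person gives the feature
print(weights)
print(action_score)
```

Here three attributes collapse into one number per movie. The algorithm finds the combination, but naming it "action" remains a human interpretation.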

Knowing your goals and the appropriate techniques to achieve them can help your data mining operations run smoothly and effectively. Different data is appropriate for different insights, and understanding what you’re asking of your data analysts expedites the process for everyone.

By Keith Cawley
