Hand Writing: Data, Data, Everywhere, But Let’s Just Stop And Think

Hand Writing: Data, Data, Everywhere, But Let’s Just Stop And Think

Hand Writing: Data, Data, Everywhere, But Let’s Just Stop And Think

Surely nobody who has the slightest awareness of what’s going on in the world can be unaware of the phrase ‘big data’. Almost every day the newspapers and television make reference to it, and it’s ubiquitous on the web. In November, a Google search for the phrase ‘big data’ yielded 1.8 billion hits. Google Trends shows that the rate of searches for the phrase is now about ten times what it was at the start of 2011.

The phrase defies an exact definition: one can define it in absolute terms (so many gigabytes, petabytes, etc) or in relative terms (relative to your computational resources), and in other ways. The obvious way for data to be big is by having many units (e.g., stars in an astronomical database), but it could also be big in terms of the number of variables (e.g., genomic data), the number of times something is observed (e.g., high frequency financial data), or by virtue of its complexity (e.g., the number of potential interactions in a social network).

Data-Scientists_Infographic

However one defines it, the point about ‘big data’ is the implied promise—of wonderful discoveries concealed within the huge mass, if only one can tease them out. That this is exactly the same promise that data mining made some twenty years ago is no accident. To a large extent, ‘big data’ is merely a media rebranding of ‘data mining’ (and of ‘business analytics’ in commercial contexts), and the media coining of the phrase ‘big data’ goes some way towards explaining the suddenness of the rise in interest.

Broadly speaking, there are two kinds of use of big data. One merely involves searching, sorting, matching, concatenating, and so on. So, for example, we get directions from Google maps, we learn how far away the next bus is, and we find a shop stocking the item we want. But the other use, and my personal feeling is there are more problems of this kind, involves inference. That is, we don’t actually want to know about the data we have but about data we might have had or might have in the future. What will happen tomorrow? Which medicine will make us better? What is the true value of some attribute? What would have happened had things been different? While computational tools are the keys to the first kind of problem, statistical tools are the keys to the second.

If big data is another take on data mining (looking at it from the resources end, rather than the tool end) then perhaps we can learn from the data mining experience. We might suspect, for example, that interesting and valuable discoveries will be few and far between, that many discoveries will turn out to be uninteresting, or obvious, or already well-known, and that most will be explainable by data errors. For example, big data sets are often accumulated as a side-effect of some other process—calculating how much to charge for a basket of supermarket purchases, deciding what prescription is appropriate for each patient, marking the exams of individual students—so we must be wary of issues such as selection bias. Statisticians are very aware of such things, but others are not.

As far as errors are concerned, a critical thing about big data is that the computer is a necessary intermediary: the only way you can look at the data is via plots, models, and diagnostics. You cannot examine a massive data set point by point. If data themselves are one step in a mapping from the phenomenon being studied, then looking at those data through the window of the computer is yet another step. No wonder errors and misunderstandings creep in.

Moreover, while there is no doubt that big data opens up new possibilities for discovery, that does not mean that ‘small data’ are redundant. Indeed, I might conjecture an informal theorem: the number of data sets of size n is inversely related to n. There will be vastly more small data sets than big ones, so we should expect proportionately more discoveries to emerge from small data sets.

Neither must we forget that data and information are not the same: it is possible to be data rich but information poor. The manure heap theorem is of relevance here. This mistaken theorem says that the probability of finding a gold coin in a heap of manure tends to 1 as the size of the heap tends towards infinity. Several times, after I’ve given talks about the potential of big data (stressing the need for effective tools, and describing the pitfalls outlined above), I have had people, typically from the commercial world, approach me to say that they’ve employed researchers to study their massive data sets, but to no avail: no useful information has been found.

Finally, the bottom line: to have any hope of extracting anything useful from big data, and to overcome the pitfalls outlined above, effective inferential skills are vital. That is, at the heart of extracting value from big data lies statistics.

David-J-HandBy David J Hand

David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He is a Fellow of the British Academy, and a recipient of the Guy Medal of the Royal Statistical Society. He has served (twice) as President of the Royal Statistical Society, and is on the Board of the UK Statistics Authority. He has published 300 scientific papers and 25 books.

Original post can be seen in the Institute of Mathematical Statistics Bulletin, January/February 2014bulletin.imstat.org

About CloudTweaks

Established in 2009, CloudTweaks is recognized as one of the leading authorities in connected technology information and services.

We embrace and instill thought leadership insights, relevant and timely news related stories, unbiased benchmark reporting as well as offer green/cleantech learning and consultive services around the world.

Our vision is to create awareness and to help find innovative ways to connect our planet in a positive eco-friendly manner.

In the meantime, you may connect with CloudTweaks by following and sharing our resources.

View All Articles

Sorry, comments are closed for this post.

Red Hat Offers Container Native Persistent Storage for Linux Containers

Red Hat Offers Container Native Persistent Storage for Linux Containers

Red Hat Offers Container Storage Latest Red Hat Gluster Storage release enables greater agility and efficiency for OpenShift developers deploying application containers in production SAN FRANCISCO – RED HAT SUMMIT – June 28, 2016 – Red Hat, Inc. (NYSE: RHT), the world’s leading provider of open source solutions, today announced new storage innovations designed to enable developers to…

Dismissal Of Class Action Lawsuit A Setback For Internet Privacy

Dismissal Of Class Action Lawsuit A Setback For Internet Privacy

A Setback For Internet Privacy On Monday the Third U.S. Circuit Court of Appeals (a federal appeals court) unanimously dismissed a class action lawsuit by parents of children under the age of 13 who had used Nickelodeon’s websites against Google and Viacom (which owns the Nickelodeon websites). This was a result of litigation beginning in…

Controversial Chinese Cybersecurity Law Under Review Again

Controversial Chinese Cybersecurity Law Under Review Again

Cybersecurity Law BEIJING. The National People’s Congress, the equivalence of the Chinese Parliament, moved forward in drafting a second version of a controversial cybersecurity law first introduced almost a year ago. This means the law is thought to be closer to passing and will bring greater censorship for both foreign and domestic citizens and businesses.…

Personal Account of Google CEO Compromised

Personal Account of Google CEO Compromised

Personal Account Compromised The security of our information online, whether it’s our banking details, emails or personal information, is important. Hackers pose a very real threat to our privacy when there are vulnerabilities in the security of the services we use online. It can be worrying then when the CEO of perhaps the largest holder…

How You Can Improve Customer Experience With Fast Data Analytics

How You Can Improve Customer Experience With Fast Data Analytics

Fast Data Analytics In today’s constantly connected world, customers expect more than ever before from the companies they do business with. With the emergence of big data, businesses have been able to better meet and exceed customer expectations thanks to analytics and data science. However, the role of data in your business’ success doesn’t end…

Most Active Internet Of Things Investors In The Last 5 Years

Most Active Internet Of Things Investors In The Last 5 Years

Most Active Internet Of Things Investors A recent BI Intelligence report claimed that the Internet of Things (IoT) is on its way to becoming the largest device market in the world. Quite naturally, such exponential growth of the IoT market has prompted a number of high-profile corporate investors and smart money VCs to bet highly…

SaaS And The Cloud Are Still Going Strong

SaaS And The Cloud Are Still Going Strong

SaaS And The Cloud With the results of Cisco Global Could Index: 2013-2018 and Hosting and Cloud Study 2014, predictions for the future of cloud computing are notable. Forbes reported that spending on infrastructure-related services has increased as public cloud computing uptake spreads, and reflected on Gartner’s Public Cloud Services Forecast. The public cloud service…

Infographic: IoT Programming Essential Job Skills

Infographic: IoT Programming Essential Job Skills

Learning To Code As many readers may or may not know we cover a fair number of topics surrounding new technologies such as Big data, Cloud computing , IoT and one of the most critical areas at the moment – Information Security. The trends continue to dictate that there is a huge shortage of unfilled…

Who’s Who In The Booming World Of Data Science

Who’s Who In The Booming World Of Data Science

The World of Data Science The nature of work and business in today’s super-connected world means that every second of every day, the world produces an astonishing amount of data. Consider some of these statistics; every minute, Facebook users share nearly 2.5 million pieces of content, YouTube users upload over 72 hours of content, Apple…

The Future Of Work: What Cloud Technology Has Allowed Us To Do Better

The Future Of Work: What Cloud Technology Has Allowed Us To Do Better

What Cloud Technology Has Allowed Us to Do Better The cloud has made our working lives easier, with everything from virtually unlimited email storage to access-from-anywhere enterprise resource planning (ERP) systems. It’s no wonder the 2013 cloud computing research IDG survey revealed at least 84 percent of the companies surveyed run at least one cloud-based application.…

Cloud Infographic: The Explosive Growth Of The Cloud

Cloud Infographic: The Explosive Growth Of The Cloud

The Explosive Growth Of The Cloud We’ve been covering cloud computing extensively over the past number of years on CloudTweaks and have truly enjoyed watching the adoption and growth of it. Many novices are still trying to wrap their mind around what the cloud it is and what it does, while others such as thought…

Cloud Infographic – Guide To Small Business Cloud Computing

Cloud Infographic – Guide To Small Business Cloud Computing

Small Business Cloud Computing Trepidation is inherently attached to anything that involves change and especially if it involves new technologies. SMBs are incredibly vulnerable to this fear and rightfully so. The wrong security breach can incapacitate a small startup for good whereas larger enterprises can reboot their operations due to the financial stability of shareholders. Gordon Tan contributed an…