Hand Writing: Data, Data, Everywhere, But Let’s Just Stop And Think


Surely nobody who has the slightest awareness of what’s going on in the world can be unaware of the phrase ‘big data’. Almost every day the newspapers and television make reference to it, and it’s ubiquitous on the web. In November, a Google search for the phrase ‘big data’ yielded 1.8 billion hits. Google Trends shows that the rate of searches for the phrase is now about ten times what it was at the start of 2011.

The phrase defies an exact definition: one can define it in absolute terms (so many gigabytes, petabytes, etc) or in relative terms (relative to your computational resources), and in other ways. The obvious way for data to be big is by having many units (e.g., stars in an astronomical database), but it could also be big in terms of the number of variables (e.g., genomic data), the number of times something is observed (e.g., high frequency financial data), or by virtue of its complexity (e.g., the number of potential interactions in a social network).

However one defines it, the point about ‘big data’ is the implied promise—of wonderful discoveries concealed within the huge mass, if only one can tease them out. That this is exactly the same promise that data mining made some twenty years ago is no accident. To a large extent, ‘big data’ is merely a media rebranding of ‘data mining’ (and of ‘business analytics’ in commercial contexts), and the media coining of the phrase ‘big data’ goes some way towards explaining the suddenness of the rise in interest.

Broadly speaking, there are two kinds of use of big data. One merely involves searching, sorting, matching, concatenating, and so on. So, for example, we get directions from Google Maps, we learn how far away the next bus is, and we find a shop stocking the item we want. The other use, which I suspect accounts for more problems, involves inference. That is, we don’t actually want to know about the data we have but about data we might have had or might have in the future. What will happen tomorrow? Which medicine will make us better? What is the true value of some attribute? What would have happened had things been different? While computational tools are the keys to the first kind of problem, statistical tools are the keys to the second.

If big data is another take on data mining (looking at it from the resources end, rather than the tool end) then perhaps we can learn from the data mining experience. We might suspect, for example, that interesting and valuable discoveries will be few and far between, that many discoveries will turn out to be uninteresting, or obvious, or already well-known, and that most will be explainable by data errors. For example, big data sets are often accumulated as a side-effect of some other process—calculating how much to charge for a basket of supermarket purchases, deciding what prescription is appropriate for each patient, marking the exams of individual students—so we must be wary of issues such as selection bias. Statisticians are very aware of such things, but others are not.
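The selection bias warning can be made concrete with a small simulation. The sketch below is illustrative only and is not from the original column: the population (weekly spend drawn from an exponential distribution with mean 50) and the recording rule (higher spenders are more likely to end up in the data set) are both assumptions, and `simulate_selection_bias` is a hypothetical helper name.

```python
import random


def simulate_selection_bias(n=100_000, seed=0):
    """Compare a population mean with the mean of a self-selected record set.

    All numbers here are made-up assumptions chosen for illustration.
    """
    rng = random.Random(seed)

    # Assumed population: weekly spend, exponentially distributed, mean 50.
    population = [rng.expovariate(1 / 50) for _ in range(n)]

    # Assumed recording mechanism: the chance that a shopper appears in the
    # data grows with spend (capped at 1), so the records are not a random
    # sample of the population.
    recorded = [x for x in population if rng.random() < min(1.0, x / 200)]

    population_mean = sum(population) / len(population)
    recorded_mean = sum(recorded) / len(recorded)
    return population_mean, recorded_mean, len(recorded)


if __name__ == "__main__":
    pop_mean, rec_mean, n_recorded = simulate_selection_bias()
    print(f"population mean spend: {pop_mean:.1f}")
    print(f"recorded mean spend:   {rec_mean:.1f} (from {n_recorded} records)")
```

Under these assumptions the recorded mean comes out well above the population mean, however many records are collected. More data of the same kind does not remove the bias; only knowledge of how the data were selected does.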

As far as errors are concerned, a critical thing about big data is that the computer is a necessary intermediary: the only way you can look at the data is via plots, models, and diagnostics. You cannot examine a massive data set point by point. If data themselves are one step in a mapping from the phenomenon being studied, then looking at those data through the window of the computer is yet another step. No wonder errors and misunderstandings creep in.

Moreover, while there is no doubt that big data opens up new possibilities for discovery, that does not mean that ‘small data’ are redundant. Indeed, I might conjecture an informal theorem: the number of data sets of size n is inversely related to n. There will be vastly more small data sets than big ones, so we should expect proportionately more discoveries to emerge from small data sets.

Neither must we forget that data and information are not the same: it is possible to be data rich but information poor. The manure heap theorem is of relevance here. This mistaken theorem says that the probability of finding a gold coin in a heap of manure tends to 1 as the size of the heap tends towards infinity. Several times, after I’ve given talks about the potential of big data (stressing the need for effective tools, and describing the pitfalls outlined above), I have had people, typically from the commercial world, approach me to say that they’ve employed researchers to study their massive data sets, but to no avail: no useful information has been found.
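One way to read why the manure heap theorem is mistaken, offered as a sketch under an assumption that is not in the original: suppose each of the n records independently contains a useful finding with some fixed probability p. Then:

```latex
P(\text{at least one finding among } n \text{ records}) = 1 - (1 - p)^n,
\qquad
\lim_{n \to \infty} \bigl[ 1 - (1 - p)^n \bigr] =
\begin{cases}
1, & p > 0, \\
0, & p = 0.
\end{cases}
```

The limit is 1 only when p is strictly positive. If the data were never collected in a way that captures the information of interest, then p = 0 and no amount of additional data changes the outcome.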

Finally, the bottom line: to have any hope of extracting anything useful from big data, and to overcome the pitfalls outlined above, effective inferential skills are vital. That is, at the heart of extracting value from big data lies statistics.

By David J Hand

David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He is a Fellow of the British Academy, and a recipient of the Guy Medal of the Royal Statistical Society. He has served (twice) as President of the Royal Statistical Society, and is on the Board of the UK Statistics Authority. He has published 300 scientific papers and 25 books.

The original post can be seen in the Institute of Mathematical Statistics Bulletin, January/February 2014 (bulletin.imstat.org).

