Hand Writing: Data, Data, Everywhere, But Let’s Just Stop And Think

Surely nobody who has the slightest awareness of what’s going on in the world can be unaware of the phrase ‘big data’. Almost every day the newspapers and television make reference to it, and it’s ubiquitous on the web. In November, a Google search for the phrase ‘big data’ yielded 1.8 billion hits. Google Trends shows that the rate of searches for the phrase is now about ten times what it was at the start of 2011.

The phrase defies an exact definition: one can define it in absolute terms (so many gigabytes, petabytes, etc.), in relative terms (relative to your computational resources), or in other ways. The obvious way for data to be big is by having many units (e.g., stars in an astronomical database), but it could also be big in terms of the number of variables (e.g., genomic data), the number of times something is observed (e.g., high-frequency financial data), or by virtue of its complexity (e.g., the number of potential interactions in a social network).

However one defines it, the point about ‘big data’ is the implied promise—of wonderful discoveries concealed within the huge mass, if only one can tease them out. That this is exactly the same promise that data mining made some twenty years ago is no accident. To a large extent, ‘big data’ is merely a media rebranding of ‘data mining’ (and of ‘business analytics’ in commercial contexts), and the media coining of the phrase ‘big data’ goes some way towards explaining the suddenness of the rise in interest.

Broadly speaking, there are two kinds of use of big data. One merely involves searching, sorting, matching, concatenating, and so on. So, for example, we get directions from Google Maps, we learn how far away the next bus is, and we find a shop stocking the item we want. But the other use (and my personal feeling is that there are more problems of this kind) involves inference. That is, we don’t actually want to know about the data we have but about data we might have had or might have in the future. What will happen tomorrow? Which medicine will make us better? What is the true value of some attribute? What would have happened had things been different? While computational tools are the keys to the first kind of problem, statistical tools are the keys to the second.

If big data is another take on data mining (looking at it from the resources end, rather than the tool end) then perhaps we can learn from the data mining experience. We might suspect, for example, that interesting and valuable discoveries will be few and far between, that many discoveries will turn out to be uninteresting, or obvious, or already well-known, and that most will be explainable by data errors. For example, big data sets are often accumulated as a side-effect of some other process—calculating how much to charge for a basket of supermarket purchases, deciding what prescription is appropriate for each patient, marking the exams of individual students—so we must be wary of issues such as selection bias. Statisticians are very aware of such things, but others are not.
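The selection-bias point can be made concrete with a small simulation. The sketch below is illustrative only: the population, the spending distribution, and the rule that a customer is recorded with probability proportional to spend are all assumptions invented for the example, not anything taken from a real data set. It compares the mean of a large data set gathered as a side effect of such a process with the mean of a small simple random sample.

```python
import random


def simulate(seed=0, population_size=100_000, small_n=500):
    """Compare a large self-selected data set with a small random sample.

    Everything here is hypothetical: spend is drawn from an exponential
    distribution with mean 50, and a customer lands in the "big" data set
    with probability min(1, spend / 200).
    """
    rng = random.Random(seed)

    # Hypothetical population of weekly spend per customer.
    population = [rng.expovariate(1 / 50) for _ in range(population_size)]
    true_mean = sum(population) / population_size

    # Big data set accumulated as a side effect: heavier spenders are
    # more likely to be recorded, so the records are not representative.
    recorded = [x for x in population if rng.random() < min(1.0, x / 200)]
    big_biased_mean = sum(recorded) / len(recorded)

    # Small data set: a simple random sample from the same population.
    sample = rng.sample(population, small_n)
    small_random_mean = sum(sample) / small_n

    return true_mean, len(recorded), big_biased_mean, small_random_mean


if __name__ == "__main__":
    true_mean, n_recorded, big_mean, small_mean = simulate()
    print(f"true population mean:        {true_mean:.1f}")
    print(f"big biased set (n={n_recorded}): {big_mean:.1f}")
    print(f"small random sample (n=500): {small_mean:.1f}")
```

Under these assumptions the large data set has tens of thousands of records yet sits far from the population mean, while the sample of 500 lands close to it. Collecting more records by the same process would not remove the gap; only knowledge of how the data were selected can.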

As far as errors are concerned, a critical thing about big data is that the computer is a necessary intermediary: the only way you can look at the data is via plots, models, and diagnostics. You cannot examine a massive data set point by point. If data themselves are one step in a mapping from the phenomenon being studied, then looking at those data through the window of the computer is yet another step. No wonder errors and misunderstandings creep in.

Moreover, while there is no doubt that big data opens up new possibilities for discovery, that does not mean that ‘small data’ are redundant. Indeed, I might conjecture an informal theorem: the number of data sets of size n is inversely related to n. There will be vastly more small data sets than big ones, so we should expect proportionately more discoveries to emerge from small data sets.

Neither must we forget that data and information are not the same: it is possible to be data rich but information poor. The manure heap theorem is of relevance here. This mistaken theorem says that the probability of finding a gold coin in a heap of manure tends to 1 as the size of the heap tends towards infinity. Several times, after I’ve given talks about the potential of big data (stressing the need for effective tools, and describing the pitfalls outlined above), I have had people, typically from the commercial world, approach me to say that they’ve employed researchers to study their massive data sets, but to no avail: no useful information has been found.
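One way to see where the manure heap theorem goes wrong, offered as a sketch under an assumption the text does not state: suppose each of the n units in the heap independently contains a coin with some fixed probability p (both symbols are introduced here purely for illustration). Then:

```latex
P(\text{at least one coin in } n \text{ units}) = 1 - (1 - p)^{n}
\;\longrightarrow\;
\begin{cases}
1, & 0 < p \le 1,\\
0, & p = 0,
\end{cases}
\qquad \text{as } n \to \infty .
```

On this reading, the limit is 1 only when p is strictly positive, that is, only when the process that built the heap gave a coin some chance of ending up in it. If p = 0, the probability is zero for every n, and no amount of additional data changes that.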

Finally, the bottom line: to have any hope of extracting anything useful from big data, and to overcome the pitfalls outlined above, effective inferential skills are vital. That is, at the heart of extracting value from big data lies statistics.

By David J Hand

David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He is a Fellow of the British Academy, and a recipient of the Guy Medal of the Royal Statistical Society. He has served (twice) as President of the Royal Statistical Society, and is on the Board of the UK Statistics Authority. He has published 300 scientific papers and 25 books.

The original post can be seen in the Institute of Mathematical Statistics Bulletin, January/February 2014: bulletin.imstat.org
