Hand Writing: Data, Data, Everywhere, But Let’s Just Stop And Think

Surely nobody who has the slightest awareness of what’s going on in the world can be unaware of the phrase ‘big data’. Almost every day the newspapers and television make reference to it, and it’s ubiquitous on the web. In November, a Google search for the phrase ‘big data’ yielded 1.8 billion hits. Google Trends shows that the rate of searches for the phrase is now about ten times what it was at the start of 2011.

The phrase defies an exact definition: one can define it in absolute terms (so many gigabytes, petabytes, etc.), in relative terms (relative to your computational resources), or in other ways. The obvious way for data to be big is by having many units (e.g., stars in an astronomical database), but it could also be big in terms of the number of variables (e.g., genomic data), the number of times something is observed (e.g., high-frequency financial data), or by virtue of its complexity (e.g., the number of potential interactions in a social network).

However one defines it, the point about ‘big data’ is the implied promise—of wonderful discoveries concealed within the huge mass, if only one can tease them out. That this is exactly the same promise that data mining made some twenty years ago is no accident. To a large extent, ‘big data’ is merely a media rebranding of ‘data mining’ (and of ‘business analytics’ in commercial contexts), and the media coining of the phrase ‘big data’ goes some way towards explaining the suddenness of the rise in interest.

Broadly speaking, there are two kinds of use of big data. One merely involves searching, sorting, matching, concatenating, and so on. So, for example, we get directions from Google Maps, we learn how far away the next bus is, and we find a shop stocking the item we want. The other use, which I suspect covers more problems, involves inference. That is, we don’t actually want to know about the data we have but about data we might have had or might have in the future. What will happen tomorrow? Which medicine will make us better? What is the true value of some attribute? What would have happened had things been different? While computational tools are the keys to the first kind of problem, statistical tools are the keys to the second.

If big data is another take on data mining (looking at it from the resources end, rather than the tool end) then perhaps we can learn from the data mining experience. We might suspect, for example, that interesting and valuable discoveries will be few and far between, that many discoveries will turn out to be uninteresting, or obvious, or already well-known, and that most will be explainable by data errors. For example, big data sets are often accumulated as a side-effect of some other process—calculating how much to charge for a basket of supermarket purchases, deciding what prescription is appropriate for each patient, marking the exams of individual students—so we must be wary of issues such as selection bias. Statisticians are very aware of such things, but others are not.
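The selection bias point can be made concrete with a small simulation. What follows is a minimal sketch under stated assumptions, not anything taken from a real data set: the population of "basket values", the rule that larger values are more likely to be recorded, and the function name `compare_samples` are all hypothetical, chosen only to illustrate how a very large data set gathered as a side-effect can mislead more than a small random sample.

```python
import random
import statistics


def compare_samples(seed=0, population_size=100_000, small_n=500):
    """Compare a large biased sample with a small random one (illustrative only)."""
    rng = random.Random(seed)

    # Hypothetical population: basket values with a true mean near 50.
    population = [rng.gauss(50, 15) for _ in range(population_size)]
    true_mean = statistics.fmean(population)

    # "Big" data set gathered as a side-effect: the chance that a value is
    # recorded rises with the value itself (an assumed selection mechanism).
    big_biased = [
        x for x in population
        if rng.random() < min(1.0, max(0.0, x / 100))
    ]

    # "Small" data set: a simple random sample from the same population.
    small_random = rng.sample(population, small_n)

    return {
        "true_mean": true_mean,
        "big_n": len(big_biased),
        "big_error": abs(statistics.fmean(big_biased) - true_mean),
        "small_n": small_n,
        "small_error": abs(statistics.fmean(small_random) - true_mean),
    }


if __name__ == "__main__":
    for name, value in compare_samples().items():
        print(f"{name}: {value:.2f}")
```

Under these assumed settings the biased data set is many times larger than the random sample, yet its mean sits further from the population mean. Collecting more data through the same biased mechanism shrinks the random error but leaves the bias untouched.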

As far as errors are concerned, a critical thing about big data is that the computer is a necessary intermediary: the only way you can look at the data is via plots, models, and diagnostics. You cannot examine a massive data set point by point. If data themselves are one step in a mapping from the phenomenon being studied, then looking at those data through the window of the computer is yet another step. No wonder errors and misunderstandings creep in.

Moreover, while there is no doubt that big data opens up new possibilities for discovery, that does not mean that ‘small data’ are redundant. Indeed, I might conjecture an informal theorem: the number of data sets of size n is inversely related to n. There will be vastly more small data sets than big ones, so we should expect proportionately more discoveries to emerge from small data sets.

Neither must we forget that data and information are not the same: it is possible to be data rich but information poor. The manure heap theorem is of relevance here. This mistaken theorem says that the probability of finding a gold coin in a heap of manure tends to 1 as the size of the heap tends towards infinity. Several times, after I’ve given talks about the potential of big data (stressing the need for effective tools, and describing the pitfalls outlined above), I have had people, typically from the commercial world, approach me to say that they’ve employed researchers to study their massive data sets, but to no avail: no useful information has been found.

Finally, the bottom line: to have any hope of extracting anything useful from big data, and to overcome the pitfalls outlined above, effective inferential skills are vital. That is, at the heart of extracting value from big data lies statistics.

By David J Hand

David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He is a Fellow of the British Academy, and a recipient of the Guy Medal of the Royal Statistical Society. He has served (twice) as President of the Royal Statistical Society, and is on the Board of the UK Statistics Authority. He has published 300 scientific papers and 25 books.

Original post can be seen in the Institute of Mathematical Statistics Bulletin, January/February 2014 (bulletin.imstat.org).
