Taking the measure of big data: More is more

Image credit: antony_mayfield/Flickr

You may have heard that some in the industry believe that 99.9% of big data is "worthless." Indeed, I've written about the scientists at the Large Hadron Collider who discard the vast majority of the petabytes of data their experiments produce. And we all know about the huge swathes of our SANs wasted on unused data. Organizations confront mountains of extraneous and redundant data. That's a well-understood problem.

But is it really "worthless"?

I vigorously disagree. Even if data is not useful to one analyst for the task at hand, it may become useful to another analyst for a subsequent task. And if it is redundant, understanding what processes are causing the redundancy might lead to improved business performance. Understanding the whole of a big data opportunity means being able to discern which parts of your data set have value and which do not. But what you discard today is not necessarily worthless tomorrow: it depends on the questions you ask the data, how you ask them, and when.

More to the point, as Hal Varian, Google's chief economist, observes, valid predictive analytics, perhaps the most important use of big data, starts with a random data set, even if it's a small one. As engineers at Google know, to get a truly random sample, that sliver of data needs to come from a massive pool of information. Without a large enough pool to draw from, your sample, and the analytics built on it, loses validity. In other words, even the data that goes unused is what makes big data the best source of valid data sets for modeling.
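Varian's point can be sketched in a few lines of Python. This is a minimal illustration, not anyone's production pipeline: the population, sample size, and the use of a simple mean estimate are all assumptions made for the example.

```python
import random
import statistics

random.seed(42)  # for a reproducible illustration

# A hypothetical "big data" pool: one million observations,
# most of which will never be analyzed directly.
population = [random.gauss(100, 25) for _ in range(1_000_000)]

# A small but truly random sample drawn from the full pool.
sample = random.sample(population, 1_000)

true_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean: {true_mean:.2f}, sample mean: {sample_mean:.2f}")
```

The sliver of sampled data tracks the full pool closely precisely because it was drawn at random from the whole; sample from a small or biased subset instead, and that guarantee disappears.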

Philip Russom, director of data management at The Data Warehousing Institute (TDWI), points to another critical aspect of big data. He argues that big data is discovery-oriented: it's where you look for facts you never knew before. He warns that if you over-massage big data, you risk eliminating outliers (evidence of credit card fraud, for example) that might be exactly what you need to find.
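Russom's warning is easy to demonstrate. In this hypothetical sketch, the transaction amounts and the $1,000 "cleaning" cap are invented for illustration; the point is that an aggressive cleaning rule silently discards the very rows a fraud analyst would want to see.

```python
# Hypothetical card transactions: mostly routine amounts, plus two
# anomalous charges of the kind that might signal fraud.
transactions = [42.0, 18.5, 77.3, 23.1, 55.0, 9800.0, 31.2, 12500.0]

# "Over-massaging": a naive cleaning rule that treats any amount
# over $1,000 as bad data and drops it before analysis.
CLEAN_CAP = 1_000.0
cleaned = [t for t in transactions if t <= CLEAN_CAP]

# The rows the cleaning step threw away are exactly the outliers
# worth investigating.
discarded = [t for t in transactions if t > CLEAN_CAP]
print(discarded)  # → [9800.0, 12500.0]
```

The cleaned data set looks tidy and trustworthy, but the discovery, the two suspicious charges, happened in the part that was thrown away.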

Blithely dismissing 99.9% of big data as worthless shows a lack of insight into the big data problems modern enterprises face. Seeing the world strictly through a traditional database mentality, where only the purest, cleanest, most massaged, and most manageable data set is trusted, leads to bald and false proclamations about the worth of big data.

Such a view does not lead to effective predictive analytics. It strikes me as a purely defensive position on the part of those who do not have the tools to exploit the enormous worth of big data.

Related reading:

Invent new possibilities with HANA, SAP's game-changing in-memory software

SAP Sybase IQ Database 15.4 provides advanced analytic techniques to unlock critical business insights from Big Data

SAP Sybase Adaptive Server Enterprise is a high-performance RDBMS for mission-critical, data-intensive environments. It ensures highest operational efficiency and throughput on a broad range of platforms.

SAP SQL Anywhere is a comprehensive suite of solutions providing data management, synchronization, and data exchange technologies that enable the rapid development and deployment of database-powered applications in remote and mobile environments.

Overview of SAP database technologies