Making big data smaller
MIT researchers develop a new approach to working with big data: reduce it to a size that can be managed and analyzed with conventional tools
The value of big data is no longer a secret to most companies. The main problem with big data, though, is that it’s, well, big. That is, the volume of data that companies would like to understand is so large, and arriving so fast, that simply organizing, manipulating and analyzing it is a challenge and, at times, prohibitively expensive. Conventional relational databases often can’t handle today’s big data, or can’t process it in a reasonable amount of time.
The traditional approach to solving this problem has been to come up with more efficient and powerful systems to process larger and larger amounts of data. For example, massively parallel-processing (MPP) databases, distributed file systems and cloud-based infrastructures have all been applied to the problem. Even with these solutions, the sheer volume and continued growth of big data remain a challenge.
Several computer science researchers at MIT, however, are taking a new approach to the problem: they’ve come up with a method to, effectively, make big data smaller. In a paper titled “The Single Pixel GPS: Learning Big Data Signals from Tiny Coresets,” Dan Feldman, Cynthia Sung, and Daniela Rus outline this new approach. The basic idea is to take big data and extract a coreset, defined as “a smart compression of the input signal,” then query and analyze these compressed data. Their compression method also has the benefit of being applicable to data as it’s received, say daily or hourly, in manageable chunks.
Put another way, they take an incoming stream of data and identify patterns via statistical estimation (e.g., regression analysis). By then representing the true (big) data with this much smaller set of approximations (along with a small set of randomly selected data points), they end up with a data set that can be managed and analyzed using traditional tools and techniques, and that should yield results similar to analyzing the original data. It’s a potentially revolutionary approach that could be applied to a wide range of big data problems.
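To make the pattern concrete, here is a minimal Python sketch of the general idea. It is not the authors’ algorithm (which chooses segments adaptively and comes with provable error guarantees); it simply splits a signal into fixed chunks, fits a line to each by least squares, and keeps the fitted coefficients plus a small random sample of raw points. All function names and parameters here are illustrative.

```python
import numpy as np

def piecewise_linear_coreset(signal, num_segments, sample_size, seed=0):
    """Compress a 1-D signal into a coreset-style summary (illustrative).

    Splits the signal into equal-width chunks, fits a least-squares line
    to each, and keeps only the fitted coefficients plus a small uniform
    random sample of raw points.
    """
    n = len(signal)
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, n, num_segments + 1, dtype=int)

    segments = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        t = np.arange(start, end)
        # Least-squares line fit: signal[t] ~= slope * t + intercept
        slope, intercept = np.polyfit(t, signal[start:end], deg=1)
        segments.append((start, end, slope, intercept))

    # A small random sample of raw points supports later error checks.
    sample_idx = rng.choice(n, size=sample_size, replace=False)
    sample = [(int(i), float(signal[i])) for i in sample_idx]
    return segments, sample

def reconstruct(segments, n):
    """Rebuild an approximate signal from the segment summaries."""
    approx = np.empty(n)
    for start, end, slope, intercept in segments:
        t = np.arange(start, end)
        approx[start:end] = slope * t + intercept
    return approx
```

Because each chunk is summarized independently, the same sketch also works on streaming data: compress each day’s or hour’s chunk as it arrives and keep only the summaries.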
As an example, they tested the approach on the problem of discerning patterns in GPS data. Instead of having to process every coordinate collected, they reduced the problem to one of estimating a set of common routes or paths (the coresets) via linear regression, which could then be used for analysis. They were able to compress a data set of over 2.6 million data points from San Francisco taxis down to between 0.14% and 1.79% of its original size while still preserving the core information in the data. The amount of compression achieved depended on the quality of the approximations to the true data.
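The bookkeeping behind ratios like these is easy to see in a sketch. The snippet below is illustrative only: it uses a synthetic random-walk track rather than the taxi data, and fixed-length chunks rather than the paper’s adaptive segmentation, but it shows how a few fitted coefficients can stand in for thousands of raw coordinates.

```python
import numpy as np

# Illustrative only: synthetic track and chosen sizes, not the taxi dataset.
rng = np.random.default_rng(1)
n_points, seg_len = 100_000, 500
t = np.arange(n_points)
lat = np.cumsum(rng.normal(0, 1e-5, n_points))  # smooth, drifting coordinates
lon = np.cumsum(rng.normal(0, 1e-5, n_points))

# One straight-line fit per chunk, per coordinate: two coefficients
# (slope, intercept) replace seg_len raw values.
lat_fits = [np.polyfit(t[s:s + seg_len], lat[s:s + seg_len], deg=1)
            for s in range(0, n_points, seg_len)]
lon_fits = [np.polyfit(t[s:s + seg_len], lon[s:s + seg_len], deg=1)
            for s in range(0, n_points, seg_len)]

raw_values = 2 * n_points                       # every (lat, lon) pair
stored_values = 2 * (len(lat_fits) + len(lon_fits))
print(f"compressed to {stored_values / raw_values:.2%} of original size")
```

With 200 chunks per coordinate, 800 stored coefficients replace 200,000 raw values, about 0.4 percent of the original, the same order of magnitude as the ratios reported in the paper.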
This last point indicates a potential drawback to the approach: because the meaningful portion of the data is approximated via estimation in order to achieve compression, errors are introduced. Analytics based on the compressed data may therefore be less accurate than analytics based on the original information. However, because the estimations are drawn from large data sets, the authors find that the margin of error can be small enough to be an acceptable tradeoff for the compression achieved.
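One way to judge whether the tradeoff is acceptable is simply to measure how far the fitted segments stray from the raw values. A minimal helper, again illustrative and assuming the fixed-length chunks from the sketch above:

```python
import numpy as np

def max_abs_error(original, fits, seg_len):
    """Worst-case gap between raw values and their per-chunk line fits.

    `original` is one coordinate array (e.g. `lat` above) and `fits` is the
    matching list of polyfit coefficients. The paper proves formal bounds on
    this kind of error; this helper just measures it empirically.
    """
    t = np.arange(len(original))
    worst = 0.0
    for i, coeffs in enumerate(fits):
        sl = slice(i * seg_len, min((i + 1) * seg_len, len(original)))
        approx = np.polyval(coeffs, t[sl])  # evaluate slope * t + intercept
        worst = max(worst, float(np.abs(original[sl] - approx).max()))
    return worst

# e.g. max_abs_error(lat, lat_fits, seg_len) on the synthetic track above
```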
What does all this mean for the average business wrestling with taming big data? In the short term, not much. But in the longer run these methods could lead to newer, cheaper approaches to a wide range of big data problems. The authors argue that their methods have “many potential applications in map generation and matching, activity recognition, and analysis of social networks.”
We’ll keep an eye on it, so stay tuned...