The value of big data is no longer a secret to most companies. The main problem with big data, though, is that it’s, well, big. That is, the volume of data that companies would like to understand is so large and arriving so fast that just organizing, manipulating and analyzing it is difficult and sometimes prohibitively expensive. Conventional relational databases often can’t handle today’s big data, or can’t process it in a reasonable amount of time.
The traditional approach to solving this problem has been to come up with more efficient and powerful systems to process larger and larger amounts of data. For example, massively parallel-processing (MPP) databases, distributed file systems and cloud-based infrastructures have all been applied to the problem. Even with these solutions, the size and growth of big data continue to be a challenge.
Several computer science researchers at MIT, however, are taking a new approach to the problem: they’ve come up with a method to, effectively, make big data smaller. In a paper titled The Single Pixel GPS: Learning Big Data Signals from Tiny Coresets, Dan Feldman, Cynthia Sung, and Daniela Rus outline this new approach. The basic idea is to take big data and extract a coreset, defined as “a smart compression of the input signal,” then query and analyze the compressed data. Their compression method also has the benefit of being applicable to data as it’s received, say daily or hourly, in manageable chunks.
Put another way, they take an incoming stream of data and identify patterns via statistical estimation (e.g., regression analysis). They then represent the true (big) data with this much smaller set of approximations, along with a small set of randomly selected data points. The result is a data set that can be managed and analyzed using traditional tools and techniques, and that should provide results similar to those from analyzing the original data. It’s a potentially revolutionary approach that could be applied to a wide range of big data problems.
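To make the idea concrete, here is a minimal Python sketch of this kind of compression. It is not the algorithm from the paper: it simply cuts a chunk of readings into fixed-length segments, fits a least-squares line to each one, and keeps a few randomly chosen original points. The function names, the fixed segment length and the sample size are all assumptions made for illustration.

```python
import random


def fit_line(ts, ys):
    """Least-squares line y = a*t + b through one segment of readings."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    if var_t == 0:  # a single point (or repeated t): fall back to a flat line
        return 0.0, mean_y
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    a = cov / var_t
    return a, mean_y - a * mean_t


def compress_chunk(ts, ys, segment_len=50, sample_size=5, rng=None):
    """Summarize one chunk as fitted line segments plus a few raw points.

    segment_len and sample_size are illustrative defaults, not values
    taken from the paper.
    """
    rng = rng or random.Random(0)
    segments = []
    for start in range(0, len(ts), segment_len):
        seg_t = ts[start:start + segment_len]
        seg_y = ys[start:start + segment_len]
        a, b = fit_line(seg_t, seg_y)
        # Keep only the segment's time range and its two line coefficients.
        segments.append((seg_t[0], seg_t[-1], a, b))
    picked = rng.sample(range(len(ts)), min(sample_size, len(ts)))
    sample = [(ts[i], ys[i]) for i in sorted(picked)]
    return segments, sample


def approx_value(segments, t):
    """Answer a query from the summary instead of the raw readings."""
    for t0, t1, a, b in segments:
        if t0 <= t <= t1:
            return a * t + b
    raise ValueError("t is not covered by any stored segment")


# Synthetic chunk: the signal rises for 500 readings, then slowly falls.
ts = list(range(1000))
ys = [2.0 * t + 1.0 if t < 500 else 1001.0 - 0.5 * (t - 500) for t in ts]

segments, sample = compress_chunk(ts, ys)
print(len(ys), "readings ->", len(segments), "segments +", len(sample), "points")
print(approx_value(segments, 250))
```

Each new chunk, say one per hour, can be compressed on its own and its segments appended to those already stored, so the raw readings never have to be held all at once. Fixed-length segments are the crudest possible choice here; a real coreset construction would choose the segments far more carefully.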