As an example, they tested the approach on the problem of discerning patterns in GPS data. Instead of processing every coordinate collected, they reduced the problem to estimating a set of common routes or paths (the coresets) via linear regression, which could then be used for analysis. They were able to compress a data set of over 2.6 million data points from San Francisco taxis down to between 0.14% and 1.79% of its original size while still preserving the core information in the data. The amount of compression achieved depended on the quality of the approximations to the true data.
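The coreset construction in the research comes with formal error guarantees, but the core intuition — replacing long runs of GPS points with fitted line segments, keeping only the points needed to stay within an error tolerance — can be sketched in a few lines. The greedy strategy and function names below are illustrative assumptions, not the authors' actual algorithm.

```python
import numpy as np

def point_to_segment_dist(p, a, b):
    """Distance from point p to the line segment from a to b."""
    ab = b - a
    denom = np.dot(ab, ab)
    if denom == 0.0:
        return np.linalg.norm(p - a)
    # Project p onto the segment, clamped to its endpoints.
    t = np.clip(np.dot(p - a, ab) / denom, 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def compress_trajectory(points, tol):
    """Greedy piecewise-linear compression: keep a point only when
    skipping it would push some dropped point farther than `tol`
    from the current straight-line segment."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    kept = [0]      # indices of retained points
    anchor = 0      # start of the segment currently being extended
    for i in range(2, len(points)):
        # Worst deviation of any skipped point from segment anchor->i.
        seg_err = max(
            (point_to_segment_dist(points[j], points[anchor], points[i])
             for j in range(anchor + 1, i)),
            default=0.0,
        )
        if seg_err > tol:
            kept.append(i - 1)
            anchor = i - 1
    kept.append(len(points) - 1)
    return points[kept]
```

On a taxi-like trajectory that drives straight, turns a corner, and drives straight again, this keeps only the start, corner, and end points — the same kind of dramatic reduction the researchers report, though without their provable bounds on the introduced error.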
This last point indicates a potential drawback to the approach: approximating the meaningful portion of the data via estimation in order to achieve compression introduces errors. Analytics based on the compressed data may therefore be less accurate than analytics based on the original information. However, because the estimations are drawn from large data sets, the authors find that the margin of error can be small enough to be an acceptable tradeoff for the compression achieved.
What does all this mean for the average business wrestling with taming big data? In the short term, not much. But in the longer run these methods could lead to newer, cheaper approaches to a wide range of big data problems. The authors argue that their methods have “many potential applications in map generation and matching, activity recognition, and analysis of social networks.”
We’ll keep an eye on it, so stay tuned...