With that data, the store can determine buyer behavior, such as what percentage of customers purchase books based on their favorite author.
"We have to decide with analytics on hand how we capture the customer's imagination and how we move forward," he said.
Other companies are using big data analytics to track the use of content on their Web sites in order to better tailor it to users' tastes.
Sondra Russell, a metrics analyst with National Public Radio, said she needed a way to track Web site audience use trends in near real time. NPR offers podcasts, live streams, on-demand streams and other radio content on its Web site. Her organization had been using Web analytics engine Omniture, but she said it felt like trying to jam log-based data into a client-side tracking system that couldn't handle the volume.
Russell said NPR experienced query delays that ran six to 12 hours at best and weeks at worst. The organization finally switched to Splunk's reporting tool, which crawls logs, metrics and other application, server and network data and indexes it all in a searchable repository.
"I just want to know how many times someone listened to a program during a certain period of time," she said. "With Splunk I had no delays between data appearing in a query folder and data appearing in reports. I can get any number of graphs without weeks of prep time."
IBM's Jonas compared big data to puzzle pieces, saying that until you take them to the tabletop and begin assembling them, you don't know what you have. That's where Hadoop, Cassandra and other analytics engines come in. Hadoop is an open-source software framework that pairs a distributed file system with an implementation of Google's MapReduce programming model, which allows large-scale computations (batch processing) to be performed in parallel across large server clusters. The computations can be performed on user-generated or machine-generated data, whether structured or unstructured. But Hadoop works best on unstructured random data sets, allowing analytics engines to more quickly gather information from queries.
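To make the map-and-reduce idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop's API and not any system named in this article; the log format, the program names and every function name are assumptions made up for illustration. It counts "play" events per program, the kind of question Russell describes, by mapping each log line to a key-value pair, grouping the pairs by key, and reducing each group to a total.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log lines in an assumed "<timestamp> <program> <event>" format.
LOG_LINES = [
    "2011-06-01T08:00:00 program-a play",
    "2011-06-01T08:00:05 program-b play",
    "2011-06-01T08:01:00 program-a play",
    "2011-06-01T08:02:00 program-b pause",
]


def map_phase(line):
    """Emit (program, 1) for each play event; other events emit nothing."""
    _timestamp, program, event = line.split()
    if event == "play":
        yield program, 1


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(LOG_LINES))
```

In a real cluster the map and reduce steps would run on many servers at once, each working on its own slice of the data; this sketch only shows the shape of the computation.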
MapReduce systems differ from traditional databases in that they can quickly presort data in a batch process, regardless of the type of data: file or block. They can also interface with any number of languages, including C++, C#, Java, Perl, Python and Ruby. Once the data is sorted, a more specialized analytical application is required to run specific queries. Traditional databases can be considerably slower, requiring table-by-table analysis. They also do not scale nearly as well.
For example, Alfred Spector, vice president of research and special initiatives at Google, said it's not inconceivable that a cluster of servers could someday include 16 million processors, creating one MPP (massively parallel processing) data warehouse.