Instead of creating a clean subset of user data to place in a data warehouse, where it can be queried in only a limited number of predetermined ways, big data software just collects all the data an organization generates and lets administrators and analysts worry about how to use it later. In this sense, these systems are more scalable than traditional databases and data warehouses.
How the Internet spurred big data
In many ways, the giant online service providers such as Google, Amazon, Yahoo, Facebook and Twitter have been on the cutting edge of learning how to make the most of such large data sets. Google and Yahoo, among others, had a hand in developing Hadoop. Facebook engineers first developed the Apache Cassandra distributed database, also open source.
Hadoop got its start from a 2004 Google white paper that described the infrastructure Google built to analyze data across many different servers, using a programming model called MapReduce. Google kept its own implementation for internal use, but Doug Cutting, a developer who had already created the Lucene open source search engine, created an open source version, naming the technology after his son's stuffed elephant.
One early adopter of Hadoop was Yahoo. The company hired Cutting and started dedicating large amounts of engineering work to refining the technology around 2006. "Yahoo had lots of interesting data across the company that could be correlated in various ways, but it existed in separated systems," said Cutting, who now works for Hadoop distribution provider Cloudera.
Yahoo is now one of Hadoop's biggest users, deploying it on more than 40,000 servers. The company uses the technology in a variety of ways. Hadoop clusters hold massive log files of what stories and sections users click on. Advertisement activity is also stored on Hadoop clusters, as are listings of all the content and articles Yahoo publishes.
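To make the idea concrete, here is a rough sketch of how click logs like these might be condensed into per-section totals using the map and reduce steps that Hadoop jobs are built around. It is plain Python running in a single process, not Hadoop's actual API, and the tab-separated log format (user, section, story) and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

def map_clicks(log_line):
    """Map step: emit a (section, 1) pair for one log line.

    Assumes a hypothetical tab-separated format: user, section, story.
    """
    _user, section, _story = log_line.split("\t")
    yield section, 1

def shuffle(pairs):
    """Group emitted values by key, standing in for what the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: condense all values for one key into a single total."""
    return key, sum(values)

def run_job(log_lines):
    """Run map, shuffle and reduce in memory over an iterable of log lines."""
    pairs = (pair for line in log_lines for pair in map_clicks(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

# Made-up sample data, not real Yahoo logs.
SAMPLE_LOG = [
    "u1\tsports\tstory-17",
    "u2\tfinance\tstory-02",
    "u1\tsports\tstory-21",
]

if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

On a real cluster the map and reduce functions would run in parallel on many servers, each working on its own slice of the log files; the sketch only shows the shape of the computation.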
"Hadoop is a great tool for organizing and condensing large amounts of data before it is put into a relational database," Monash said. The technology is particularly well suited for searching for patterns across large sets of text.
Another big data technology that got its start at an online service provider was the Cassandra database. Cassandra is able to store 2 million columns in a single row, making it handy for appending more data onto existing user accounts, without knowing ahead of time how the data should be formatted.
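The wide-row idea can be pictured with a toy model in which each row is simply an open-ended set of named columns. The sketch below is an in-memory illustration only, not Cassandra's API or storage format; the class name, method names and sample keys are all hypothetical.

```python
class WideRowStore:
    """Toy model of a wide-row store: each row key maps to an open-ended
    set of columns, and no schema is declared up front."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Add or overwrite one column on a row, creating the row if needed."""
        self._rows.setdefault(row_key, {})[column] = value

    def get_row(self, row_key):
        """Return the row's columns ordered by column name
        (an empty dict if the row does not exist)."""
        return dict(sorted(self._rows.get(row_key, {}).items()))

if __name__ == "__main__":
    store = WideRowStore()
    store.put("user:42", "email", "a@example.com")
    # A column nobody planned for can be appended to the same account later.
    store.put("user:42", "last_login", "2011-08-01")
    print(store.get_row("user:42"))
```

The point of the design is that a new kind of data about an account becomes just another column on the existing row, with no table alteration required first.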
A Cassandra database can also be spread across multiple servers, which helps organizations scale their databases easily beyond a single server or even a small cluster of servers.
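One common way to spread rows across servers is to hash each row key onto a ring of server tokens, so that every key has a predictable home and adding a server moves only a fraction of the data. The following is a minimal sketch of that technique, with one token per server and no replication; Cassandra's real partitioning is considerably more involved, and the class, function and node names here are hypothetical.

```python
import bisect
import hashlib

def _token(value):
    """Hash a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: one token per node, no replication."""

    def __init__(self, nodes):
        self._ring = sorted((_token(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, row_key):
        """Return the node owning row_key: the first node whose token is
        greater than the key's token, wrapping around the ring."""
        idx = bisect.bisect_right(self._tokens, _token(row_key)) % len(self._ring)
        return self._ring[idx][1]

if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    for key in ("user:1", "user:2", "user:3"):
        print(key, "->", ring.node_for(key))
```

Because a key's owner depends only on its position on the ring, adding a fourth server takes over one slice of the ring and leaves every other key where it was, which is what makes growing such a cluster comparatively painless.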