Many people associate the open source data framework Hadoop with managing truly massive amounts of data. And with good reason: Hadoop storage is used by Facebook and Yahoo, which many people (rightly) associate with massive data. As you learned in Part 1 of this series, Yahoo, an early adopter of and contributor to Hadoop, has implemented a 50,000-node Hadoop network; Facebook has a Hadoop system with more than 10,000 nodes in place.
So, there's the big in "big data" for you.
But Arun Murthy, VP, Apache Hadoop at the Apache Software Foundation and architect at Hortonworks, Inc., paints a different picture of Hadoop and its use in the enterprise. For Murthy, Hadoop's use goes far beyond big data. One of Hadoop's strongest capabilities is its ability to scale. Yahoo and Facebook are excellent examples of how Hadoop can scale up, but little is usually said about how Hadoop can scale the other way and provide analytic decision-making data for businesses of any size.
All data created equal
Data storage, Murthy explained, used to be expensive. As recently as five years ago, enterprises and SMBs found themselves having to keep track of an exploding array of datasets: e-mails, search results, sales data, inventory data, customer data, click-throughs on Web sites ... all of this and more might be coming in, and trying to manage it in a relational database management system (RDBMS) was a very expensive proposition.
With all of these events and signals coming in, an organization trying to keep costs down and data management sane would typically sample that data down to a smaller subset. This downsampled data, which Murthy calls "historical data," would automatically be classified based on certain assumptions -- the number one assumption being that some data would always be more important than other data.
For example, the priorities for e-commerce data would be set on the (reasonable) assumption that credit card data would be more important than product data, which in turn would be more important than click-through data.
If you were trying to run a business model based on one given set of assumptions, then it wouldn't be hard to pull information out to make decisions for the business. But the information would always be predicated on those assumptions; what would happen when the assumptions changed? Because data was downsampled, any new business scenario would have to use the sanitized data still in storage -- all the raw data would be long gone. And, because of the expense of RDBMS-based storage, often this data would be siloed within an organization. Sales would have their data, marketing would have theirs, accounting their own, and so on. So business-model decisions would be limited to whichever part of the organization was being examined -- not the organization as a whole.
"With Hadoop," Murthy argued, "there are no assumptions, because you keep all of the data."
This is perhaps the biggest benefit of Hadoop, though it often lurks in the background, behind the notion of Hadoop's low financial costs. (More on those in a moment.) "Downsampling makes the assumption that some data is going to be bigger and more important than other data," Murthy explained. "In Hadoop, all data has equal value."
Because all data is equal -- and equally available -- business scenarios can be run with raw data at any time, without limitation. Moreover, formerly siloed data can be equally accessed and shared for more holistic analysis of an organization's business.
This shift in how data can be perceived is huge, because now there is no such thing as historical data. Moreover, because data can be stored as is, much of the data management overhead associated with such things as extract, transform, and load operations will be reduced.
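To make the contrast between downsampled and raw data concrete, here is a minimal Python sketch. It is not Hadoop code: the event records, field names, and helper functions are all hypothetical, invented only to echo the e-commerce example above. It shows how a question that was not anticipated when the data was downsampled can no longer be answered from the reduced set, while the raw data still answers it.

```python
from collections import Counter

# Hypothetical raw event log; the event types echo the e-commerce example.
RAW_EVENTS = [
    {"type": "credit_card", "user": "a", "amount": 40},
    {"type": "product_view", "user": "a", "item": "lamp"},
    {"type": "click", "user": "a", "page": "/lamps"},
    {"type": "click", "user": "b", "page": "/lamps"},
    {"type": "click", "user": "b", "page": "/chairs"},
    {"type": "credit_card", "user": "b", "amount": 15},
]


def downsample(events, keep_types):
    """Keep only the event types assumed to matter; discard the rest."""
    return [e for e in events if e["type"] in keep_types]


def clicks_per_page(events):
    """A question nobody planned for: which pages draw the most clicks?"""
    return Counter(e["page"] for e in events if e["type"] == "click")


# Assumption baked in at storage time: payments matter, clicks do not.
historical = downsample(RAW_EVENTS, keep_types={"credit_card"})

print(clicks_per_page(historical))  # Counter()
print(clicks_per_page(RAW_EVENTS))  # Counter({'/lamps': 2, '/chairs': 1})
```

The downsampled set returns nothing because the click events were thrown away under the original assumption; the raw set can still be queried when the assumption changes.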