Big data is a concept that's been as widely hyped as cloud computing, and perhaps just as misunderstood in regards to its capabilities and limitations. One of the aspects of big data that is not clearly understood is how existing databases can be used with data storage engines that are non-relational in nature. What's involved in moving data from a relational database management system (RDBMS) to distributed systems? And, perhaps more of interest to IT staff, what's the best way to learn about these big data systems, to determine the best way to use them in an organization?
Currently, the most popular example of a non-relational database management system (NDBMS) is probably Apache Hadoop, a distributed data framework that seems to be the poster child for big data and so-called NoSQL databases. But even those descriptions screen the true nature of Hadoop and how it works. What is Hadoop, really, and how can businesses and IT staffers start using it? Which businesses should use Hadoop and where can you find resources for implementing it?
These are some of the questions that will be answered in this three-part series, "The Road to Apache Hadoop." In part one, we'll examine how Hadoop is put together. Understanding how Hadoop generally works will give a more accurate picture of the skills DBAs and data analysts need to work with the Hadoop framework. And knowing the players in the Hadoop ecosystem will reveal some great sources for Hadoop training.
What Hadoop isn't
There are two elements that need to be cleared up about Hadoop right away: it's not a system that needs to be used exclusively with "big data," and it isn't really a NoSQL tool.
While it is true that Hadoop belongs to the non-relational class of data management systems, that doesn't preclude the use of the SQL query control language. NoSQL isn't even that; it's just a way to describe databases where SQL is not necessarily the only query system that can be used. In fact, SQL-like queries can be used fairly easily with Hadoop.
Then there's this notion of big data. Many people associate Hadoop with managing truly massive amounts of data. And with good reason: Hadoop storage is used by Facebook and Yahoo, which many people (rightly) associate with huge data sets. But Hadoop's use goes far beyond that of big data. One of Hadoop's strongest capabilities is its ability to scale, which puts it up in the big leagues with Yahoo and Facebook but also enables it to scale down to any company that needs inexpensive storage and data management.
To understand this broad capability of scale and the implications of scaling, it's important to understand how Hadoop works.
What Hadoop is
Arun Murthy is a man who knows Hadoop. As VP, Apache Hadoop at the Apache Software Foundation, he's the current leader of the project. Moreover, Murthy has been involved with Hadoop from its early days, when Yahoo adapted Google's open source data framework to their own purposes after it was invented by Doug Cutting to take advantage of Google's MapReduce data programming framework.
To say Yahoo is a big player in the Hadoop space is an understatement on many levels. Cutting, Murthy, and many early Hadoop contributors worked at Yahoo in the last decade. (Cutting now works with Cloudera, a commercial Hadoop vendor launched in 2009, and Murthy went on to co-found Hortonworks, Inc., in June of 2011 with several others on Yahoo's Hadoop team, including Eric Baldeschwieler, now Hortonworks' CEO.) Yahoo is also the largest user of Hadoop to date, having deployed a staggering 50,000-node Hadoop network.
With these credentials in mind, Murthy seemed the best person to ask about what Hadoop is and how it's put together.
As he explained it to me, the framework known as Hadoop can be composed of several components, but the big two are the aforementioned MapReduce data processing framework and a distributed filesystem for data storage, usually something like the Hadoop Distributed Filesystem (HDFS).
HDFS is, in many ways, the simplest Hadoop component to conceptualize (though not always the easiest to manage). It is pretty much what it's called: a distributed filesystem that shoves data onto any machine connected to the Hadoop network. Of course, there's a system to this, and it's not just all willy-nilly, but compared to the highly regimented storage infrastructure of a RDBMS, it's practically a pigsty.
Indeed, it's this flexibility that brings a lot of the value to the Hadoop proposition. While an RDBMS often needs finely tuned and dedicated machines, a Hadoop system can take advantage of commoditized servers with few good hard drives on board. Instead of dealing with the huge management overhead involved with storing data in relational database tables, Hadoop uses HDFS to store data across multiple machines and drives, automatically making data redundant on multiple-node Hadoop systems. If one node fails or slows down, the data is still accessible.