When is Hadoop no longer Hadoop?

HDFS, MapReduce replacements create identity crisis

The Hadoop elephant mascot may soon need to be replaced by a mythical chimera, as many technologies are stepping in to replace the two main components of Hadoop: the Hadoop Distributed File System and MapReduce.

It's become a serious question lately: when is Hadoop no longer Hadoop?

Both HDFS and MapReduce are considered to be well-known elements of Hadoop, with HDFS acting as the straightforward storage component and MapReduce as the Java-based batch analytical processor for data stored in Hadoop. But each one of those components can be swapped out for something better.

Just this week, data storage vendor Cleversafe announced the release of a new API that will enable Hadoop clusters to use Cleversafe's scalable object-based Dispersed Storage System.

A distributed storage system like this instead of HDFS will mean some strong improvements for Hadoop. One Hadoop limitation is the single name-node architecture of any given Hadoop cluster. In any such cluster, there is just one name node machine that (analogous to a file allocation table on a single hard drive) uses metadata to track where data is actually sitting on the data nodes in the cluster.
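To make the file-allocation-table analogy concrete, here is a minimal Python sketch of the bookkeeping a single name node does. The class and method names (ToyNameNode, add_file, locate) are hypothetical and are not Hadoop APIs; real HDFS is written in Java and does far more than this.

```python
from collections import defaultdict


class ToyNameNode:
    """Illustrative only: tracks which data nodes hold each block of a file."""

    def __init__(self):
        # file path -> ordered list of block ids
        self.file_blocks = {}
        # block id -> set of data node names holding a copy
        self.block_locations = defaultdict(set)

    def add_file(self, path, placements):
        """placements: list of (block_id, [data_node, ...]) in file order."""
        self.file_blocks[path] = [block_id for block_id, _ in placements]
        for block_id, nodes in placements:
            self.block_locations[block_id].update(nodes)

    def locate(self, path):
        """Return [(block_id, sorted data node names)] for a file."""
        return [(b, sorted(self.block_locations[b]))
                for b in self.file_blocks[path]]


# Hypothetical file, block ids and data node names, for illustration.
name_node = ToyNameNode()
name_node.add_file("/logs/day1", [("blk_1", ["dn1", "dn2", "dn3"]),
                                  ("blk_2", ["dn2", "dn3", "dn4"])])
print(name_node.locate("/logs/day1"))
```

Every read starts with a lookup like locate(). The data nodes hold the bytes, but only this one machine knows which bytes belong to which file.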

This setup immediately causes two potential problems. First, there's the single point of failure issue: if your name node goes bye-bye, so does your cluster (unless you've built in some failover configurations). Then there's the hard ceiling on metadata storage. Eventually, that single name node machine is going to fill up with metadata, which sets a limit on how much data a cluster can have.
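A rough back-of-the-envelope sketch shows how the metadata ceiling turns into a file-count limit. It assumes the name node keeps its metadata in memory and uses a commonly quoted rule of thumb of about 150 bytes per file, directory or block; both the per-object figure and the 64 GiB heap below are assumptions for illustration, not measurements.

```python
def max_files(heap_bytes, bytes_per_object=150, blocks_per_file=1):
    """Estimate how many files fit in name node memory.

    Each file costs one object for the file entry plus one per block.
    bytes_per_object is an assumed rule-of-thumb figure, not a spec value.
    """
    objects_per_file = 1 + blocks_per_file
    return heap_bytes // (bytes_per_object * objects_per_file)


# Assumed 64 GiB of memory devoted entirely to metadata.
heap = 64 * 1024**3
print(max_files(heap))                      # single-block files
print(max_files(heap, blocks_per_file=10))  # larger files, more blocks each
```

Whatever the real constants are, the shape of the result is the same: the limit is set by one machine's memory, not by how many data nodes you add.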

Since the single name node means Hadoop clusters can't scale up indefinitely, these problems are traditionally solved by scaling out. Add more clusters, the logic goes, and you build in failovers and more room for storage.

Cleversafe's distributed system neatly avoids the single name node issue, and delivers another big improvement over HDFS: it doesn't require the 3X replication of data that HDFS does.
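To see why dropping 3X replication matters, compare the raw disk needed under each approach. This sketch assumes a k-of-n dispersal scheme, where data is cut into n slices and any k of them are enough to rebuild it. The 10-of-16 figures are hypothetical and are not Cleversafe's published configuration.

```python
def raw_bytes_replication(usable_bytes, copies=3):
    """Raw storage consumed when every byte is stored `copies` times."""
    return usable_bytes * copies


def raw_bytes_dispersal(usable_bytes, k, n):
    """Raw storage for a k-of-n scheme: n slices, each 1/k of the data."""
    return usable_bytes * n / k


usable_tb = 100  # terabytes of actual data
print(raw_bytes_replication(usable_tb))        # 3X replication
print(raw_bytes_dispersal(usable_tb, 10, 16))  # assumed 10-of-16 dispersal
```

Under these assumed parameters the dispersed layout needs 1.6 times the usable capacity rather than 3 times, while still surviving the loss of several slices.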

Cleversafe is not the only player in the Hadoop chimera game: DataStax Brisk replaces HDFS with Cassandra's CassandraFS (and throws in the data query and analysis capabilities of Hive for good measure).

And the MapReduce element of Hadoop is not sacrosanct: Concurrent's Cascading 2.0 was recently released as a bigger and better MapReduce alternative.

Even MapReduce itself is getting an overhaul: the next release of MapReduce (known as YARN) will introduce a much more robust architecture in which each application gets its own ApplicationMaster. That should allow greater flexibility in the types of applications the framework can work with.
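The per-application idea can be sketched in a few lines of Python. This is a toy model, not the YARN API: ToyResourceManager, ToyApplicationMaster and the notion of counting "containers" as plain integers are all simplifications invented for illustration.

```python
class ToyResourceManager:
    """Owns the cluster's capacity; knows nothing about job logic."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.masters = {}

    def submit(self, app_name, kind):
        # Each submitted application gets its own master object.
        master = ToyApplicationMaster(app_name, kind, self)
        self.masters[app_name] = master
        return master

    def allocate(self, wanted):
        # Grant as many containers as are still free.
        granted = min(wanted, self.free)
        self.free -= granted
        return granted


class ToyApplicationMaster:
    """Negotiates resources for exactly one application."""

    def __init__(self, app_name, kind, resource_manager):
        self.app_name = app_name
        self.kind = kind
        self.resource_manager = resource_manager
        self.containers = 0

    def request(self, wanted):
        self.containers += self.resource_manager.allocate(wanted)
        return self.containers


rm = ToyResourceManager(total_containers=10)
mr = rm.submit("wordcount", kind="mapreduce")
other = rm.submit("graph-job", kind="non-mapreduce")
mr.request(6)
other.request(6)
print(mr.containers, other.containers, rm.free)
```

Because the job-specific logic lives in each master rather than in the central manager, a second kind of application can share the cluster without the manager having to understand it.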

With all of the swapping of pieces and parts, it's going to be interesting to see what really defines Hadoop.

We may see the term "Hadoop" shifting from one specific software application to a description of a broader infrastructure that combines inexpensive, unstructured data storage with some form of analytical processor.

Of course, the Apache Software Foundation, which owns the trademark for Apache Hadoop, may have something to say about that. Which swings me around to the initial question: when is Hadoop no longer Hadoop?

Read more of Brian Proffitt's Open for Discussion blog and follow the latest IT news at ITworld. Drop Brian a line or follow Brian on Twitter at @TheTechScribe.
