YARN expands Hadoop's big data wardrobe

The next step in MapReduce's evolution

Hadoop has been the undisputed king of big data for some time, but while the emperor has some clothes, Hadoop's limitations have made the outfit skimpier than it needed to be. Now Hadoop developers are building a new processing system, known as YARN, to knit the emperor some better clothes.

YARN was recently upgraded to a full-fledged sub-project within Apache Hadoop. In the context of the Apache Software Foundation, that's kind of a big deal: previously, YARN was worked on under the MapReduce sub-project, so moving it out to a separate project with its own focus is regarded in the Hadoop community as a significant step.

Explaining YARN can be tricky. Recall that Hadoop is made up of two key elements: the Hadoop Distributed File System and the MapReduce processing engine. It's MapReduce that gives Hadoop much of its mojo for processing big data. The Map part divides a computing job into defined pieces and ships those pieces out to the machines in the cluster where the needed data is stored. Once each piece runs, its result set is Reduced back to a central point, combined with all the other result sets from the cluster's machines.
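The Map-then-Reduce flow described above can be sketched in a few lines of Python. This is a conceptual toy, not the actual Hadoop API (real jobs are written in Java against Hadoop's classes); the function names and the in-memory "shuffle" step are invented here purely to illustrate the idea:

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: each "machine" emits (word, 1) pairs for its local chunk of data
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs_lists):
    # Shuffle: group intermediate values by key across all mappers
    grouped = defaultdict(list)
    for pairs in pairs_lists:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine each key's values into a final result
    return {word: sum(counts) for word, counts in grouped.items()}

# Two chunks stand in for data stored on two different cluster machines
chunks = ["big data is big", "data is everywhere"]
mapped = [map_phase(chunk) for chunk in chunks]
result = reduce_phase(shuffle(mapped))
print(result["big"])   # 2
print(result["data"])  # 2
```

The point of the sketch is the shape of the computation: the Map work happens where each chunk of data lives, and only the much smaller intermediate results travel across the cluster to be combined.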

Keeping the data where it lives is why Hadoop is relatively fast and so scalable. Any time you have to move data around a networked cluster, your time factor shoots way up. And because the machine where the data is stored also handles the computation of the MapReduce jobs, processing capacity scales hand in hand with each new machine you add to the cluster for storage.

But there are still those pesky limitations. To work effectively, Hadoop operates in batch processing mode: you set up a MapReduce job (or a series of jobs) and let them run, one batch at a time. That's fast for the amount of data we're talking about, but nowhere near real-time speed. Plus, as nearly anyone who's worked with it can tell you, writing MapReduce jobs (which are written in Java) is a pain in the tuchus.

YARN is the next step in the evolution of Hadoop that will address some of those limitations.

Apache Hadoop project leader Arun Murthy does a good job explaining how this next set of steps will work. The co-founder of Hortonworks, one of the vendors that works to package and deliver Hadoop as a commercial offering (much like Red Hat and SUSE do with Linux), highlighted the significance of the move of YARN to sub-project status.

"This is a signal from the Apache Hadoop community that we can support other than MapReduce apps in Hadoop," Murthy said in a recent interview.

What YARN will do, essentially, is divide the functionality of MapReduce even further, breaking the two major responsibilities of Hadoop's JobTracker--resource management and job scheduling/monitoring--into separate daemons: a global ResourceManager and a per-application ApplicationMaster.

"The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system," Murthy wrote in a recent Hortonworks blog. "The per-application ApplicationMaster is, in effect, a framework specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks."

In English, splitting these functions up yields a much more robust way of managing a Hadoop cluster's resources than the current MapReduce system can offer. Resources get managed much the way an operating system manages jobs.
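That division of labor can be pictured with a small sketch. This is not the real YARN API -- the class names, the notion of a fixed pool of "containers," and the application names below are all invented for illustration -- but it shows the arbiter-versus-negotiator split Murthy describes:

```python
class ResourceManager:
    """Toy global arbiter: hands out containers from a fixed cluster capacity."""
    def __init__(self, total_containers):
        self.available = total_containers

    def allocate(self, requested):
        # Grant as many containers as are free, never more than requested
        granted = min(requested, self.available)
        self.available -= granted
        return granted

    def release(self, count):
        self.available += count

class ApplicationMaster:
    """Toy per-application master: negotiates containers for its own tasks."""
    def __init__(self, name, tasks):
        self.name = name
        self.pending = tasks

    def run(self, rm):
        granted = rm.allocate(self.pending)
        self.pending -= granted  # pretend the granted tasks run to completion
        rm.release(granted)      # hand the containers back when done
        return granted

rm = ResourceManager(total_containers=4)
mapreduce_app = ApplicationMaster("mapreduce-job", tasks=3)
streaming_app = ApplicationMaster("storm-topology", tasks=3)
print(mapreduce_app.run(rm))  # 3 containers granted
print(streaming_app.run(rm))  # containers were returned, so 3 granted again
```

The key design point: the ResourceManager knows nothing about what the applications do; it only arbitrates capacity. Each ApplicationMaster carries the framework-specific logic, which is what lets non-MapReduce engines share the same cluster.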

This will also give developers a more flexible way to plug into Hadoop, building tools that can work with data in ways that the more rigid and limited MapReduce could not.

Murthy was quick to emphasize that none of these improvements from YARN will break existing MapReduce jobs for Hadoop end users. YARN is very much based on the original MapReduce architecture, and compatibility is ensured for future versions of Hadoop.

Already there are open source projects working with YARN to fulfill certain needs in the big data community. Murthy mentioned Storm, an open source project using YARN to build a "Hadoop in real-time" platform. Apache S4, a distributed stream computing platform, is also working with YARN tech.

If YARN actually fulfills the expectations of its creators, it could offer the Hadoop community a huge capability to be more creative in working with data. In terms of functionality, it's akin to jumping from a great scientific calculator to a high-end computing cluster.

YARN can't make Hadoop all things to all big data needs, but it can certainly give Hadoop more outfits to wear as a big data solution.

Read more of Brian Proffitt's Open for Discussion blog and follow the latest IT news at ITworld. Drop Brian a line or follow Brian on Twitter at @TheTechScribe. For the latest IT news, analysis and how-tos, follow ITworld on Twitter and Facebook.
