Sizing up the Hadoop ecosystem


Hadoop

Hadoop has a vast and vibrant developer community, but many projects in the Hadoop ecosystem have names that give no hint of their function, which makes it hard to figure out what each project does or is used for.

At its core, Hadoop is made up of just HDFS and MapReduce. But people and companies have contributed projects that make Hadoop a more complete platform. Some of these ecosystem projects are Apache Software Foundation projects (denoted by “(A)” below), while others are Apache-licensed projects run by a company (“(AL)” below).

Here's a cheat sheet to help you keep track of Hadoop developments.  

HDFS

What it does: Acts as the file system or storage for Hadoop.

How it helps: Provides a replicated, fault-tolerant, scalable file system that can handle huge files. Improves the data-input performance of MapReduce jobs through data locality.
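To get a feel for it, here's what working with HDFS looks like from the command line (the paths and file names are made up for illustration, and flags vary slightly between Hadoop versions):

    hadoop fs -mkdir -p /data/logs         # create a directory in HDFS
    hadoop fs -put access.log /data/logs/  # copy a local file into the cluster
    hadoop fs -ls /data/logs               # list it; the file is now replicated across nodes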

Cassandra (A)

What it does: A highly scalable database.

How it helps: Lets you scale a database linearly by adding nodes. Gives a tunable level of data consistency per operation.
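As a rough sketch, a cqlsh session might look like this (the table is invented, and a keyspace is assumed to be already created and selected; CONSISTENCY is a cqlsh command, and QUORUM asks a majority of replicas to acknowledge each operation):

    CREATE TABLE users (
      user_id uuid PRIMARY KEY,
      name    text,
      email   text
    );

    CONSISTENCY QUORUM;  -- tunable: require a majority of replicas

    INSERT INTO users (user_id, name, email)
    VALUES (550e8400-e29b-41d4-a716-446655440000, 'Ada', 'ada@example.com');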

HBase (A)

What it does: Uses HDFS to create a highly scalable database.

How it helps: Allows high scalability and random reads and writes. Gives strong data consistency, with the data stored durably on HDFS.
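A quick sketch in the HBase shell (the table and column family names are made up):

    create 'users', 'info'                   # a table with one column family
    put 'users', 'row1', 'info:name', 'Ada'  # random write to a single cell
    get 'users', 'row1'                      # random read of that row
    scan 'users'                             # scan the whole table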

ZooKeeper (A)

What it does: Provides synchronization of data among distributed nodes.

How it helps: Lets a cluster keep small amounts of shared data (configuration, status, and the like) consistent across all of its nodes.
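A minimal sketch with the ZooKeeper Java client, assuming an ensemble at zk1.example.com:2181 (the znode path and payload are made up, and connection and retry handling are omitted):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SharedConfig {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble (the host:port is a placeholder).
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000, event -> {});

            // Write a small piece of cluster-wide state into a znode.
            zk.create("/app-config", "maxWorkers=8".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any node in the cluster reads back the same, consistent value.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }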

MapReduce

What it does: Breaks up a job into multiple tasks and processes them simultaneously.

How it helps: The framework abstracts away the difficult pieces of distributed programming, and allows vast quantities of data to be processed in parallel.
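The canonical example is word count. Here's a condensed sketch of the two pieces you write, with the framework handling input splitting, shuffling, sorting, and retries:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: runs in parallel over splits of the input, emitting (word, 1).
    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: receives every count for one word and sums them.
    class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }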

Hive (A)

What it does: Lets you use a SQL-like query language to process data.

How it helps: Helps SQL programmers harness MapReduce by writing HiveQL queries, which Hive compiles into MapReduce jobs.
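For example, a top-ten-URLs report might look like this in HiveQL (the pageviews table and its layout are invented for illustration):

    CREATE EXTERNAL TABLE pageviews (url STRING, user_id STRING, ts BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/pageviews';

    SELECT url, COUNT(*) AS hits
    FROM pageviews
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;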

Impala (AL)

What it does: Allows low-latency queries on large amounts of data.

How it helps: Helps SQL programmers get answers from Big Data faster: queries run through Impala's own daemons rather than as MapReduce jobs.
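The same sort of query runs through the impala-shell client, for instance (reusing the hypothetical table from the Hive example):

    impala-shell -q "SELECT url, COUNT(*) AS hits FROM pageviews
                     GROUP BY url ORDER BY hits DESC LIMIT 10"

Because no MapReduce job is launched, results come back at interactive speed.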

Pig (A)

What it does: Processes data using a data flow or script-like language.

How it helps: Helps programmers use a data flow language to harness MapReduce power.
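The same report reads as a step-by-step data flow in Pig Latin (the path and schema are again invented):

    views   = LOAD '/data/pageviews'
              AS (url:chararray, user_id:chararray, ts:long);
    by_url  = GROUP views BY url;
    hits    = FOREACH by_url GENERATE group AS url, COUNT(views) AS n;
    ordered = ORDER hits BY n DESC;
    top10   = LIMIT ordered 10;
    DUMP top10;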

Mahout (A)

What it does: Uses a prewritten library to run machine learning algorithms on MapReduce.

How it helps: Allows you to use a library to create recommendations and clusters with MapReduce. Speeds up development time by using existing code.
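A sketch using Mahout's recommender ("Taste") API, assuming a hypothetical ratings.csv file of userID,itemID,preference lines:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class RecommenderExample {
        public static void main(String[] args) throws Exception {
            // Load user/item preferences from a CSV file.
            DataModel model = new FileDataModel(new File("ratings.csv"));

            // Item-based recommender with log-likelihood similarity.
            GenericItemBasedRecommender rec = new GenericItemBasedRecommender(
                    model, new LogLikelihoodSimilarity(model));

            // Top 3 item recommendations for user 42.
            List<RecommendedItem> top = rec.recommend(42L, 3);
            for (RecommendedItem item : top) {
                System.out.println(item.getItemID() + " " + item.getValue());
            }
        }
    }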

Giraph (A)

What it does: Uses a prewritten library to run graph algorithms on MapReduce.

How it helps: Prevents you from having to rewrite graph algorithms to use MapReduce. Speeds up development time by using existing code.
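A rough sketch of what a Giraph computation looks like: you write one compute() method per vertex, and Giraph runs it superstep by superstep. The signatures follow the Giraph 1.x API and may differ by version; this example propagates the maximum value through a graph:

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    // Each vertex keeps the largest value it has seen and forwards increases.
    public class MaxValueComputation extends BasicComputation<
            LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

        @Override
        public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                            Iterable<DoubleWritable> messages) {
            double max = vertex.getValue().get();
            for (DoubleWritable m : messages) {
                max = Math.max(max, m.get());
            }
            if (getSuperstep() == 0 || max > vertex.getValue().get()) {
                vertex.setValue(new DoubleWritable(max));
                sendMessageToAllEdges(vertex, new DoubleWritable(max));
            }
            vertex.voteToHalt();  // sleeps until a new message arrives
        }
    }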

MRUnit (A)

What it does: Runs tests to verify that your MapReduce job functions correctly.

How it helps: Lets you run programmatic tests to verify that a MapReduce program behaves correctly. Provides objects for mocking up inputs, and assertions for verifying the results.
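A sketch of an MRUnit test for the hypothetical WordMapper from the MapReduce example above:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class WordMapperTest {
        @Test
        public void emitsOneCountPerWord() throws Exception {
            MapDriver<LongWritable, Text, Text, IntWritable> driver =
                    MapDriver.newMapDriver(new WordMapper());

            // Mock up one input line and assert on the expected (word, 1) pairs.
            driver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
                  .withOutput(new Text("hadoop"), new IntWritable(1))
                  .withOutput(new Text("hadoop"), new IntWritable(1))
                  .runTest();
        }
    }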

Avro (A)

What it does: Gives an easy method to input and output data from MapReduce jobs.

How it helps: Creates domain objects to store data. Makes data serialization and deserialization easier for MapReduce jobs.
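For instance, a small Avro schema (an invented .avsc file) from which Avro can generate the domain objects:

    {
      "type": "record",
      "name": "PageView",
      "namespace": "com.example",
      "fields": [
        {"name": "url",     "type": "string"},
        {"name": "user_id", "type": "string"},
        {"name": "ts",      "type": "long"}
      ]
    }

From a schema like this, the avro-tools jar can generate Java classes, and Avro's MapReduce input and output formats handle reading and writing the records.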

Sqoop (A)

What it does: Moves data between relational databases and Hadoop.

How it helps: Lets you dump data from a relational database into Hadoop for later processing, and export the output of a MapReduce job back into a relational database.
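In practice that looks like a pair of commands (the connection string, tables, and paths are placeholders):

    # pull a table out of MySQL into HDFS
    sqoop import --connect jdbc:mysql://db.example.com/shop \
      --username etl -P --table orders --target-dir /data/orders

    # push MapReduce output back into a relational table
    sqoop export --connect jdbc:mysql://db.example.com/shop \
      --username etl -P --table order_totals --export-dir /data/order_totals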

Flume (A)

What it does: Handles large amounts of log data in a scalable fashion.

How it helps: Moves large amounts of log data into HDFS. Because Flume scales horizontally, it can keep up with heavy volumes of incoming data.
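A Flume agent is wired together in a properties file. A minimal sketch that tails a log file into HDFS through an in-memory channel (the names and paths are made up):

    agent1.sources  = tail1
    agent1.channels = mem1
    agent1.sinks    = hdfs1

    agent1.sources.tail1.type     = exec
    agent1.sources.tail1.command  = tail -F /var/log/app/access.log
    agent1.sources.tail1.channels = mem1

    agent1.channels.mem1.type = memory

    agent1.sinks.hdfs1.type      = hdfs
    agent1.sinks.hdfs1.hdfs.path = /flume/access
    agent1.sinks.hdfs1.channel   = mem1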

Hue (AL)

What it does: Allows users to interact with the Hadoop cluster over a web browser.

How it helps: Makes it easier for users to interact with the Hadoop cluster. Granular permissions allow administrators to configure what each user can see and do.

Oozie (A)

What it does: Makes complex, multi-step workflows in Hadoop easier to create.

How it helps: Allows you to create a complex workflow that leverages other projects like Hive, Pig and MapReduce. Built-in logic allows users to handle failures of steps gracefully.
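A stripped-down sketch of an Oozie workflow.xml with one Pig step and explicit success and failure transitions (the workflow name and script are placeholders):

    <workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.4">
      <start to="clean"/>
      <action name="clean">
        <pig>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>clean.pig</script>
        </pig>
        <ok to="end"/>      <!-- step succeeded: continue -->
        <error to="fail"/>  <!-- step failed: handle it gracefully -->
      </action>
      <kill name="fail">
        <message>The clean step failed.</message>
      </kill>
      <end name="end"/>
    </workflow-app>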
