Hadoop subprojects include MapReduce, which is a software framework for processing large data sets on compute clusters; HDFS (Hadoop Distributed File System), which provides high-throughput access to application data; and Common, which offers utilities to support other Hadoop subprojects. Movie rental service Netflix has begun using Apache ZooKeeper, a Hadoop-related technology for configuration management. "We use it for all kinds of things: distributed locks, some queuing, and leader election" for prioritizing service activity, says Jordan Zimmerman, a senior platform engineer at Netflix. "We open-sourced a client for ZooKeeper that I wrote called Curator"; the client serves as a library for developers to connect to ZooKeeper.
The Tagged social network is using Hadoop technology for data analytics, processing about half a terabyte of new data daily, says Rich McKinley, Tagged's senior data engineer. Hadoop is being applied to tasks beyond the capacity of its Greenplum database, which is still in use at Tagged: "We're looking toward doing more with Hadoop just for scale."
Although they laud Hadoop, users see issues that need fixing, such as deficiencies in reliability and job-tracking. Tagged's McKinley notes a problem with latency: "The time to get data in is quite quick and then, of course, I think everybody's big complaint is the high latency for doing your queries." Tagged has used Apache Hive, another Hadoop-derived project, for ad hoc queries. "That can take several minutes to get in a result that in Greenplum would return in a couple of seconds." Using Hadoop is cheaper than using Greenplum, though.
What's in store for Hadoop 2.0
Hadoop 1.0 was released late in 2011, featuring strong authentication via Kerberos and support for the HBase database. The release also places constraints on MapReduce that keep individual users from taking down clusters. But a new version is on the horizon: HortonWorks CTO Eric Baldeschwieler has provided a road map for Hadoop that includes the upcoming 2.0 release. (HortonWorks has been a contributor to Apache Hadoop.) Version 2.0, which went into an alpha release phase earlier this year, "has an end-to-end rewrite of the MapReduce layer and a pretty complete rewrite of all the storage logic and the HDFS layer as well," Baldeschwieler says.