- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- HDFS: A distributed file system that provides high throughput access to application data.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- MapReduce: A software framework for distributed processing of large data sets on compute clusters.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper: A high-performance coordination service for distributed applications.
Most implementations of a Hadoop platform will include at least some of these subprojects, as they are often necessary for exploiting Big Data. For example, most organizations will choose HDFS as the primary distributed file system and HBase as the database, which can store billions of rows of data. The use of MapReduce is almost a given, since its parallel processing engine is what brings speed and agility to the Hadoop platform.
With MapReduce, developers can create programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. The MapReduce framework is broken down into two functional areas: Map, a function that parcels out work to different nodes in the distributed cluster, and Reduce, a function that collates the intermediate results from those nodes and resolves them into a final result, such as a single value for each key.
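The Map and Reduce split described above can be illustrated with a minimal, single-process sketch in Python. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the word-count task is an assumed example chosen only because it is small. Each input string stands in for a chunk of data that a real cluster would hand to a separate node.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    pairs = []
    for document in documents:
        # In a real cluster, each chunk would be mapped on a different node.
        pairs.extend(map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a real deployment the framework itself handles distributing the chunks and grouping values by key; the developer typically supplies only the map and reduce logic.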
One of MapReduce's primary advantages is that it is fault-tolerant, which it accomplishes by monitoring each node in the cluster; each node is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and reassigns the work to other nodes.
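The monitoring scheme described above can be sketched in a few lines of Python. This is a simplified illustration, not Hadoop's actual scheduler: the function names, the timestamp dictionary, the timeout value, and the round-robin reassignment policy are all assumptions made for the example.

```python
def find_silent_nodes(last_heartbeat, now, timeout):
    """Return nodes whose last report is older than the allowed interval."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout
    )


def reassign_tasks(assignments, last_heartbeat, now, timeout):
    """Move tasks from silent nodes onto healthy ones (round-robin, assumed policy)."""
    silent = set(find_silent_nodes(last_heartbeat, now, timeout))
    healthy = sorted(node for node in last_heartbeat if node not in silent)
    if not healthy:
        raise RuntimeError("no healthy nodes left to take over the work")

    # Healthy nodes keep the tasks they already have.
    updated = {node: list(assignments.get(node, [])) for node in healthy}

    # Tasks on silent nodes are handed out to the healthy nodes in turn.
    orphaned = [task for node in sorted(silent) for task in assignments.get(node, [])]
    for index, task in enumerate(orphaned):
        updated[healthy[index % len(healthy)]].append(task)
    return updated


if __name__ == "__main__":
    tasks = {"a": ["t1"], "b": ["t2", "t3"], "c": ["t4"]}
    heartbeats = {"a": 100, "b": 60, "c": 98}  # time of last report, in seconds
    print(reassign_tasks(tasks, heartbeats, now=105, timeout=30))
```

Here node `b` has been silent for 45 seconds against a 30-second limit, so its two tasks are split between the remaining nodes.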
Building on Hadoop
In addition to many open source support tools such as Clojure and Thrift, dozens of commercial options exist, many of them built with Hadoop as the foundation. The PricewaterhouseCoopers Center for Technology and Innovation has published an in-depth guide to the Big Data building blocks and how they relate to both IT deployment and business usage.