Top down: CXO mandates
Another common way Hadoop is deployed is from the top down. A C-level executive watching trends will note the very low costs of storage on a Hadoop system and will begin to formally explore whether the Hadoop solution is the right thing for the company.
This is where vendors like Murthy's current employer, Hortonworks, Inc., come in. Hortonworks, launched at the end of June 2011, was founded by Murthy and several other members of Yahoo's Hadoop team, and provides open source Hadoop products as well as training, support, and deployment services.
Usually, Murthy explained, Hortonworks will work with a potential new client and make a small set of recommendations based on what the client needs. They will also deploy a small proof-of-concept Hadoop cluster, anywhere from 20 to 100 nodes, and let the customer see the value of Hadoop for themselves. This formal process is similar to what other Hadoop vendors, such as Cloudera and MapR, provide, so you'll have a number of strong options to choose from when seeking Hadoop consulting and support.
Get the Sqoop
Whether you do it yourself, or employ help to do it, at some point you are going to need to migrate your data from its current storage location to Hadoop.
The best tool for doing this, especially from an RDBMS, is Cloudera's Sqoop ("SQL-to-Hadoop"). Sqoop is a command-line application that can import individual tables or whole databases into the Hadoop Distributed Filesystem (HDFS). Sqoop uses the DBInputFormat Java connector, which enables MapReduce to pull in relational data through the JDBC drivers available for MySQL, PostgreSQL, Oracle, and most other popular databases.
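As a rough sketch, a single-table import looks like this; the hostname, database, table, and username below are hypothetical placeholders, and the command needs a running Hadoop cluster with the matching JDBC driver on Sqoop's classpath:

```shell
# Import one table from a MySQL database into HDFS.
# Connection string, credentials, and paths are placeholders.
sqoop import \
  --connect jdbc:mysql://db.example.com/salesdb \
  --username reports \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4
```

To pull in an entire database rather than one table, Sqoop's `import-all-tables` tool works the same way, minus the `--table` argument.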
Sqoop will also generate the Java classes needed for MapReduce to interact with the data, by deserializing record rows into discrete fields of information. You can also use Sqoop to import RDBMS data right into your Hive data warehouse.
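The Hive route is a one-flag change: adding `--hive-import` tells Sqoop to create the Hive table definition and load the imported data into it. Again, the connection details and table names here are hypothetical:

```shell
# Same import, but land the data directly in a Hive table.
# Sqoop generates the Hive DDL from the source table's schema.
sqoop import \
  --connect jdbc:mysql://db.example.com/salesdb \
  --username reports \
  --table orders \
  --hive-import \
  --hive-table orders
```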
Because of this functionality, there is very little you should have to do to prepare your data for a migration to Hadoop, other than common-sense practices like deduplicating your data and keeping your RDBMS well maintained.
Explore the Hive
As described in the first article in this series, Hive is the part of the Hadoop framework that enables analysts to structure and query data in the HDFS. Data can be summarized, queried, and analyzed using the Hive Query Language (HiveQL), which is similar enough to SQL that analysts should find such operations straightforward.
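A typical summary query run from the Hive command line might look like the following; the `orders` table and its columns are hypothetical, and the query needs a Hive installation pointed at your cluster:

```shell
# Run a HiveQL aggregation from the shell with hive -e.
# Table and column names are illustrative only.
hive -e "
  SELECT region,
         COUNT(*)   AS order_count,
         SUM(total) AS revenue
  FROM orders
  GROUP BY region
  ORDER BY revenue DESC;
"
```

Anyone who has written a SQL GROUP BY will recognize this immediately; the difference is that Hive compiles it into MapReduce jobs behind the scenes.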
Hive also enables MapReduce programmers to plug in their own custom data mappers and reducers directly, should HiveQL prove unable to produce the information needed.
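One common way to do this plugging-in is HiveQL's TRANSFORM clause, which streams rows through an external script. This sketch assumes a hypothetical `my_mapper.py` script that reads tab-separated rows on stdin and writes transformed rows to stdout, plus the same illustrative `orders` table as above:

```shell
# Stream rows through a custom mapper script via TRANSFORM.
# ADD FILE ships the script to the cluster nodes running the job.
hive -e "
  ADD FILE my_mapper.py;
  SELECT TRANSFORM (order_id, total)
         USING 'python my_mapper.py'
         AS (order_id, total_with_tax)
  FROM orders;
"
```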
Take care when considering Hive: because Hadoop is a batch-processing system, its jobs have high latency, which translates into very high latencies for Hive queries (minutes, not seconds). Hive is therefore not a good fit for real-time processing. If that is your need, consider Apache Cassandra, an open source distributed database management system that is much better suited to real-time workloads.
Arriving at Hadoop
The migration path to Hadoop will vary, depending on your organization's needs, but Hadoop is a system that may surprise you in the value it can provide.
Hadoop is not strictly the purview of big data. It's for any organization that needs cheaper storage and the capability to analyze a lot of data efficiently. Is that organization yours?