This introduces significant cost savings at the hardware and management levels. It should be noted that while HDFS is the usual filesystem used with Hadoop, it is by no means the only one. For its Elastic Compute Cloud (EC2) solutions, Amazon has adapted its S3 filesystem for Hadoop. DataStax' Brisk is a self-described Hadoop distribution that replaces HDFS with Apache Cassandra's CassandraFS and throws in the data query and analysis capabilities of Hive for good measure, unifying realtime storage and analytics capabilities. Such customization and adaptation is made all the easier by Hadoop's open source nature.
MapReduce is a bit harder to conceptualize. Murthy describes it as a data processing, programming paradigm ... but what does that mean, exactly? As an illustration, it helps to think of MapReduce as analogous to a database engine, much as Jet is the engine behind Microsoft Access (a detail many people have forgotten).
When a request for information comes in, MapReduce uses two components: a JobTracker that sits on the Hadoop master node, and TaskTrackers that sit out on each node within the Hadoop network. The process is fairly linear. MapReduce will break data requests down into a discrete set of tasks, then use the JobTracker to send the MapReduce jobs out to the TaskTrackers. To cut down on network latency, jobs are assigned to the same node where the data lives, or at the very least to a node on the same rack.
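To make that flow concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the node and rack names, the pick_node helper, and the word-count job are all hypothetical, chosen only to illustrate the idea of placing each task near its data and then mapping, grouping, and reducing the results.

```python
from collections import defaultdict

# Hypothetical cluster layout (illustration only): node name -> rack name.
RACK_OF = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}

def pick_node(data_nodes, free_nodes):
    """Prefer a free node holding the data, then one on the same rack, then any."""
    for node in free_nodes:
        if node in data_nodes:
            return node, "data-local"
    data_racks = {RACK_OF[n] for n in data_nodes}
    for node in free_nodes:
        if RACK_OF[node] in data_racks:
            return node, "rack-local"
    return free_nodes[0], "off-rack"

def map_task(text):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_task(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(splits, free_nodes):
    """Play the JobTracker role: place map tasks, then group and reduce."""
    placements = []
    grouped = defaultdict(list)
    for text, data_nodes in splits:
        placements.append(pick_node(data_nodes, free_nodes))
        for word, count in map_task(text):  # on a real cluster this runs on the chosen node
            grouped[word].append(count)     # shuffle: group emitted values by key
    totals = dict(reduce_task(w, c) for w, c in grouped.items())
    return totals, placements

# Each split is (contents, nodes that hold a copy of it).
splits = [
    ("big data big cluster", ["node1"]),
    ("data lives on nodes", ["node3"]),
]
totals, placements = run_job(splits, ["node2", "node3"])
print(totals)
print(placements)
```

In this toy run, node1 is busy, so the first split's task lands on node2 (same rack as the data), while the second runs directly on node3, where its data lives. A real cluster does the same kind of bookkeeping across many machines, with the map and reduce steps running in parallel.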
There's more to Hadoop than just the distributed filesystem and MapReduce, as Figure 1 shows. A Hortonworks representation of the Hadoop framework, this image shows other components that can be used with Hadoop, including:
HCatalog: A table and storage management service for Hadoop data.
Pig: A programming and data flow interface for MapReduce.
Hive: A data warehousing solution that makes use of a SQL-like language, HiveQL, to create queries for Hadoop data.
It is Hive, Murthy said, that makes Hadoop much easier to use than one might expect from a so-called NoSQL database. Using HiveQL, data analysts can pull information out of a Hadoop database with the same kind of queries they're used to writing in an RDBMS. Moving to Hadoop will still involve a transition, of course, as there are some differences between SQL and HiveQL, but these differences are not that great.
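As a rough illustration of why the query style feels familiar, the Python sketch below runs a simple grouped count two ways. The query string uses only constructs common to standard SQL and HiveQL; SQLite stands in for Hive here purely so the example can run anywhere, and the employees table and its rows are made up. The second function computes the same answer in map-and-reduce style, which is roughly the kind of work a Hive query is turned into on a classic MapReduce-backed Hadoop cluster.

```python
import sqlite3
from collections import defaultdict

# A query using only constructs shared by standard SQL and HiveQL.
QUERY = "SELECT dept, COUNT(*) FROM employees GROUP BY dept ORDER BY dept"

# Hypothetical sample data: (name, dept).
rows = [("ann", "sales"), ("bob", "ops"), ("cho", "sales")]

def run_sql(rows):
    """Run QUERY against an in-memory SQLite table (a stand-in for Hive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

def run_mapreduce(rows):
    """Compute the same grouped count in map-and-reduce style."""
    grouped = defaultdict(list)
    for _name, dept in rows:        # map: emit (dept, 1) for each row
        grouped[dept].append(1)     # shuffle: group the 1s by dept
    # reduce: sum each group, then sort by dept to mirror ORDER BY
    return sorted((dept, sum(ones)) for dept, ones in grouped.items())

print(run_sql(rows))
print(run_mapreduce(rows))
```

Both calls return the same per-department counts. The analyst writes only the one-line query; the map, group, and reduce plumbing is handled by the engine.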
What do you need to know?
Data analysts won't have too much trouble adapting to Hadoop, but DBAs may face a steeper learning curve. That's because the distributed filesystem is a big departure from the traditional realm of table storage in an RDBMS.
The complexity of Hadoop is definitely a big hurdle for prospective administrators: because the framework is made up of many different components, you have to manage a lot of different elements at once. Don't look for a shiny GUI to handle this, either. Hadoop, Hive, Sqoop, and other tools in the Hadoop ecosystem are controlled from the command line. Since Hadoop is Java-based, and MapReduce makes use of Java classes, much of the interaction is the kind where experience as a developer (and as a Java developer in particular) will be very handy.
Most Hadoop-related jobs call for experience with large-scale, distributed systems, and a clear understanding of system design and development through scaling, performance, and scheduling. In addition to experience in Java, programmers should be hands-on and have a good background in data structures and parallel programming techniques. Cloud experience of any kind is a big plus.
This is a lot to have under your belt; so, for systems engineers and administrators who want to make the jump to Hadoop, Hortonworks will be offering a three-day Administering Apache Hadoop class. Cloudera has an active administration course now, as part of its Cloudera University curriculum. Courses on Hive, Pig, and developer training are also available. You can find additional coursework on the Hadoop Support wiki on the Apache site.
Moving down the road
Part 2 of "The Road to Hadoop" will look at the business implications of moving to Hadoop. You'll see which businesses should be using Hadoop and how deployments usually happen. In Part 3, we'll examine the techniques and costs involved in moving to Hadoop from an existing RDBMS, as well as the tools used to analyze Hadoop data faster and more cheaply than any RDBMS.
This article, "The road to Hadoop, Part 1: Skill sets and training," was originally published at ITworld.