June 13, 2012, 9:07 AM — If you wanted to use Amazon's Elastic MapReduce (EMR) service for Hadoop functionality in a public cloud, then you would have had to use Amazon's own version of Hadoop to run your big data jobs--until today.
Coupled with the release of its 2.0 Hadoop distribution, MapR will have the first non-Amazon machine instances of Hadoop available as an option on EMR, starting today. That makes it the first commercial Hadoop vendor to get this kind of availability in Amazon's public cloud.
This is a pretty big deal for MapR. Although you can (and some do) run any flavor of Hadoop (Cloudera, Hortonworks, or straight-up Apache Hadoop) as machines on Amazon's Elastic Compute Cloud (EC2) service, according to Amazon, most Hadoop jobs on its cloud systems were actually running on EMR, not EC2.
By giving EMR users the option to run MapR M3 and M5 editions on EMR, MapR is suddenly getting a lot more exposure even as customers get more choices.
MapR Hadoop's M3 edition, also known as its Community Edition, features a base packaging of all the Hadoop goodies, such as HBase, Pig, Hive, Mahout, Cascading, Sqoop, and Flume. The M5 Enterprise Edition has all of that plus additional high availability and data protection tools.
In non-cloud environments, the Community Edition of MapR Hadoop is free and the Enterprise Edition has a license fee attached. That pricing model will be mirrored in EMR: the M3 edition is available at no additional charge over standard Amazon EMR usage fees, while the M5 edition adds an additional hourly cost.
According to VP of Marketing Jack Norris, using the MapR Hadoop distro on EMR will give Hadoop users access to all of Amazon's management tools for the "basics" of spinning server instances up and down, and to MapR's Hadoop-specific management tools that take care of process flows and MapReduce jobs.
These management tools are also getting a renovation in today's general release of MapR Hadoop 2.0. One of the more constraining elements of Hadoop use is the default job scheduler, which is typically a first-in, first-out job queue. All of the commercial Hadoop vendors have their own approach to improving this, and with MapR 2.0, the approach is to give administrators more control over where data sits within a given Hadoop cluster.
For anyone familiar with Hadoop's typical methodology of storing data wherever and having the system keep track of data locations, this might seem a bit counterintuitive at first. Norris explained that on a typical homogeneous cluster, Hadoop's native approach works perfectly well.
"But as you expand use cases, and add more departments and different kinds of machines, job management becomes far more complex," Norris said. The multi-tenancy capability of creating logical volumes within physical ones enables Hadoop admins to corral data together within certain locations of a cluster, so that different teams hitting the data will be able to run MapReduce jobs that don't interfere with each other.
Administrators will also be able to specify nodes and limit jobs to those locations in order to enhance job management. Storage parameters will be granularly controlled as well.
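As a rough illustration of how this kind of data-placement control surfaces in practice, here is a sketch using MapR's maprcli administration tool. The topology path, volume name, and server ID below are hypothetical, and exact flags can vary by MapR version:

```shell
# Assign a node to a named topology (rack/location) within the cluster.
# The server ID here is a placeholder for a real node's ID.
maprcli node move -serverids 5478192499973130150 -topology /data/analytics

# Create a logical volume pinned to that topology, so the analytics
# team's data -- and the MapReduce jobs that read it -- stay on those
# nodes rather than spreading across the whole cluster.
maprcli volume create -name analytics -path /analytics -topology /data/analytics
```

In effect, the volume's topology setting is what lets different teams hit the same cluster without their jobs contending for the same disks and nodes.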
These are the kinds of features we can expect to see from all of the Hadoop vendors as they work to make Hadoop, which is not always the easiest thing to manage, more approachable. Adding strong management tools is currently the most effective differentiator between the Hadoop distros. For MapR, its new presence on EMR should also make it an attractive option for anyone shopping around.
Read more of Brian Proffitt's Open for Discussion blog and follow the latest IT news at ITworld. Drop Brian a line or follow Brian on Twitter at @TheTechScribe.