Try out your Hadoop app on the world's largest cluster
EMC's Greenplum Analytics Workbench enables the Apache Hadoop open source community to validate code to scale on a regular, ongoing basis, for free
Are you looking to be on the cutting edge of Big Data? How would you like to test and refine your Hadoop application to see if it can handle the largest known cluster? Then you might be interested in what EMC's Greenplum unit is doing in its Las Vegas data center, where anyone can make use of their facility for free. Yes, you read that correctly. It has been in operation for less than a year, and is already getting rave reviews from more than a dozen different customers from all over the world.
Image credit: EMC
EMC claims the cluster is the largest of its kind available to the general public. It contains 1,000 specialized servers from Super Micro Computer, each running dual Intel Xeon processors with 48GB of RAM, and a total of 12,000 Seagate 2 TB drives. Mellanox Technologies 40Gb Ethernet adapters connect everything, alongside various pieces of VMware virtualization software. And for those of you familiar with Hadoop, they have the latest versions of tools such as MapReduce, Hive, Pig, and Mahout.
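To put those hardware figures in perspective, here is a rough back-of-the-envelope calculation. The server, RAM, and drive counts come from the paragraph above. The replication factor of 3 is an assumption (it is the stock HDFS default; the article does not say what the Workbench uses), and decimal units are used throughout.

```python
# Rough totals for the cluster, using the figures quoted in the article.
NODES = 1_000            # servers
RAM_PER_NODE_GB = 48     # RAM per server
DRIVES = 12_000          # total hard drives
DRIVE_TB = 2             # capacity per drive
HDFS_REPLICATION = 3     # ASSUMPTION: stock HDFS default, not stated in the article

total_ram_tb = NODES * RAM_PER_NODE_GB / 1_000    # GB -> TB
raw_disk_pb = DRIVES * DRIVE_TB / 1_000           # TB -> PB
usable_disk_pb = raw_disk_pb / HDFS_REPLICATION   # space left after replication
drives_per_node = DRIVES // NODES

print(f"Total RAM:        {total_ram_tb:.0f} TB")
print(f"Raw disk:         {raw_disk_pb:.0f} PB")
print(f"Usable (3x rep.): {usable_disk_pb:.0f} PB")
print(f"Drives per node:  {drives_per_node}")
```

Under that assumption, the cluster would hold roughly a third of its raw disk capacity as unique data.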
GREENPLUM ANALYTICS WORKBENCH
What it is: A Las Vegas data center with 54 racks of equipment
When it opened: July 2012
What it runs: 1,000 Super Micro computers with dual Intel Xeon X5670 CPUs; Mellanox Technologies 40Gb Ethernet adapters; 12,000 Seagate 2 TB hard drives
What tools it offers: Hadoop software, including MapReduce, HDFS, Hive, Jenkins, Plato, HBase, Pig, and Mahout
How to apply: Submit your name via the "learn more" link on this page, and a "Jedi Council" of 10 staffers sifts through the applications to set priorities and schedule time on the cluster.
What it costs: $0
That all may seem high-powered to those of us who are still amazed at (single) gigabit Ethernet speeds, but the cluster is already pushing the limits of its components. "We wish we had Intel Sandy Bridge processors. We have found that some of our applications have reached the bottlenecks of second generation PCIe, at 40Gbps," said Clinton Ooi, one of the Greenplum engineers. Think about that for a moment: 40-gigabit networks are now a bottleneck!
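One plausible reading of Ooi's remark can be checked with simple arithmetic. PCIe 2.0 signals at 5 gigatransfers per second per lane and uses 8b/10b encoding, so only 8 of every 10 bits carry data. The lane count below is an assumption (an x8 slot is common for network adapters; the article does not say what the Workbench servers use), and the function name is made up for this sketch.

```python
# Usable bandwidth of a PCIe 2.0 slot versus a 40Gb Ethernet port.
PCIE2_GT_PER_LANE = 5      # PCIe 2.0 signalling rate, gigatransfers/s per lane
PAYLOAD_BITS = 8           # 8b/10b encoding: 8 data bits ...
ENCODED_BITS = 10          # ... carried in every 10 bits on the wire
ETHERNET_GBPS = 40         # line rate of the adapters named in the article


def pcie2_usable_gbps(lanes):
    """Usable Gb/s in one direction for a PCIe 2.0 link with this many lanes."""
    return lanes * PCIE2_GT_PER_LANE * PAYLOAD_BITS / ENCODED_BITS


lanes = 8  # ASSUMPTION: x8 slot; not stated in the article
raw_gbps = lanes * PCIE2_GT_PER_LANE
usable_gbps = pcie2_usable_gbps(lanes)

print(f"x{lanes} raw signalling rate: {raw_gbps} Gb/s")
print(f"x{lanes} usable bandwidth:    {usable_gbps:.0f} Gb/s")
print(f"Bus slower than the NIC:  {usable_gbps < ETHERNET_GBPS}")
```

If the x8 assumption holds, the slot tops out below what the adapter can move, so the bus rather than the network would be the limiting factor.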
Ooi is one of four full-timers who run the cluster, which goes by the name Analytics Workbench. If you're interested in using the cluster, click on the "learn more" button, and submit your name.
Lots of people have clicked on that link already. The team gets at least one new request a day to run different applications. These requests are carefully scrutinized, as you might imagine. Given this demand, Greenplum uses a "Jedi Council" of 10 staffers who sift through the applications to set priorities and schedule time on the cluster. Not everyone wants all one thousand nodes for their tests, and the Workbench has been set up to run concurrent smaller jobs by multiple tenants.
"We take a fair amount of time to make sure what someone wants to do, and that it makes sense to use the cluster. Timing is everything," says Apurva Desai, a senior director of Hadoop Engineering at Greenplum.
Why give free access? "We want to give back to the Hadoop community, and learn something. We might eventually also contribute something back to the open source code," said Desai. That is a nice sentiment, but since opening their doors, Greenplum has had some interesting customers using their cluster.
One of their customers last year was a consultant who was testing a specialized HBase-like data store called Accumulo for a certain federal government agency. (If you click on the link you can figure out which one.) "It was a very complex task and took them several months to figure out the test processes and how to create the data set. But they learned how to tweak their algorithms and how to deploy it on their own cluster," said Desai.
Other customers include NASA, which used the cluster to analyze historical weather patterns, as well as Alpine Data Labs and Informatica.
Alpine is a Silicon Valley-based data science software company and has been a regular user of the Analytics Workbench for some of its banking and healthcare customers. Steven Hillion is the chief product officer of the firm. "We do data mining, but not just for eggheads," he said. "We are trying to build collaborative analytics solutions where business analysts and engineers are working with the data to provide insights." Alpine's platform and tools simplify the process of building very complex mathematical models of problems with many terabytes of data. These typically run across billions of rows and approach millions of variables.
Given the size of their data sets, they jumped at the chance to use the cluster when it was first announced. "We want to make sure that the algorithms that we are implementing really do scale to large data sets," he said. Alpine has its own Hadoop clusters in-house, but it needs access to tens or hundreds of machines to see what happens to its applications when all the data is loaded in. The company was reluctant to take its algorithms directly to customers without first proving them with scalability and performance tests. This is what they did before the Greenplum cluster came online.
"Most of our testing had been limited to about 10 machines previously," Hillion said. "We need to test on several different flavors of Hadoop too. We try to develop in a way that is agnostic and use common APIs. Greenplum has been good about using open standards and easy to work with. The cluster provides us with an extra level of comfort and helps us identify bottlenecks in extremely large data sets." Plus, they don't have to worry about the size of their data sets overwhelming the cluster.
"Taking into account all the hardware, networking and personnel, we wouldn't even think about building something like this ourselves," he said. And the Workbench is unique: "I don't know of any other independent software companies who are testing these sorts of algorithms on Hadoop at these levels of scale." So far, they have been impressed with the quality of the cluster, and downtime has been minimal.
Alpine has found three different benefits after spending months using the Greenplum cluster:
1. Taking algorithms that work and scale and fine-tuning them to get that last drop of extra performance.
2. Taking algorithms that work and scale and trying them out under a variety of circumstances, particularly with complex data structures, to find hidden bugs.
3. Finally, prototyping new algorithms that might not perform well enough initially, and then going back to the drawing board and trying out different approaches. "The Workbench becomes a great way to experiment quickly and produce real innovation," he says.
Australian data integration consulting firm Informatica had access to the Workbench for approximately three months. "During our peak usage we were using it on a daily basis running several jobs a day. The jobs took from roughly one to seven hours in duration. The purpose of our work was scalability testing so we were comparing runs exercising different levels of cluster resources," says Robert Marshall, one of the analysts at the firm. They were able to find bottlenecks using hundreds of nodes that weren't apparent with their five-node in-house Hadoop cluster, and identify areas for future optimization and improvement.
"Overall the Greenplum setup was very good. We were able to set up our jobs and scripts very quickly. There were very few changes we had to make to our scripts in order to get up and running in their environment," Marshall said. And even though they were connecting from Australia, they didn't see any significant lag or network latency across the links either.
Greenplum plans to use the cluster for a new series of Hadoop training classes, although those have not started yet. In the meantime, get your application in now if you want to be considered for time on the cluster and be part of this new wave of Big Data science.
IF YOU DECIDE TO TRY IT OUT ...
Understand what you want to test, and how many nodes (or what range of node counts) you'll need
Specify what additional Hadoop tools or versions you'll need
Run your test on your own Hadoop cluster first to establish benchmarks with a subset of your entire data set
Get your application in early as demand for the Workbench is high!
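For the benchmarking tip above, one simple way to summarize baseline runs is to compare the same job on the same data subset at several node counts. This is only a sketch: the `scaling_report` helper and the timings are hypothetical and do not come from the article or any Workbench user.

```python
# Summarize how well a fixed-size job scales as nodes are added.

def scaling_report(runs):
    """Return (nodes, speedup, efficiency) tuples, smallest run as baseline.

    runs maps node count -> wall-clock seconds for the same job and data.
    Efficiency of 1.0 means perfect scaling relative to the baseline.
    """
    base_nodes = min(runs)
    base_time = runs[base_nodes]
    report = []
    for nodes in sorted(runs):
        speedup = base_time / runs[nodes]
        ideal_speedup = nodes / base_nodes
        report.append((nodes, speedup, speedup / ideal_speedup))
    return report


# HYPOTHETICAL timings from an in-house cluster, in seconds.
baseline_runs = {5: 3600.0, 10: 1900.0, 20: 1100.0}
report = scaling_report(baseline_runs)

for nodes, speedup, efficiency in report:
    print(f"{nodes:>3} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Falling efficiency as nodes are added is the kind of signal that helps you decide how many nodes to request and what to look for at larger scale.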