Enterprise Hadoop: Big data processing made easier

Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs

By Peter Wayner, InfoWorld | Data Center/Servers

It's been a big year for Apache Hadoop, the open source project that helps you split your workload among a rack of computers. The buzzword is now well known to your boss but still just a hazy concept for your boss's boss. That puts it in the sweet spot where there's plenty of room for experimentation. The list of companies using Hadoop in production work grows longer each day, and it probably won't be long before "Hadoop cluster" takes over the role that the words "crazy supercomputer" used to play in thriller movies. The next version of the WOPR is bound to run Hadoop.
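For readers who have not seen how a workload gets split, here is a minimal sketch of the map, shuffle, and reduce pattern that Hadoop popularized. It runs in a single Python process and does not use Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are made up for illustration. On a real cluster, each chunk would be handed to a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, shuffle the results, then reduce each key."""
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk would go to its own node
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input: two chunks standing in for blocks of a larger file.
    print(word_count(["big data big cluster", "data cluster data"]))
```

The point of the pattern is that the map step needs no coordination between chunks, so adding machines adds capacity; only the shuffle has to move data between them.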

The area is flourishing as the core project attracts a wide collection of helper projects that organize the workload and make it simpler to manage a collection of jobs to run at particular times. There's HDFS, a distributed file system that organizes the data spread out around the cluster; Hive, a data warehousing layer for making sense of this data; Mahout, a collection of routines for trying to learn something from said data; and ZooKeeper, a tool for keeping all of the balls in the air. At least a half-dozen other open source tools live in a stable orbit around Hadoop.

The open source projects are just the beginning -- a surprisingly large number of companies are emerging with the plan of helping people actually use Hadoop. Some are just selling support, and others are building their own tools that sit alongside Hadoop and make it easier to use.

This kind of competition is usually seen as open source at its best. A core collection of packages serves as a standard that keeps everyone in sync. Each of the groups is competing to add the right sauce that will attract customers, both paying and nonpaying. There continues to be controversy over just how much is rolled into the central collection, as there can be in any major open source project, but there is so much experimentation going on that it's hard to dwell on how much is being shared.

To get a feel for the excitement, I took four major collections out for a test-drive. I powered up a cluster of nodes on Rackspace, installed the tools, pushed the buttons, and ran some sample jobs. It's getting to be surprisingly easy to spend a few pennies for an hour or two of machine time -- so much so that I found myself debating whether it was worth leaving my cluster idling over lunchtime. Lest anyone doubt the efficiency of cloud computing, I noticed that the rate for my cluster of relatively fat machines with 4GB of RAM was less than the cost to park a car around the corner. The parking meters spin faster.

The not-so-good news is that these collections are far from perfect. None of the tools I tried worked exactly as promised. There were always small glitches. I often found myself reading the log files and paging through endless lists of Java stack dumps. (Someone is going to have to apply Hadoop to analyzing the endless stack dumps. They're getting so involved that I doubt a single machine will be able to parse them for much longer.) After a few seconds, I could usually get things on track again. These tools may not require someone with much experience to use once they're running, but they can't be installed unless you're fairly adept with the ways that the Java stack is organized.

Despite these impediments, I spent most of my time churning through data. The good news is that all of these tools make it pretty easy to get a cluster of computers working together to solve problems. Using these tools is much easier than downloading and configuring the source code yourself. They're designed to be one-button applications, and they come close to achieving that goal.


Originally published on InfoWorld.
