Massive data volumes making Hadoop hot

Complex data analytics requirements are driving interest in open source Hadoop technology, say users and analysts

Computerworld | Open Source, Analytics, data management

Relational database technologies focus on the speed of data retrieval, complex query support, and transaction reliability, integrity and consistency. "What they don't do very well is accept new data quickly," he said.

"Hadoop reverses that. You can put data into Hadoop at ridiculously fast rates," he said. Hadoop's file structure allows companies to capture and consolidate virtually any structured or complex data type, such as web server logs, metadata, audio and video files, unstructured e-mail content, Twitter stream data and social media content, he said.
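That ingestion contrast can be sketched in miniature: an HDFS-style sink accepts any bytes as-is, while a relational insert must satisfy a schema first. A minimal Python sketch, where `RawLogSink` and the sample records are invented for illustration:

```python
import json

# A Hadoop-style sink: append any record as raw bytes, with no schema check.
class RawLogSink:
    def __init__(self):
        self.blocks = []  # stands in for HDFS's large append-only file blocks

    def append(self, record_bytes):
        self.blocks.append(record_bytes)  # accepted as-is, never parsed

sink = RawLogSink()
# Very different record shapes all land without any upfront data modeling.
sink.append(b'192.0.2.1 - - [10/Oct/2011] "GET / HTTP/1.1" 200')           # web log
sink.append(json.dumps({"tweet": "hadoop is hot", "user": "a"}).encode())  # social data
sink.append(b"\x00\x01raw audio frame\x02")                                # binary media

print(len(sink.blocks))  # 3: everything was accepted unmodified
```

An RDBMS, by contrast, would have to validate each of those records against a table definition before accepting it, which is exactly the write-path overhead being described here.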

The technology therefore is ideal for companies looking to analyze massive volumes of structured and unstructured data.

Retrieving raw data from HDFS and processing it, however, is not nearly as easy or convenient as with typical database systems, because the data is not organized or structured, Befus said. "Essentially what Hadoop does is to write data out in large files. It does not care what's in the files. It just manages them and makes sure that there are multiple copies of them."
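Because structure is imposed only when the files are read, processing follows the MapReduce pattern: parse each raw line, group by key, then aggregate. A toy single-process sketch of that map/shuffle/reduce flow over raw log lines (the log format and values are invented for illustration):

```python
from collections import defaultdict

raw_lines = [
    "/home 200", "/about 404", "/home 200", "/home 500", "/about 200",
]

# Map: parse each raw line and emit (key, 1) pairs. Parsing happens at
# read time, not when the data was written.
mapped = [(line.split()[0], 1) for line in raw_lines]

# Shuffle: group emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group into a final count.
hits = {key: sum(values) for key, values in groups.items()}
print(hits)  # {'/home': 3, '/about': 2}
```

A real Hadoop job runs the map and reduce steps in parallel across the cluster's file blocks; the logic per record is the same.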

Early on, users had to write jobs in a programming language like Java in order to parse and then query raw data in Hadoop. But tools are now available that can be used to write SQL-like queries for data stored in Hadoop, Befus said.

Tynt uses a popular tool called Pig to write queries against data stored in Hadoop. Another widely used option is Hive.
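As an illustration, a Hive query over such data reads much like ordinary SQL; Hive translates it into MapReduce jobs behind the scenes. A sketch, with a hypothetical table and column names invented for this example:

```sql
-- Hypothetical table of parsed web-server log lines.
SELECT url, COUNT(*) AS hits
FROM pageview_logs
WHERE status = 200
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

Pig expresses the same kind of pipeline in its own dataflow language, Pig Latin; either way, no hand-written Java job is required.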

According to Befus, Hadoop's architecture makes it ideal for running batch processing applications involving 'big data.'

Hadoop can also be used for more real-time business intelligence applications.

Increasingly, companies like OpenLogic have begun using another open source technology, HBase, on top of Hadoop to enable fast querying of the data in HDFS. HBase is a column-oriented data store that runs on top of Hadoop and supports real-time access and querying.
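HBase's data model can be pictured as a sorted map of row key → column family:qualifier → value, with individual cells fetched directly by key rather than by scanning files. A toy in-memory sketch of that model (the row keys, column names and `put`/`get` helpers here are invented; real HBase also versions each cell by timestamp, which is omitted):

```python
# Toy stand-in for an HBase table: row key -> {"family:qualifier": value}.
table = {}

def put(row_key, column, value):
    """Write one cell, addressed by row key and 'family:qualifier' column."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column):
    """Read one cell by direct keyed lookup. This keyed access is what
    makes reads feel real-time, versus re-scanning raw HDFS files with
    a batch job."""
    return table.get(row_key, {}).get(column)

put("pkg:examplelib-1.0", "meta:license", "Apache-2.0")
put("pkg:examplelib-1.0", "meta:version", "1.0")
put("pkg:otherlib-2.3",   "meta:license", "MIT")

print(get("pkg:examplelib-1.0", "meta:version"))  # 1.0
```

The durability and replication of the underlying cells still come from HDFS; HBase adds the keyed, low-latency read/write path on top.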

OpenLogic offers enterprises a service for verifying that open source code is properly attributed and is in full compliance with open source licenses.

To deliver the service, OpenLogic maintains a comprehensive database of hundreds of thousands of open source packages. Metadata, version numbers and revision histories are stored on a Hadoop cluster and accessed via HBase.

Rod Cope, CTO of OpenLogic, said the company gets the best of both worlds with Hadoop. "A lot of the data we have won't fit into an RDBMS like MySQL and Oracle. So the best option out there is Hadoop," he said.

By running HBase on top of Hadoop, OpenLogic has also been able to enable real-time data access in nearly the same manner as conventional database technologies, he said.

Originally published on Computerworld.