Best practices for selecting storage services for big data

By Frank J. Ohlhorst, CIO | Big Data

Disk storage is a lot like closet space—you can never have enough. Nowhere is this truer than in the world of big data. The very name—"big data"—implies more data than a typical storage platform can handle. So where exactly does this leave the ever-vigilant CIO? With a multitude of decisions to make and very little information to go by.

However, wading through the storage options for big data does not have to be an impossible journey. It all comes down to combining some basic understanding of the challenge with a little common sense and a sprinkle of budgetary constraint.

What Makes Big Data a Big Deal

First of all, it is important to understand how big data differs from other forms of data and how the associated technologies (mostly analytics applications) work with it. In itself, big data is a generic term that simply means there is too much data to deal with using standard storage technologies. However, there is much more to it than that—big data can consist of terabytes (or even petabytes) of information that combine structured data (databases, logs, SQL tables and so on) with unstructured data (social media posts, sensor output, multimedia). What's more, most of that data can lack indexes or other organizational structures, and may consist of many different file types.

That circumstance greatly complicates dealing with big data. The lack of consistency eliminates standard processing and storage techniques from the mix, while the operational overhead and sheer volume of data make it difficult to efficiently process using the standard server and SAN approach. In other words, big data requires something different: its own platform, and that is where Hadoop comes into the picture.

Hadoop is an open source project that offers a way to build a platform from commodity hardware (servers and internal server storage) formed into a cluster that can process big data requests in parallel. On the storage side, the key component of the project is the Hadoop Distributed File System (HDFS), which can store very large files across the machines in a cluster. HDFS works by splitting files into blocks, creating multiple replicas of each block and distributing them across compute nodes throughout the cluster, which facilitates reliable, extremely rapid computations.
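To make that replication model a little more concrete, the short Java sketch below uses Hadoop's standard FileSystem client API to write a file into HDFS with an explicit replication factor and block size, then reads back how the file is stored. It is a minimal illustration only; the NameNode address and file path are placeholder assumptions, not values from any particular cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode. The address below is
        // a placeholder -- substitute your own fs.defaultFS value.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Write a file with a replication factor of 3 and a 128 MB block
        // size. HDFS splits the file into blocks of this size and keeps
        // three copies of each block on different DataNodes in the cluster.
        Path file = new Path("/bigdata/sample/events.log");
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;
        int bufferSize = 4096;

        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.write("example record\n".getBytes(StandardCharsets.UTF_8));
        }

        // Confirm how the file is actually stored.
        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
        System.out.println("Block size:  " + fs.getFileStatus(file).getBlockSize());
    }
}
```

In practice, most deployments simply set cluster-wide defaults (the dfs.replication and dfs.blocksize properties) rather than choosing them per file; the per-file overload is shown here only to make the block-and-replica behavior visible.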

