March 27, 2012, 12:14 PM - DreamHost is regarded as one of the stronger web hosting outfits available on the Internet today. But the company may soon be better known as the backer of the Ceph distributed file system, which has quietly been ramping up to provide a strong alternative to Hadoop clusters for storage applications.
One of the most-touted benefits of using Hadoop is its deep connection to the Hadoop Distributed File System (HDFS), which provides the storage on commodity hardware so coveted by IT managers. But HDFS is not the only player in the big data storage game, and it can carry some limitations.
The best-known limitation with which Hadoop has to contend is probably the single name-node architecture of any given Hadoop cluster. In any such cluster, there is just one name node machine that (analogous to a file allocation table on a single hard drive) uses metadata to track where data is actually sitting on the data nodes in the cluster.
This setup immediately causes two potential problems. First, there's the single point of failure issue: if your name node goes bye-bye, so does your cluster (unless you've built in some failover configurations). Then there's the hard ceiling on metadata storage. Eventually, that single name node machine is going to fill up with metadata, which sets a limit on how much data a cluster can have.
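Both problems can be seen in a toy model. The sketch below is not HDFS code; the `ToyNameNode` class, its `max_entries` limit, and the paths and data node names are all hypothetical, chosen only to illustrate a single metadata table that can fill up or disappear.

```python
class NameNodeDown(Exception):
    """Raised when the single (toy) name node is unavailable."""


class MetadataFull(Exception):
    """Raised when the (toy) name node has no room for more metadata."""


class ToyNameNode:
    """Hypothetical stand-in for a single name node: one in-memory table
    mapping each file path to the data nodes that hold its data."""

    def __init__(self, max_entries):
        self.max_entries = max_entries  # assumed metadata capacity
        self.block_map = {}             # path -> list of data node names
        self.alive = True

    def add_file(self, path, data_nodes):
        if not self.alive:
            raise NameNodeDown("name node is down")
        is_new = path not in self.block_map
        if is_new and len(self.block_map) >= self.max_entries:
            raise MetadataFull("no room left for metadata")
        self.block_map[path] = list(data_nodes)

    def locate(self, path):
        if not self.alive:
            raise NameNodeDown("name node is down")
        return self.block_map[path]


def demo():
    results = {}
    name_node = ToyNameNode(max_entries=2)
    name_node.add_file("/logs/a", ["dn1", "dn2"])
    name_node.add_file("/logs/b", ["dn2", "dn3"])

    # Problem 2: the metadata table is full, so a third file is refused
    # even if the data nodes themselves still have free space.
    try:
        name_node.add_file("/logs/c", ["dn1"])
        results["third_file"] = "stored"
    except MetadataFull:
        results["third_file"] = "rejected"

    # Problem 1: once the name node is gone, nothing can be located,
    # even though the data nodes still hold every byte.
    name_node.alive = False
    try:
        name_node.locate("/logs/a")
        results["lookup_after_failure"] = "ok"
    except NameNodeDown:
        results["lookup_after_failure"] = "unreachable"
    return results


if __name__ == "__main__":
    print(demo())
```

In this model the data nodes never fail and never fill up; the cluster still stops accepting files and then stops serving them, purely because of the one machine holding the metadata.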
Since the single name node means Hadoop clusters can't scale up indefinitely, these problems are traditionally solved by scaling out. Add more clusters, the logic goes, and you build in failovers and more room for storage. Or, as in the case of a Hadoop distribution like MapR, the architecture is reconfigured to remove the single name-node choke point.
A team of computer scientists and engineers is trying to go even further with Ceph, an open source distributed network storage system with components that include its own high-performance, POSIX-compatible file system (CephFS) and an object-storage layer that enables applications to store and access data directly, wherever that data is stored in the cluster.
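One general way a storage system can let applications reach data directly, without asking a central lookup table, is to compute an object's location from its name. The sketch below uses rendezvous (highest-random-weight) hashing as a simplified illustration of that idea. It is not Ceph's actual placement algorithm, and the `place` function, the node names, and the replica count are assumptions made for the example.

```python
import hashlib


def _score(object_name, node):
    """Deterministic pseudo-random weight for an (object, node) pair."""
    digest = hashlib.sha256(f"{node}:{object_name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name, nodes, replicas=2):
    """Return the storage nodes that should hold object_name.

    Every client that knows the node list computes the same answer,
    so no central metadata server has to be consulted.
    """
    ranked = sorted(nodes, key=lambda node: _score(object_name, node), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    nodes = ["osd-a", "osd-b", "osd-c", "osd-d"]  # hypothetical node names
    print(place("photo-123", nodes))
```

Because the placement is a pure function of the object name and the node list, any client can find an object on its own, and removing a node that does not hold a given object leaves that object's placement unchanged.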
And it's all spearheaded at DreamHost.
This may seem like an unlikely source for what could become the next big thing in big data and storage. That is, until one meets DreamHost co-founder and Ceph Architect Sage Weil.
Weil is one of those people who, upon meeting you, immediately convey the sense that you are not the smartest person in the room anymore. It's not through any sense of arrogance… just a quiet certainty about what he's doing and where technology needs to be heading.
Ceph is actually something that Weil started while DreamHost and its companion company New Dream Network were in their early stages. Weil initially became involved with storage systems architecture while he was a graduate student at UC Santa Cruz (and later the Lawrence Livermore National Labs), working on a US Department of Energy-funded project for a high-performance petabyte-scale computing system.
As Weil completed the DoE project, it became clear to him that much of the underlying technology could be applied to distributed systems, using an object-oriented architecture that would enable high scalability and make life a lot easier for programmers, since data objects are typically easier for applications to work with than data recordsets. Weil would later turn this work into the open source Ceph project.
Author's Note: The following paragraph has been corrected to clarify the connection between Ceph and OpenStack. Despite some reports in the media, there is no official relationship between the two projects.
Weil gets the most excited when describing the community that's grown around the Ceph project, which is licensed under the LGPL. Ceph software has been used within the OpenStack cloud computing project, as developers there experiment with Ceph's functionality.
The open aspect of Ceph is, for Weil, one of its best features.
"There is a problem with the storage industry now," Weil explained. "Storage is either locked-in by vendors or dependent on appliances."
Ceph's open model eliminates this problem altogether, he added. The fact that Ceph is POSIX-compatible certainly helps too, since anything that can be stored and accessed by Linux or Unix can be accessed on Ceph. This makes it very friendly for Linux developers right out of the gate.
And Ceph is about to become even more widely known, with the launch of a new cloud-storage spinoff company of the same name in April. This could make DreamHost a much bigger player among hardware vendors looking for an inexpensive non-Hadoop storage platform.
With the availability of a general storage platform like Ceph, big data is about to become a whole lot bigger.
Read more of Brian Proffitt's Zettatag and Open for Discussion blogs at ITworld.