June 22, 2009, 9:04 PM - I am privileged to serve as co-chair of the Cloud Services SIG for SDForum, a Silicon Valley-based non-profit that is a great resource for technologists, entrepreneurs, and investors to meet and investigate new technologies. We've been running the SIG since January, and it's been a great experience to see what people are doing with cloud computing (if you're located in Silicon Valley, please come to one of our meetings; you'll enjoy it and learn a lot). In addition, just by virtue of being located in Silicon Valley, I get the opportunity to see lots of great new technologies - like yesterday, when I attended the Amazon Web Services Start-Up Event at the Plug and Play Tech Center in Sunnyvale.
It is striking how companies are leveraging cloud computing to create new products and services. I thought I would write about a few of them this week to give some insight into how people are taking advantage of the characteristics of cloud computing.
Big Data: As you know, I am a big believer in the big data theme - that organizations are moving beyond transactions and into relationships and content, thereby exponentially increasing the amount of data under storage - and requiring much more (and deeper) analytics. Moreover, the traditional tools used to manage data, from both a pure storage perspective and a tool perspective (e.g., database engines), don't scale very well, either technically or economically. At the Cloud Services SIG last month, several companies presented how they integrate with the cloud to better address the big data problem.
First off, we had a Google representative, who discussed Google Datastore, which is a robust key/value pair storage mechanism designed to provide massive scalability. While not offering the extensibility or flexibility of a relational database, Google Datastore addresses common cloud storage requirements, which are typically very large amounts of relatively simple data.
We then heard from Cloudera, which offers a commercially supported Hadoop distribution. Hadoop is a great tool to enable parallel processing of very large amounts of data with an aim of performing a relatively simple operation on some portion of the data. Hadoop, which I wrote about a few months ago, has a distributed file system that is redundantly spread throughout a set of servers, which is used to store and retrieve data. A job is launched against the data store as a map/reduce function: the map step processes pieces of the data in parallel on the servers that hold them, and the reduce step combines the intermediate results into the final data set. Hadoop is widely used for very large data sets that outstrip the capacity limits of traditional databases and data warehouses. Incidentally, as I mentioned in my earlier blog piece, Amazon offers Hadoop functionality directly as an AWS offering.
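To make the map/reduce idea concrete, here is a minimal single-process sketch in Python. It does not use Hadoop or any Hadoop API; the function names (run_job, word_count_mapper, sum_reducer) are made up for illustration, and a real Hadoop job would spread the map and reduce steps across many servers rather than run them in one loop.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run map, shuffle, and reduce in sequence (one process, no cluster)."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical example job: count how often each word appears.
def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    return sum(values)


if __name__ == "__main__":
    lines = ["big data in the cloud", "big data at scale"]
    print(run_job(lines, word_count_mapper, sum_reducer))
```

The appeal of the model is that the mapper looks at one record at a time and the reducer looks at one key at a time, so both steps can in principle be split across as many machines as are available.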
The last speaker of the night represented Aster Data, a company whose parallel database is self-organizing (that is to say, if a new server is put into the parallel pool, the data automagically repartitions itself without need for manual intervention). Aster Data can be run in a cloud environment - indeed, with its ability to incorporate new servers, it is particularly well-suited for a cloud environment. The product also includes map/reduce functionality, which provides great flexibility in allowing a developer to decide at run-time which type of query is best for a particular task. By the way, at next week's Cloud Services SIG, Greenplum will present. Greenplum is somewhat analogous to Aster Data, and recently announced a cloud product that is oriented toward enabling organizations to build an internal cloud of data warehouse capacity that can be doled out as needed. To give an idea of how large the data sets are that companies are now addressing, eBay uses Greenplum to manage a 6-petabyte data warehouse with 17 trillion (!) rows (that's a lot of Canned Cloud - read down the page a bit).
Turning to the Amazon event, four Amazon customers presented and discussed their use of cloud computing (my discussion of the following is from notes and memory, as the slides are not yet available). One company was ShareThis, which allows people to share interesting content with friends and colleagues. ShareThis keeps track of all the share events (one might think of these as transactions - Person A shares Content X with Person B, so there are three data elements to keep track of and aggregate for statistics). The number of events ShareThis has in its data store is mind-boggling; it uses Amazon SimpleDB to track all of them.
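As a rough illustration of what "three data elements to keep track of and aggregate" might look like, here is a small in-memory Python sketch. It is not ShareThis's implementation and does not touch SimpleDB; the ShareEvent record, its field names, and the summarize function are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class ShareEvent(NamedTuple):
    """One share: who shared, what was shared, and with whom (names assumed)."""
    sharer: str
    content: str
    recipient: str


def summarize(events):
    """Aggregate simple statistics over a collection of share events."""
    events = list(events)
    return {
        "total_events": len(events),
        "shares_per_content": Counter(e.content for e in events),
        "shares_per_sharer": Counter(e.sharer for e in events),
    }


if __name__ == "__main__":
    events = [
        ShareEvent("person_a", "content_x", "person_b"),
        ShareEvent("person_a", "content_y", "person_c"),
        ShareEvent("person_d", "content_x", "person_a"),
    ]
    stats = summarize(events)
    print(stats["shares_per_content"].most_common(1))
```

Each record is tiny and simple, which is why a key/value-style store is a plausible fit; the difficulty comes from the sheer number of records, not from their structure.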
The second company was really interesting - Pathwork Diagnostics, a biotech firm that uses Amazon to evaluate oncology diagnostics. The presenter said that they run large Hadoop-based queries on 240 Amazon EC2 instances for a couple of days, and then shut them down. This is another instance of big data that could not be easily processed in traditional fashion.
Next up was SmugMug, which offers a photo upload and sharing service. Another big data story, but with a twist. SmugMug's data challenge is not a searching issue; rather it is a physical capacity issue. Digital photos aren't necessarily that large, but in the quantities that SmugMug deals with, total storage requirements are immense. SmugMug relies on Amazon's S3 service for storage. The presenter, a former data center ops guy, was asked if he didn't miss having his own equipment. He seemed to sigh for a moment, reminiscing (it appeared to me) about the good old days of racks of equipment, then quickly shuddered when he thought about how much equipment would be necessary. He also noted that, for a self-funded company, investing in large amounts of equipment would be cost-prohibitive.
The last speaker, but certainly not the least, represented Netflix. Yes, the DVD-by-mail company. Except it isn't just DVDs-by-mail anymore. You can view a significant part of the Netflix inventory online - and Amazon is used as part of that process. Every digital video object must be encoded to run in the Netflix viewer, since the native format is neither supported nor secure for remote viewing. Netflix leverages Amazon EC2 instances to perform that encoding.
A couple of questions were posed at the Amazon event regarding the cost of running on the Amazon service. Since a common criticism of external clouds is that they must be more expensive than internal data center resources, the questions make sense. Because several of these companies are operating at very large scale, one might think that whatever crossover point exists at which internal resources become less expensive than outside resources must have been crossed. According to the presenters, using Amazon still made economic sense, even at the scale of computing they were implementing. For SmugMug, attempting to obtain and manage the resources necessary to store all the digital assets under management would be prohibitive (the presenter said that SmugMug only has around fifteen employees, so employing enough people to manage enough hardware to store all the assets isn't possible in any reasonable economic scenario). The presenter from Pathwork noted that the alternative for his company wouldn't be 240 servers, since they couldn't afford them; it would be two or four servers. The tradeoff would be that the jobs would take weeks instead of hours, which for a startup seeking competitive advantage is unacceptable.
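The Pathwork tradeoff can be sketched with some back-of-the-envelope arithmetic. The Python below is only an illustration under stated assumptions: it takes "a couple of days" to mean 48 hours, assumes the work divides perfectly evenly across servers (real jobs rarely scale that cleanly), and uses made-up function names. No prices are included, since none were given at the event.

```python
HOURS_PER_WEEK = 24 * 7


def instance_hours(instances, hours_each):
    """Total compute consumed, in instance-hours."""
    return instances * hours_each


def wall_clock_hours(total_instance_hours, servers):
    """Elapsed time if the same work is spread evenly over fewer servers.

    Assumes perfect linear scaling, which is an idealization.
    """
    return total_instance_hours / servers


if __name__ == "__main__":
    # 240 instances for "a couple of days", assumed here to be 48 hours.
    work = instance_hours(240, 48)
    for servers in (240, 4, 2):
        hours = wall_clock_hours(work, servers)
        print(f"{servers:>3} servers: {hours:>6.0f} hours "
              f"({hours / HOURS_PER_WEEK:.1f} weeks)")
```

Under these assumptions the same job on four servers would run for roughly 17 weeks, and on two servers about twice that. Note also that with pay-per-hour pricing the total instance-hours, and so the rental bill, is the same whether the work is done on 240 instances quickly or on 4 slowly; the cloud lets a small company buy the speed without buying the hardware.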
The applications described in this piece represent products designed to take advantage of cloud computing characteristics. Instead of people creating the same type of apps they would have done in the resource-constrained world of an internal data center, they leverage the scalability and ability to shut off capacity when it's no longer required. This approach - creating applications designed for cloud environments instead of hosting data center-oriented apps in a cloud environment - is what we mean when we discuss with clients the importance of creating cloud applications, not just putting applications in the cloud. When thinking about the cloud, it's important not to consider it as just a data center with a different IP address; it provides the opportunity for computing-based innovation.
Bernard Golden is CEO of consulting firm HyperStratus, which specializes in virtualization, cloud computing and related issues. He is also the author of "Virtualization for Dummies," the best-selling book on virtualization to date.
Cloud Computing Seminars

HyperStratus is offering three one-day seminars. The topics are:

1. Cloud fundamentals: key technologies, market landscape, adoption drivers, benefits and risks, creating an action plan
2. Cloud applications: selecting cloud-appropriate applications, application architectures, lifecycle management, hands-on exercises
3. Cloud deployment: private vs. public options, creating a private cloud, key technologies, system management

The seminars can be delivered individually or in combination. For more information, see http://www.hyperstratus.com/pages/training.htm