First and foremost, he says, the Hadoop platform is defined by its scalability. It works just fine on small datasets stored in-memory, but is capable of scaling massively to handle huge datasets.
"A big component of scalability that we don't hear a lot talked about is affordability," he says. "We run on commodity hardware because it allows you to scale further. If you can buy 10 times the amount of storage per dollar, then you can store 10 times the amount of data per dollar. So affordability is key, and that's why we use commodity hardware, because it is the most affordable platform."
Just as important, he notes, Hadoop is open source.
"Similarly, open source software is very affordable," he adds. "The core platform that folks develop their applications against is free. You may pay vendors, but you pay vendors for the value they deliver, you don't keep paying them year after year even though you're not getting anything fundamentally new from them. Vendors need to earn your trust and earn your confidence by providing you with value over time."
Beyond that, he says, there are what he considers elements of Hadoop's style.
"There's this notion that you don't need to constrain your data with a strict schema at the time you load it," he says. "Rather, you can afford to save your data in a raw form and then, as you use it, project it to various schemas. We call this schema on read.
Another popular theme in the big data space is that oftentimes simply having more data is a better way to understand your problem than to have a more clever algorithm. It's often better to spend more time gathering data than to fine-tune your algorithm on a smaller data set. Intuitively, this is much like having a higher-resolution image. If you're going to try to analyze it, you'd rather zoom in on the high-resolution image than the low-resolution image."
HBase Is an Example of Online Computing in Hadoop
Batch processing, he notes, is not a defining characteristic of Hadoop. As proof he points to Apache HBase, the highly successful open source, nonrelational distributed database-modeled on Google's BigTable-that is part of the Hadoop stack. HBase is an online computing system, not a batch computing system.