February 21, 2012, 2:27 PM — The focus on big data is amusing to many veterans of data storage systems.
While big data systems offer flexible scaling at far more affordable prices than ever before, long-time data experts are quick to point out that these newfangled systems rely on techniques and processes that were part of the data warehousing hype that once garnered IT's attention.
Same data warehouse, then, just a new package.
This is not to say there haven't been advances. Data warehousing has moved far beyond "traditional" relational database management systems (RDBMS) on scales that really no one could call anything but "big."
For those new to this latest iteration of data warehousing, here's a look at the various tools that are prevalent in the big data marketplace today.
Database classes
One aspect that makes non-relational, or NoSQL, databases unique is their independence from the Structured Query Language (SQL) used by relational databases. Relational databases all use SQL as the domain-specific language for ad-hoc queries, while non-relational databases have no such standard query language, so they can use whatever they want -- including SQL. Non-relational databases also have their own APIs, designed for maximum scalability and flexibility.
NoSQL databases are typically designed to excel in one specific area: speed. To do so, they will use techniques that will seem frightening to relational database users -- such as not promising that all data is consistent within a system all of the time.
Because so much read and write activity is needed in a single relational database transaction, a relational database could never keep up with the speed and scaling necessary to make a company like Amazon work. What Amazon does with its proprietary non-relational Dynamo database is apply an "eventually consistent" approach to its data, gaining speed and keeping the system up when a database server somewhere goes down.
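To make "eventually consistent" concrete, here is a minimal Python sketch of two replicas that briefly disagree and then converge. It is not Dynamo's actual mechanism: the Replica class, the sync function, and the simple version numbers are hypothetical stand-ins chosen for illustration, and "newest version wins" is an assumed simplification of real conflict handling.

```python
class Replica:
    """One copy of the data; stores key -> (version, value). Hypothetical toy class."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Keep the write only if it is newer than what this replica already holds.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so the two replicas converge."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


east, west = Replica("east"), Replica("west")
east.write("cart:1234", ["book"], version=1)
sync(east, west)

# Pretend "west" is unreachable: "east" accepts the write on its own.
east.write("cart:1234", ["book", "lamp"], version=2)
stale = west.read("cart:1234")   # "west" still serves the older cart

sync(east, west)                 # "west" comes back and catches up
fresh = west.read("cart:1234")
```

The write succeeds even though one server is down, which is the uptime gain; the cost is that a read from the lagging replica can return old data until the next sync.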
Dynamo is part of a class of non-relational databases known as distributed key-value store (DKVS) databases. DKVS is one of five classes that comprise the topology of the NoSQL landscape, each with a different architecture and approach to managing data.
DKVS databases, also known as eventually consistent key-value store databases, are specifically designed to deal with data spread out over a large number of servers. These systems use distributed hash tables for their key-value stores, and because they're distributed, the database uses peer-to-peer relationships between servers, with no "master" control. Currently, most of the databases in this class are either Dynamo itself or Dynamo-based implementations, such as the open source Project Voldemort, Dynomite, and KAI databases.
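One common way to build a distributed hash table with no master is a hash ring, where every server can work out on its own which peers hold a given key. The Python sketch below assumes that approach; the HashRing class, the server names, and the replica count are hypothetical, and production systems add refinements that are left out here.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a point on the ring using MD5 (stable across runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy hash ring: every node is a peer and there is no master."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Sorted list of (ring position, node name); node names are assumed unique.
        self._ring = sorted((ring_position(n), n) for n in nodes)

    def nodes_for(self, key):
        """Return the distinct nodes responsible for storing `key`."""
        positions = [pos for pos, _ in self._ring]
        # First node clockwise from the key's position, wrapping around the ring.
        start = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["server-a", "server-b", "server-c"], replicas=2)
owners = ring.nodes_for("cart:1234")
```

Because the lookup depends only on the key and the list of servers, any peer given the same list computes the same owners without asking a central coordinator.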
Key-value store (KVS) databases are, as the name implies, similar in architecture to DKVS databases: keys are mapped to values. Instead of being distributed across servers, however, the data is held on disk or in RAM. Redis, an open source database that's currently being funded by VMware, is in the KVS family, as are the Berkeley DB and MemcacheDB databases.
Imagine, if you can, a single, giant database table, with embedded tables of data found within. That gives you a fair mental picture of the architecture found within a column-oriented store. Google's BigTable is a well-known example of this class of NoSQL database. HBase, the Hadoop database that also ships in Cloudera's Hadoop distribution, and Cassandra are also in this class of data storage system.
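The "table with embedded tables" picture can be sketched in Python as nested dictionaries: a row key points to groups of columns, and each group is a small table of its own. The put and get_family helpers and the sample rows are hypothetical, and this models only the shape of the data, not how any of these systems store it on disk.

```python
# row key -> column group -> column name -> value
table = {}


def put(row_key, family, column, value):
    """Store one cell, creating the row and column group on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


def get_family(row_key, family):
    """Return a copy of one embedded table, or an empty dict if it is absent."""
    return dict(table.get(row_key, {}).get(family, {}))


put("user:1", "profile", "name", "Ada")
put("user:1", "profile", "city", "London")
put("user:1", "orders", "2012-02-21", "book")
put("user:2", "profile", "name", "Lin")   # no city, no orders: rows can be sparse

profile = get_family("user:1", "profile")
no_orders = get_family("user:2", "orders")
```

Rows do not have to share the same columns, which is one reason this layout suits very wide, sparsely filled tables.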
Some non-relational databases move away from the table/row/column methodology and store and sort entire documents' worth of data. These are the (predictably named) document-oriented store databases. MongoDB and CouchDB are part of this class, using schemaless JSON-style objects-as-documents to store information as opposed to the more commonly used XML documents.
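As a rough illustration of schemaless, JSON-style documents, the Python sketch below keeps a list of documents with different fields and filters them by field value. The insert and find helpers are hypothetical and are not the MongoDB or CouchDB APIs; they only show the idea of querying documents that share no fixed schema.

```python
import json

collection = []


def insert(document):
    """Store a JSON-safe copy; documents need not share the same fields."""
    collection.append(json.loads(json.dumps(document)))


def find(**criteria):
    """Return documents whose top-level fields match every given value."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


insert({"type": "post", "title": "Hello", "tags": ["intro"]})
insert({"type": "post", "title": "Big data", "author": {"name": "Ada"}})
insert({"type": "comment", "text": "Nice", "post": "Hello"})

posts = find(type="post")
titles = [doc["title"] for doc in posts]
```

The two posts carry different fields (one has tags, the other a nested author) and still live side by side in the same collection.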
Finally, there is the graph-oriented store class of NoSQL database. Data is manipulated in an object-oriented architecture, using graphs to map keys, values, and their relationships to each other, instead of just tables. Neo4j is an open source database in this class, as are HyperGraphDB and Bigdata.
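A small Python sketch can show why relationships are first-class in this model. The add_node, relate, and neighbors helpers and the sample data are hypothetical and are not the API of Neo4j or any other product; they only illustrate following labeled relationships from node to node instead of joining tables.

```python
from collections import defaultdict

nodes = {}                  # node id -> properties (keys and values)
edges = defaultdict(list)   # node id -> list of (relationship, target id)


def add_node(node_id, **properties):
    nodes[node_id] = properties


def relate(source, relationship, target):
    edges[source].append((relationship, target))


def neighbors(node_id, relationship):
    """Follow one kind of relationship outward from a node."""
    return [target for rel, target in edges.get(node_id, []) if rel == relationship]


add_node("alice", kind="customer")
add_node("bob", kind="customer")
add_node("book-42", kind="product", title="Dune")
relate("alice", "FRIEND_OF", "bob")
relate("bob", "BOUGHT", "book-42")

# Two-hop traversal: what have Alice's friends bought?
suggestions = [
    nodes[product]["title"]
    for friend in neighbors("alice", "FRIEND_OF")
    for product in neighbors(friend, "BOUGHT")
]
```

Questions like "what did my friends buy?" become a short walk along relationships, which is the kind of query this class of database is built around.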
Data-processing methodologies
It may seem as if all of these databases are, in and of themselves, able to stand alone as full-fledged databases. In terms of storing data, they most certainly are capable tools. But there's a big difference between storing data and finding and analyzing data.
Relational database systems include this functionality as part of their overall capabilities, but non-relational systems must be combined with additional tools in order to process data and turn it into information you can use.
Currently, two methodologies are dominating the NoSQL landscape: MapReduce and enterprise search.














