The focus on big data is amusing to many veterans of data storage systems.
While big data systems offer flexibility at scale at far more affordable prices than ever before, long-time data experts are quick to point out that these newfangled big data systems rely on techniques and processes from the data warehousing hype that garnered IT's attention years ago.
Same data warehouse, then, just a new package.
This is not to say there haven't been advances. Data warehousing has moved far beyond "traditional" relational database management systems (RDBMS) on scales that really no one could call anything but "big."
For those new to this latest iteration of data warehousing, here's a look at the various tools that are prevalent in the big data marketplace today.
One aspect that makes non-relational, or NoSQL, databases unique is the independence from Structured Query Language (SQL) found in relational databases. Relational databases all use SQL as the domain-specific language for ad-hoc queries, while non-relational databases have no such standard query language, so they can use whatever they want -- including SQL. Non-relational databases also have their own APIs, designed for maximum scalability and flexibility.
NoSQL databases are typically designed to excel in one specific area: speed. To do so, they will use techniques that will seem frightening to relational database users -- such as not promising that all data is consistent within a system all of the time.
Because so much read and write activity is needed in a single relational database transaction, a relational database could never keep up with the speed and scale necessary to make a company like Amazon work. Instead, Amazon's proprietary non-relational Dynamo database applies an "eventually consistent" approach to data in order to maintain speed and uptime for the system when a database server somewhere goes down.
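To make the "eventually consistent" idea concrete, here is a minimal sketch in Python. It is an illustration of the general last-write-wins reconciliation pattern, not Amazon's actual Dynamo design; the `Replica` class and its explicit version stamps are invented for the example.

```python
class Replica:
    """One node in a toy eventually consistent key-value store.
    Each value carries a version stamp; reconciliation keeps the
    highest stamp (last-write-wins), so replicas that diverged
    while partitioned converge once they sync."""
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def put(self, key, value, version):
        self.data[key] = (version, value)

    def get(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_with(self, other):
        # Anti-entropy pass: compare entries, keep the newer one on both sides.
        for key in set(self.data) | set(other.data):
            newest = max(self.data.get(key, (0, None)),
                         other.data.get(key, (0, None)))
            self.data[key] = newest
            other.data[key] = newest

# Two replicas accept writes independently, then converge on sync.
r1, r2 = Replica(), Replica()
r1.put("cart", ["book"], version=1)
r2.put("cart", ["book", "lamp"], version=2)  # later write on another node
r1.sync_with(r2)
# Both replicas now read ["book", "lamp"]
```

The point of the trade-off is visible here: between the two `put` calls and the `sync_with` call, the two nodes disagree about the cart's contents, and the system simply tolerates that window in exchange for staying writable.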
Dynamo is part of a class of non-relational databases known as distributed key-value store (DKVS) databases. DKVS is one of five classes that comprise the topology of the NoSQL landscape, each with a different architecture and approach to managing data.
DKVS databases, also known as eventually consistent key-value store databases, are specifically designed to deal with data spread out over a large number of servers. These systems use distributed hash tables for their key-value stores, and because they're distributed, the database uses peer-to-peer relationships between servers, with no "master" control. Currently most of the databases in this class are Dynamo or Dynamo-based implementations, such as the open source Project Voldemort, Dynomite, and KAI databases.
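A distributed hash table needs a way for every peer to agree, without a master, on which server owns a given key. One common technique is a consistent-hashing ring, sketched below in Python; this is an illustration of the general idea, not any particular product's implementation, and the `HashRing` class and node names are invented for the example.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hashing ring of the kind a DKVS can use
    to decide which peer owns a key. Every peer hashes the same
    way, so all of them compute the same owner independently."""
    def __init__(self, nodes):
        # Place each node on the ring at the hash of its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise from the key's hash to the first node,
        # wrapping around the end of the ring.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, chr(0x10FFFF)))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:1001")  # every peer computes the same owner
```

The attraction of this layout is that adding or removing a server only moves the keys adjacent to it on the ring, rather than reshuffling everything.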
Key-value store (KVS) databases are similar in architecture to DKVS, as the name would imply, where keys are mapped to values. Instead of being distributed across servers, data is held on disk or in RAM. Redis, an open source database that's currently being funded by VMware, is in the KVS family, as are the Berkeley DB and MemcacheDB databases.
Imagine, if you can, a single, giant database table, with embedded tables of data found within. That gives you a fair mental picture of the architecture found within a column-oriented store. Google's BigTable is a well-known example of this class of NoSQL database. Hadoop, Cloudera, and Cassandra are also in this class of data storage system.
Some non-relational databases move away from the table/row/column methodology and instead store and sort entire documents' worth of data. These are the (predictably named) document-oriented store databases. MongoDB and CouchDB are part of this class, using schemaless JSON-style objects-as-documents to store information, as opposed to the more commonly used XML documents.
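The schemaless idea is easier to see in code. The toy Python store below holds JSON-style objects and queries them by matching fields rather than fixed columns; it is a sketch of the document-store model, not the MongoDB or CouchDB API, and the `DocumentStore` class is invented for the example.

```python
class DocumentStore:
    """Toy document-oriented store: schemaless JSON-style objects,
    queried by matching fields rather than fixed table columns."""
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        # No schema: each document can carry whatever fields it likes.
        self.docs.append(doc)

    def find(self, query):
        # Return every document whose fields match the query dict.
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

store = DocumentStore()
store.insert({"name": "Ada", "city": "London"})
store.insert({"name": "Grace", "city": "Arlington", "rank": "admiral"})
matches = store.find({"city": "London"})  # [{'name': 'Ada', 'city': 'London'}]
```

Note that the second document carries a `rank` field the first one lacks; in a relational table that would require a schema change, while here it is simply another attribute on one document.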
Finally, there is the graph-oriented store class of NoSQL database. Data is manipulated in an object-oriented architecture, using graphs to map keys, values, and their relationships to each other, instead of just tables. Neo4j is an open source database in this class, as are HyperGraphDB and Bigdata.
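A minimal Python sketch shows what "graphs instead of tables" means in practice: data lives as nodes connected by typed relationships, and queries traverse those relationships. This illustrates the graph-store model in general, not Neo4j's API; the `GraphStore` class and relationship names are invented for the example.

```python
class GraphStore:
    """Toy graph-oriented store: nodes joined by typed relationships
    rather than rows joined through foreign keys."""
    def __init__(self):
        self.edges = []  # (source, relationship, target) triples

    def relate(self, source, rel, target):
        self.edges.append((source, rel, target))

    def neighbors(self, source, rel):
        # Traverse one hop: all targets reached from source via rel.
        return [t for s, r, t in self.edges if s == source and r == rel]

g = GraphStore()
g.relate("alice", "FRIEND_OF", "bob")
g.relate("alice", "FRIEND_OF", "carol")
g.relate("bob", "WORKS_AT", "acme")
friends = g.neighbors("alice", "FRIEND_OF")  # ['bob', 'carol']
```

Where a relational database would answer "who are Alice's friends?" with a join across tables, the graph model answers it by following edges directly, which is why this class suits highly connected data.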
It may seem as if all of these databases are, in and of themselves, able to stand alone as full-fledged databases. In terms of storing data, they most certainly are capable tools. But there's a big difference between storing data and finding and analyzing data.
Relational database systems have this functionality included as part of their overall capabilities, but non-relational systems must be combined with additional tools in order to process data, and turn it into information you can use.
Currently, two methodologies are dominating the NoSQL landscape: MapReduce and enterprise search.
MapReduce is currently the shiny new thing when it comes to big data, and a lot of big data technology relies on this programming interface.
The essential idea of MapReduce is to use two functions: Map(), which grabs data from a source, and Reduce(), which then processes that data across multi-core systems. Specifically, Map() applies a function to all the members of a dataset and posts a result set, which Reduce() then collates and resolves. Map() and Reduce() can be run in parallel and across multiple systems, which lends a lot of power to the processing of data.
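The classic teaching example of this pattern is a word count, sketched below in plain Python. It shows the two-phase shape described above on a single machine; a real framework such as Hadoop would run many map and reduce tasks in parallel across a cluster, and the function names here are invented for the illustration.

```python
from collections import defaultdict

def map_phase(documents):
    """Map(): apply a function to every member of the dataset,
    emitting a (key, value) pair for each word seen."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce(): collate the emitted pairs and resolve each key
    to a single result -- here, a total count per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["big data big ideas", "big systems"]
counts = reduce_phase(map_phase(docs))
# {'big': 3, 'data': 1, 'ideas': 1, 'systems': 1}
```

Because each document can be mapped independently, and the pairs for each key can be reduced independently, both phases parallelize naturally; that independence is what the frameworks exploit.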
Perhaps the best-known example of a MapReduce-based system is Hadoop, which uses MapReduce in combination with the Hadoop Distributed File System to store data effectively. It is important to note that Hadoop, along with the Cloudera commercial implementation and the Apache Cassandra system, are also members of the column-oriented store class of databases. This is no contradiction: the method of data store is separate from the distributed method of data processing. In fact, even though CouchDB lies in the document-storage class of data storage tools, it also uses MapReduce for data processing, just as Aster uses MapReduce atop the relational PostgreSQL database.
While MapReduce-related tools are enjoying their moment in the sun, there are some known drawbacks to using MapReduce. Writing the algorithms for multi-core data processing can be very complicated, and because the technology is relatively new, many of these query routines have to be hand-coded.
As such, these processing algorithms can be extremely inefficient, requiring lots of processing hardware and storage space to complete a task. The good news is that since most MapReduce systems are based on open source technology, they can be run on commodity systems that are cheap and easy to add to your environment.
And the algorithm sets are getting better; Pig's Pig Latin query language can dive into Hadoop storage systems in small steps to create data flows. Hive's HiveQL does much the same thing, though it's similar enough to SQL to be more familiar to data analysts.
Enterprise search products, such as ElasticSearch, Apache Lucene, and Apache Solr, use a concept called facets that enable you to treat data within documents as you would fields within a relational database. Facets are essentially inverted indexes that let you find specific pieces of information in a document, like an address or other customer information.
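The inverted index underlying those facets can be sketched in a few lines of Python. Each term maps back to the IDs of the documents containing it, so a lookup by term is immediate; this illustrates the data structure in general, not Lucene's implementation, and the sample documents are invented for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Build a tiny inverted index: each term maps to the set of
    document IDs containing it, so term lookups need no scan."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "Order shipped to Springfield",
    2: "Customer address updated in Springfield",
    3: "Order cancelled",
}
index = build_index(docs)
hits = index["springfield"]  # IDs of documents mentioning the term: {1, 2}
```

Faceting builds on the same structure: counting the documents behind each term in the index is what lets a search product report, say, how many records match each city or status value without rescanning the documents.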
Enterprise search is ideal if you have a large set of these types of documents to cull through, and need to do some straightforward data mining or business intelligence analysis. The more structured the data, the better: enterprise search does particularly well with documents like weblogs, which are structured uniformly enough to enable deeper data mining.
The Big Data vendors
Now that you have an idea of the various technologies that currently make up the Big Data sector, you will have better context for how the players fit within it.
The list below is by no means comprehensive, but is meant to be a jumping off point to identify the key players in big data, and what products and services they offer.
Vendor: 1010data Location: New York, NY Products/Services: Hosted analytical platform for big data, using big table-type data structures for consolidation and analysis.
Vendor: 10gen Location: New York, NY; Palo Alto, CA; London, Great Britain; Dublin, Ireland Products/Services: Commercial support and services for MongoDB.
Vendor: Acxiom Location: Various global locations Products/Services: Data analytics and processing, with an emphasis on marketing data and services.
Vendor: Amazon Web Services Location: Global Products/Services: Provider of cloud-based database, storage, processing, and virtual networking services.
Vendor: Aster Data Location: San Carlos, CA Products/Services: Data analytic services using Map/Reduce technology.
Vendor: Calpont Location: Frisco, TX Products/Services: InfiniDB Enterprise, a column-oriented database that also provides massively parallel processing capabilities.
Vendor: Cloudera Location: Palo Alto and San Francisco, CA Products/Services: Distributor of commercial implementation of Apache Hadoop, with services and support.
Vendor: Couchbase Location: Mountain View, CA Products/Services: Commercial sponsor of the Couchbase Server Map/Reduce-oriented database, as well as Apache CouchDB and memcached.
Vendor: Datameer Location: San Mateo, CA Products/Services: Data visualization services for Apache Hadoop data stores.
Vendor: DataSift Location: San Francisco, CA; Reading, United Kingdom Products/Services: Social media data analytical services. Licensed re-syndicator of Twitter.
Vendor: DataStax Location: San Mateo, CA; Austin, TX Products/Services: Distributor of commercial implementation of Apache Cassandra, with services and support.
Vendor: Digital Reasoning Location: Franklin, TN Products/Services: Synthesys, a hosted and local business intelligence data analytics tool.
Vendor: EMC Location: Various global locations Products/Services: Makers of Greenplum, a massively parallel processing data store/analytics solution.
Vendor: Esri Location: Various global locations Products/Services: GIS data analytical services.
Vendor: FeedZai Location: United Kingdom Products/Services: FeedZai Pulse, a real-time business intelligence appliance.
Vendor: Hadapt Location: Cambridge, MA Products/Services: Data analytic services for Apache Hadoop data stores.
Vendor: Hortonworks Location: Sunnyvale, CA Products/Services: Distributor of commercial implementation of Apache Hadoop, with services and support.
Vendor: HPCC Systems Location: Alpharetta, GA Products/Services: HPCC (High Performance Computing Cluster), an open source massively parallel processing database.
Vendor: IBM Location: Various global locations Products/Services: Hardware; data analytical services; and DB2, a massively parallel processing database.
Vendor: Impetus Location: San Jose, CA; Noida, India; Indore, India; Bangalore, India Products/Services: Data analytic and management services for Apache Hadoop data stores.
Vendor: InfoBright Location: Toronto, ON; Dublin, Ireland; Chicago, IL Products/Services: InfoBright, a column store database, with services and support.
Vendor: Jaspersoft Location: Various global locations Products/Services: Data analytic services for Apache Hadoop data stores.
Vendor: Karmasphere Location: Cupertino, CA Products/Services: Data analytic and development services for Apache Hadoop data stores.
Vendor: Lucid Imagination Location: Redwood City, CA Products/Services: Distributor of commercial implementation of Apache Lucene and Apache Solr, with services and support. Provider of LucidWorks enterprise search software.
Vendor: MapR Technologies Location: San Jose CA; Hyderabad, India Products/Services: Distributor of commercial implementation of Apache Hadoop, with services and support.
Vendor: MarkLogic Location: Various global locations Products/Services: Data analytic and visualization services.
Vendor: Netezza Corp. Location: Various global locations Products/Services: Massively parallel processing data appliances, analytic services.
Vendor: Oracle Location: Various global locations Products/Services: Various hardware and software offerings, including Big Data Appliance, MySQL Cluster, Exadata Database Machine.
Vendor: ParAccel Location: Campbell, CA; San Diego, CA; Wokingham, United Kingdom Products/Services: Data analytics using column-store technology.
Vendor: Pentaho Location: Various global locations Products/Services: Data analytic services for Apache Hadoop data stores.
Vendor: Pervasive Software Location: Austin, TX Products/Services: Data analytic services for Apache Hadoop data stores based on Hive.
Vendor: Platform Computing Location: Various global locations Products/Services: Distributor of commercial implementation of Apache Hadoop, with services and support.
Vendor: Rackspace Location: Global Products/Services: Provider of cloud-based database, storage, and processing services.
Vendor: Revolution Analytics Location: Palo Alto, CA; Seattle, WA Products/Services: Data analytic and visualization services using R-based software.
Vendor: Splunk Location: Various global locations Products/Services: Data analytic and visualization services using logging-oriented software.
Vendor: Tableau Software Location: Seattle, WA; Kirkland, WA; San Mateo, CA; Surrey, United Kingdom; Paris, France Products/Services: Business intelligence and data analytic software.
Vendor: Talend Location: Various global locations Products/Services: Database management software.
Vendor: Teradata Location: Miamisburg, OH Products/Services: Database management software.
Vendor: Vertica Systems Location: Billerica, MA Products/Services: Data analytics using column-store based technologies.
The big data landscape is definitely marked with a lot of different systems and tools, but the one common thread among all of them is the capability to process a lot of data, and to do it very quickly.
Whether you call it data warehousing or big data, the reality is that data is driving business, and the sooner your organization can make use of the tools that can best handle data, the faster your business should grow.