June 09, 2011, 8:30 AM
Source: Phil Hawksworth/Flickr
As companies use the Web to build new applications, and as the amount of data those applications generate increases, they are reaching the limits of traditional relational databases. A set of alternatives, grouped under the umbrella label NoSQL (for not only SQL), has become more popular, and a number of notable users, including social networking giants Facebook and Twitter, are leading the way in this arena.
Some of the datasets are enormous: for example, when Visa was looking to process two years' worth of credit card transactions -- some 70 billion of them -- it turned to a NoSQL solution and was able to cut its processing time from a solid month using traditional relational solutions to just 13 minutes.
So what is NoSQL and why should you bother? SQL (for Structured Query Language) has been around for decades and has plenty of benefits. It's great for centrally managed database schemas, easy ad hoc queries, and data that can be indexed and normalized. There are lots of integrated SQL development environments, reporting tools, and ways to extract, import and transform SQL databases, and SQL has been taught to thousands of software engineers. Most SQL databases can easily run on common Intel hardware with RAM and disk storage in reasonable amounts (250 GB and 1 TB, respectively, are about right).
But SQL also has its limits, especially when you consider how modern Web apps are built. They have to perform and scale well, and handle large collections of documents and odd kinds of data -- and none of that plays to SQL's strengths. Think of your standard SQL database as a series of tables. Each row is a data record and the columns are the fields of each record. This works well if the fields are all somewhat similar in length and data type, such as the fields of a typical customer address record. But it falls apart when you have large blobs of data, such as a document that has to be attached to a particular record, or a comment field that is open-ended and can contain hundreds or thousands of characters.
SQL also falls apart when a particular application needs to scale up quickly -- if, for instance, it needs to keep track of sales for a Web store that is featured on a TV commercial or the front page of a popular search engine, or deal with terabytes of credit card purchases. In those situations, you need to add processing power and storage quickly, or the database won't be able to handle the queries.
Scaling when needed
This is where NoSQL plays well. It can scale up quickly, works well with the Web and other online programming environments, and can be made fault-tolerant and flexible enough to accommodate changes to your database schema or repartitioning of your data.
A good example is the set of databases behind the Twitter service. Twitter users generate more than 12 TB of data every day. Even if Twitter used the fastest disk drives, it would take more than 40 hours to write a single day's worth of this information to one drive. The company ended up using Hadoop, an open source framework built around a distributed file system with automatic replication and fault tolerance. (Visa also used Hadoop in the scenario discussed above, and Yahoo has a 4,000-node Hadoop cluster.) Hadoop allows Twitter to distribute the load involved in writing all these Tweets across a lot of inexpensive servers.
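The 40-hour figure is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a sustained write speed of about 80 MB/s for a single drive and a cluster of 100 drives writing in parallel; both numbers are illustrative assumptions, not figures from Twitter.

```python
# Back-of-the-envelope check of the write-time claim.
# ASSUMPTIONS (not from the article): 80 MB/s sustained write speed per drive,
# and a hypothetical cluster of 100 drives that share the load evenly.
TB = 10**12  # bytes in a terabyte (decimal)
MB = 10**6   # bytes in a megabyte (decimal)


def hours_to_write(total_bytes, bytes_per_second, drives=1):
    """Return the hours needed to write total_bytes across `drives` drives."""
    seconds = total_bytes / (bytes_per_second * drives)
    return seconds / 3600


daily_bytes = 12 * TB
assumed_drive_speed = 80 * MB

single_drive_hours = hours_to_write(daily_bytes, assumed_drive_speed)
cluster_hours = hours_to_write(daily_bytes, assumed_drive_speed, drives=100)

print(f"One drive:  {single_drive_hours:.1f} hours")
print(f"100 drives: {cluster_hours * 60:.0f} minutes")
```

Under these assumptions a single drive falls behind by more than a day and a half for every day of data, while spreading the same writes over many cheap machines brings the job down to minutes.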
NoSQL comes in a variety of shapes and sizes, and includes more than a dozen different open source projects that can handle a variety of circumstances.
"These models let the developer start working with their data without committing to anywhere near as much up-front structure as a relational database. On initial usage, that makes them feel much lighter-weight and agile from a developer point of view," says Alex Miller, a St. Louis-based developer and organizer of a popular conference called "Strange Loop" that features many of the NoSQL luminaries from Twitter, Amazon, and Google.
Which is right for you?
"The important thing is to consider your data use case and find the most appropriate technology to match it," says Miller. Gary Nakamura, the VP of worldwide sales for Terracotta, one of the more popular NoSQL solutions, puts it this way: "Sometimes SQL is overkill for what you are trying to do. You don't need a query language, just a key with a particular value, and that could be a lot simpler and faster."
Developers need to break down their database needs into various components. Most importantly, you need to figure out whether you're most concerned with database reads or writes, or whether you care about both equally. Most SQL developers don't really think about how many of their operations read or write data, because their databases are built to handle consistent records and transactions. With NoSQL, you don't necessarily get read/write consistency: a value that has just been written may not be visible to every reader right away. Think of a blog that takes comment posts. You don't need to post a comment immediately, and, since a comment is mostly text, you don't really need a relational database to handle the organization of comments. Instead, you can batch up the past hour's or day's comments and process them all at once, making your operations -- and your blog code base -- more streamlined, too. "There are different versions of NoSQL that are optimized for read-only or for writing data. All of them trade off referential or transaction integrity for performance," says Bob Matsuoka, principal consultant at MokaMedia Partners in New York City, who has used Lucene, CouchDB, and MongoDB among other NoSQL solutions.
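The comment-batching idea can be sketched in a few lines. This is a minimal illustration, not any particular product's API: the CommentBatcher class, its max_batch parameter, and the plain dictionary standing in for the data store are all hypothetical.

```python
class CommentBatcher:
    """Buffers incoming comments and writes them to storage in batches.

    `store` is a hypothetical stand-in for a NoSQL store: a dict that maps
    a post ID to the list of comments already saved for that post.
    """

    def __init__(self, store, max_batch=100):
        self.store = store
        self.max_batch = max_batch
        self.pending = []  # comments accepted but not yet written

    def submit(self, post_id, author, text):
        """Accept a comment; it becomes visible only after the next flush."""
        self.pending.append({"post_id": post_id, "author": author, "text": text})
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        """Write all pending comments in one pass and return how many were written."""
        batch, self.pending = self.pending, []
        for comment in batch:
            self.store.setdefault(comment["post_id"], []).append(comment)
        return len(batch)


store = {}
batcher = CommentBatcher(store, max_batch=3)
batcher.submit("post-1", "ann", "Nice article")
batcher.submit("post-1", "bob", "Agreed")
visible_before_flush = len(store.get("post-1", []))  # nothing written yet
written = batcher.flush()  # e.g. called from an hourly job
visible_after_flush = len(store["post-1"])
```

The trade-off is the one described above: a reader may not see a comment the instant it is submitted, but the store handles fewer, larger writes.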
What is important is that NoSQL and SQL aren't mutually exclusive, and indeed many shops will use both and combine the best of both worlds. "You need to use the application layer to stitch the two together for best results," says Steve Holzinger, who works for Scholastic Education in Boston and is also a MongoDB user.
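As a rough sketch of what stitching the two together in the application layer might look like, the example below keeps structured customer fields in a relational table (using Python's built-in sqlite3 module) and open-ended notes in a key-value store. The dictionary is only a stand-in for a real NoSQL database, and the table layout, key format, and load_customer function are hypothetical.

```python
import sqlite3

# Relational side: fixed, uniform fields fit a table well.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Ada', 'Boston')")

# Key-value side: a plain dict stands in for a NoSQL store here.
# Each value is a free-form document with no fixed schema.
documents = {
    "customer:1:notes": {
        "comments": ["Prefers email contact", "Asked about bulk pricing"],
        "attachments": ["contract.pdf"],
    },
}


def load_customer(customer_id):
    """Combine the SQL row and the matching document into one record."""
    row = db.execute(
        "SELECT id, name, city FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if row is None:
        return None
    profile = {"id": row[0], "name": row[1], "city": row[2]}
    # A simple key lookup; no query language is needed for this part.
    profile["notes"] = documents.get(f"customer:{customer_id}:notes", {})
    return profile


customer = load_customer(1)
print(customer["name"], customer["notes"]["attachments"])
```

The application code, not either database, decides which store answers which part of the request.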
Which NoSQL project you will deploy depends on several factors:
Fault tolerance. Some databases like Cassandra do automatic read repairs, sending a read request to all replicas and then resolving conflicts if they differ and updating things in the background.
Distributed servers. "MongoDB is more centralized while CouchDB is better suited to intermittently connected devices like mobile phones," says Holzinger.
Learning curve. Some of the NoSQL projects are easier to learn than others. Hadoop is very difficult to learn, while others, such as Ehcache, are dirt simple. In any event, there are lots of YouTube videos and PowerPoint slide decks that can help a developer get started. "For the most part, these tools are simple to adopt and easy to understand, particularly with using their own languages if you are already familiar with SQL," says Holzinger. "NoSQL can be considered a specialty tool to supplement traditional SQL databases in special cases, and is much better than using a raw file system or a homegrown solution."
Scalability. Part of the promise of NoSQL is that some of the projects are designed from the get-go to handle large amounts of data, and incrementally add processing power as the datasets grow. "SQL can't really scale very easily, particularly if you are looking at the scale required for a public Web application," says Matsuoka. "Whereas a product such as CouchDB is designed to scale."
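The read repair mentioned under fault tolerance can be illustrated with a simplified sketch. It is not Cassandra's implementation: the Replica class and read_with_repair function are hypothetical, conflicts are resolved by a simple "newest timestamp wins" rule, and the repair happens inline rather than in the background.

```python
class Replica:
    """A hypothetical replica that stores key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def write(self, key, timestamp, value):
        # Keep the incoming version only if it is newer than what is stored.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)


def read_with_repair(replicas, key):
    """Read from every replica, return the newest value, and fix stale copies."""
    responses = [(replica, replica.read(key)) for replica in replicas]
    versions = [version for _, version in responses if version is not None]
    if not versions:
        return None
    newest = max(versions, key=lambda version: version[0])
    for replica, version in responses:
        if version != newest:
            replica.write(key, *newest)  # repair the stale or missing copy
    return newest[1]


a, b, c = Replica("a"), Replica("b"), Replica("c")
a.write("user:42", 1, "old address")
b.write("user:42", 2, "new address")  # replica c never received either write
result = read_with_repair([a, b, c], "user:42")
```

After the read, all three replicas hold the newest version, so a later failure of any one server does not lose the update.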
"Developers are always experimenting and want to use the newest and shiniest stuff, but you care about the security of your data," says Jimmy Guerrero, the product marketing manager for Red Hat's OpenShift cloud solutions. "While it is nice to have cloud features, at the end of the day, your data has to be available. The tipping point is that Google and Twitter, by the nature of their businesses, can't be done with an Oracle database cost effectively. Once enterprises realize that they have to start offering Web-oriented customer-facing services, they can't use existing SQL technology and will need to start looking at these tools."
"NoSQL encourages a more holistic approach to managing the data lifecycle in an application and puts the developer more in control over more aspects of this lifecycle," says Miller. Clearly, it is here to stay, and developers should start learning at least one of the tools to stay current. Look for projects that have substantial backers, such as Red Hat's OpenShift, or Hadoop or Cassandra from Apache, or MongoDB from 10gen, as places to start. You might also want to check out ThoughtWorks' NoSQL comparison or information on last year's Strange Loop conference.