June 08, 2011, 7:26 PM — While debate rages on over the value of nonrelational, or NoSQL, databases, two case studies presented at a New York conference this week point to the benefits of using the MongoDB non-SQL data store instead of a standard relational database.
Representatives from The New York Times and social networking service Foursquare, speaking Wednesday at the MongoNYC conference, explained why they chose MongoDB. They praised its ability to scale up and ingest large volumes of data, as well as its ease of reconfiguration.
"SQL databases have grown into these weird monstrosities. They don't really map to the problems you actually have, so you try to work around their warts," said Harry Heymann, the Foursquare engineer who oversees the company's servers, during his presentation. "MongoDB is a practical database for problems that engineers in the real world have. It was developed by people who built large-scale Web apps."
For The New York Times, MongoDB has been "awesome for flexible research and development," said Jake Porway, a New York Times data scientist, in his talk. Porway works for the news organization's research and development group, which looks at ways digital technology can enhance the presentation of news.
Porway also praised MongoDB's ability to ingest large amounts of data. "Mongo eats this data up," he said.
The New York Times used MongoDB, the open-source NoSQL data store developed by 10gen, for its experimental Cascade data visualization tool.
Cascade visually demonstrates how links to New York Times stories get copied by multiple Twitter users, showing how messages get passed from one user to the next. "This was an exploratory tool that helps us understand how people share" information, Porway said.
Cascade depicts the number of people who pass a story link on to others, as well as how long it takes for the link to spread from one user to the next.
The New York Times posts 600 pieces of content every day, often putting links to them on Twitter. Links to these stories get rebroadcast across Twitter about 25,000 times a day on average, Porway said. The Cascade system saves all the Twitter messages, as well as the number of times each story link was forwarded and clicked on. All told, it produces about 100 GB of data each month.
"This allows us to [answer] questions that are really big, like what is the best time of day to tweet? What kinds of tweets get people involved? Is it more important for our automated feeds to tweet, or for our journalists?" Porway said.
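As a rough illustration of the time-of-day question Porway raises, the following Python sketch tallies share events by hour. The record layout, the field names (story, user, at, via) and the sample values are hypothetical and are not drawn from Cascade or from MongoDB's API; the dictionaries merely stand in for the kind of loosely structured records a document store can hold.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical share events. Field names and values are illustrative only;
# note that not every record carries the optional "via" field.
events = [
    {"story": "story-a", "user": "u1", "at": "2011-06-08T09:15:00+00:00"},
    {"story": "story-a", "user": "u2", "at": "2011-06-08T09:40:00+00:00", "via": "u1"},
    {"story": "story-b", "user": "u3", "at": "2011-06-08T14:05:00+00:00"},
    {"story": "story-a", "user": "u4", "at": "2011-06-08T09:55:00+00:00", "via": "u2"},
]


def shares_by_hour(events):
    """Count share events per UTC hour of day."""
    counts = Counter()
    for event in events:
        # Parse the ISO 8601 timestamp and normalize it to UTC.
        when = datetime.fromisoformat(event["at"]).astimezone(timezone.utc)
        counts[when.hour] += 1
    return counts


def busiest_hour(events):
    """Return the (hour, count) pair with the most shares, or None if empty."""
    counts = shares_by_hour(events)
    return counts.most_common(1)[0] if counts else None
```

With the sample data above, busiest_hour(events) returns (9, 3): three of the four hypothetical shares fall in the 9 a.m. UTC hour.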
The three-dimensional visualizations can show huge spikes in activity, which the user can then dig into to find more details, such as the actual messages.