How smart crowds are solving big data problems

Businesses and government agencies are using data science crowdsourced sites, such as Kaggle, to solve real problems.

ITworld | Big Data

Featured Kaggle competitions, March 2013

Image credit: Kaggle.com

Holding contests for improving data science models is no longer news, thanks in large part to Kaggle and several of its competitors. What is changing is how private businesses and government agencies interact with the growing data science community, and how they use these projects to further their own operations. Companies as diverse as Allstate Insurance, Microsoft, GE, GM and NASA have run prominent contests with positive results.

The contests are a way to bring outside and fresh perspectives to a thorny business problem, attract attention and new talent, and also provide some excitement in some pretty nerdy areas that normally don't get front-page headlines.

Kaggle has been in business for several years, and now has a roster of more than 80,000 scientists who have entered close to 200 different contests. Each contest maintains a leaderboard and the site tracks the overall history of each entrant. Some have assumed rock star status in the data science world (we'll get to them in a moment).

In some cases, the contests are used as a mechanism to hire the best and the brightest engineers. This is what Facebook did with two different contests: one looking at how to map graphs of the TCP/IP Autonomous Systems' network status figures, the other to recommend missing links in a social network. The latter contest got more than 400 people to post 3,500 different entries.

Security vendor Impermium sponsored another contest, which asked entrants to predict when someone would say something insulting in an online forum. The company was trying to "identify new ways to defend against malicious language and social spam online, and help clean up the web by scrubbing away unwanted obscenities from user-generated content." Not surprisingly, one finding from the competition was that people tend to be most abusive between 9:00 pm and 10:00 pm. The contest drew 50 entries. The prize was $10,000 along with an opportunity to interview for a job at the company. While Impermium ultimately did not hire anyone, "the Kaggle competition was useful and we were able to examine many interesting algorithms," said CEO Mark Risher via email. "These algorithms, even those that didn't win, gave insight into emerging fields of research, and even more importantly, helped ensure against tunnel vision, considering the problem from a fresh perspective."

Many sponsors from the biggest companies ask for a privately held contest. "This is because some of their data is too sensitive to be public," Kaggle CEO Anthony Goldbloom said.

Two separate Microsoft Xbox contests were held last year. In each, the winning team received $5,000 for producing an algorithm that correctly analyzes a series of gestures in Kinect videos. Ford ran another contest in which it asked people to predict driver awareness using a series of data points collected from the cars. An explanation of how one of the top finishers went about crafting their entries is available online.

One of the more unusual contests was set up by NASA to look at the effect of dark matter across the universe. It ran a few years ago, offered a $3,000 prize and drew 72 participating teams. One early entry came from a doctoral student in glacier science named Martin O'Leary. His entry "outperformed the state-of-the-art algorithms most commonly used in astronomy for mapping dark matter," according to a post on the White House blog. The post stated that O'Leary "encouraged people who usually focus on problems unrelated to the question at hand to apply their problem-solving skills to analogous problems in other fields. So it is that the study of glaciers on Earth has now deepened our understanding of the cosmos." O'Leary eventually finished fourth after refining his original algorithm 22 times. The winning team was a physics doctoral student and his professor from the University of California at Irvine.

The pharma company Boehringer-Ingelheim used a Kaggle contest last year as a way to validate machine learning against existing academic research. They picked a well-known data set about cellular drug interaction and kept the nature of the data hidden to make it more inclusive for teams who didn't have any biotech domain expertise. They were "thrilled to see the number of teams," said David Thompson, the director of organizational engagement for the company. Last spring the contest attracted 703 teams who submitted more than 8,000 entries, and at the time, it was the third-most popular contest. "We are likely to have reached the theoretical limit of what the current state of machine learning can say about this data set, in the complete absence of domain-specific knowledge," he said. More than half of the entrants were Kaggle first-timers. Since that competition, another pharma company, Pfizer, has held a private competition to predict future prescription volume.
