How smart crowds are solving big data problems
Businesses and government agencies are using crowdsourced data science sites, such as Kaggle, to solve real problems.
Image credit: Kaggle.com
Holding contests to improve data science models is no longer news, thanks in large part to Kaggle and several of its competitors. What is changing is how private businesses and government agencies interact with the growing data science community, and how they use these projects to further their own operations. Companies as diverse as Allstate Insurance, Microsoft, GE, GM and NASA have run prominent contests with positive results.
The contests are a way to bring outside and fresh perspectives to a thorny business problem, attract attention and new talent, and also provide some excitement in some pretty nerdy areas that normally don't get front-page headlines.
Kaggle has been in business for several years, and now has a roster of more than 80,000 scientists who have entered close to 200 different contests. Each contest maintains a leaderboard and the site tracks the overall history of each entrant. Some have assumed rock star status in the data science world (we'll get to them in a moment).
In some cases, the contests are used as a mechanism to hire the best and the brightest engineers. This is what Facebook did with two different contests: one looking at how to map graphs of the TCP/IP Autonomous Systems' network status figures, the other to recommend missing links in a social network. The latter contest got more than 400 people to post 3,500 different entries.
Security vendor Impermium sponsored another contest, which asked entrants to predict when someone would say something insulting in an online forum. The company was trying to "identify new ways to defend against malicious language and social spam online, and help clean up the web by scrubbing away unwanted obscenities from user-generated content." Not surprisingly, one finding from the competition was that people tend to be most abusive between 9:00 pm and 10:00 pm. The contest drew 50 entries, and the prize was $10,000 along with an opportunity to interview for a job at the company. While Impermium ultimately did not hire anyone, "the Kaggle competition was useful and we were able to examine many interesting algorithms," CEO Mark Risher said via email. "These algorithms, even those that didn't win, gave insight into emerging fields of research, and even more importantly, helped ensure against tunnel vision, considering the problem from a fresh perspective."
Many sponsors from the biggest companies ask for a privately held contest. "This is because some of their data is too sensitive to be public," Kaggle CEO Anthony Goldbloom said.
Two separate Microsoft Xbox contests were held last year. In each, the winning team received $5,000 for producing an algorithm that can correctly analyze gestures in a series of Kinect videos. Ford ran another contest that asked entrants to predict driver awareness from a series of data points collected from its cars. One of the top finishers has explained how they went about crafting their entries.
One of the more unusual contests was set up by NASA to look at the effect of dark matter across the universe. It ran a few years ago, offered a $3,000 prize and drew 72 participating teams. One early entry came from a doctoral student in glacier science named Martin O'Leary. His entry "outperformed the state-of-the-art algorithms most commonly used in astronomy for mapping dark matter," according to a post on the White House blog. The post stated that O'Leary "encouraged people who usually focus on problems unrelated to the question at hand to apply their problem-solving skills to analogous problems in other fields. So it is that the study of glaciers on Earth has now deepened our understanding of the cosmos." O'Leary eventually finished fourth after refining his original algorithm 22 times. The winning team consisted of a physics doctoral student and his professor from the University of California at Irvine.
The pharma company Boehringer-Ingelheim used a Kaggle contest last year as a way to validate machine learning against existing academic research. The company picked a well-known data set about cellular drug interaction and kept the nature of the data hidden to make the contest more inclusive for teams without any biotech domain expertise. They were "thrilled to see the number of teams," said David Thompson, the director of organizational engagement for the company. Last spring the contest attracted 703 teams who submitted more than 8,000 entries, and at the time, it was the third-most popular contest. "We are likely to have reached the theoretical limit of what the current state of machine learning can say about this data set, in the complete absence of domain-specific knowledge," he said. More than half of the entrants were Kaggle first-timers. Since that competition, another pharma company, Pfizer, has held a private competition to predict future prescription volume.
Kaggle has a separate service called Connect, which combines the world’s top data scientists with tools developed to provide corporate customers with the best analytic solution possible. Eight current customers, according to Goldbloom, pay anywhere from $30,000 to $100,000 a month. Participants in each project sign non-disclosure agreements and work within a private area on virtual machines that aren't available to anyone else, and they can't move the data to any outside destination. Customers can come with a specific business problem to solve or with an unexplored data set from which to extract actionable insights. Goldbloom says, "The same people keep performing well irrespective of the problem. It's this fact that allows us to do private gigs, because we can reliably identify who will make the best fit."
So who are some of the Kaggle rock stars? They are from all over the world. I interviewed two who coincidentally have day jobs in the financial services industry. In second place overall is Jason Tigg, a British physicist who works on statistical arbitrage trading. He has entered 14 competitions and won a few of them. He is motivated not by the prize purses but by learning new machine learning techniques. "I feel a buzz around the area, which I imagine was how physics felt around the turn of the last century," he said. "People are trying out new ideas and no one knows for sure where we will all end up." He got his start with the first Netflix prize and was hooked. (Netflix held its own data science contest to improve the algorithms that recommend movies to its members.) He told me via email that "there is a lot of trial and error during the process" to refine his entries.
Another top finisher (and currently third overall) is Olexandr Topchylo of Ukraine, who develops trading strategies for financial markets and holds doctorates in physics and mathematics. "Contests are an ideal way to compare quality of your algorithms and your abilities against other analysts," he emailed me. He has participated in nine different contests, including one of the Facebook recruiting competitions, where he came in seventh but hasn't yet had an opportunity to interview with the company.
"Every year I take part in the Automated Trading Championship. As opposed to Kaggle contests, here a participant has not only to devise an algorithm for predicting some values but he has also to develop a program which will be working online for three months on the organizers' server without human intervention. The code has to make virtual trades on real exchange rates. I took part in all five championships, and one time I even managed to win!"
Besides Kaggle and the trading contest, there are plenty of other places to start your machine learning competitions, including India-based CrowdAnalytix.com, Innocentive.com for the life sciences and TunedIT.org mainly for education and research projects.
How to host a successful Kaggle contest
If you are interested in hosting a competition, you must have a data set from which you can scrub personal information, and a budget for your prize purse. You fill out an entry form on the Kaggle website and their sales staff will consult with you to put the competition online. If a winner is declared (and there is usually a winner), the host company pays the prize purse.
Companies that have hosted competitions share these tips for a successful contest:
Try to be as inclusive as possible. NASA and Boehringer have hosted contests, and kept the jargon and specific domain references to a minimum to encourage entrants from fields other than cosmology and biotech.
Split your data into both a training set (which contestants use to build and prove out their initial models) and a contest set (which isn't available to the contestants, but is used to score the entries).
Consider non-monetary incentives. Most of the Kagglers aren't doing this for the dough, but want the satisfaction of a job well done, or a chance to meet with your staff, or some other reward. For the NASA contest, the winners were invited to their labs to meet with their scientists.
Finally, use your own social media and email contacts to publicize the contest and ensure the widest possible field.
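The data-preparation tip above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kaggle's actual process: the function names `split_contest_data` and `score_submission` are hypothetical, the 30 percent holdout fraction is arbitrary, and simple accuracy stands in for whatever scoring metric a real contest would define.

```python
import random


def split_contest_data(rows, holdout_fraction=0.3, seed=0):
    """Shuffle rows and split them into a public training set and a
    hidden contest set. Names and the default fraction are illustrative."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible for the contest host.
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    contest_set = shuffled[:n_holdout]    # withheld from contestants
    training_set = shuffled[n_holdout:]   # released to contestants
    return training_set, contest_set


def score_submission(predict, contest_set):
    """Return the fraction of hidden rows the entrant's model labels
    correctly. Accuracy is an assumed metric for this sketch."""
    correct = sum(1 for features, label in contest_set
                  if predict(features) == label)
    return correct / len(contest_set)


# Toy data: each row is (features, label); the label is the parity of x.
rows = [((x,), x % 2) for x in range(100)]
training_set, contest_set = split_contest_data(rows)

# A "model" that has learned the parity rule scores perfectly.
score = score_submission(lambda features: features[0] % 2, contest_set)
```

Because contestants never see the contest set, a model that merely memorizes the training rows gains no advantage on the final score.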