The traditional approach in statistical modeling would be to first develop a theoretical model of the activity you’re studying. Such a model would require many assumptions: in the case of trending Twitter topics, who’s doing the tweeting about a topic (number of followers, Klout score, etc.), the way in which topics become trending (e.g., a sharp increase in tweets about the topic vs. a smoother growth rate), and the underlying statistical distribution of the model parameters.
Shah and Nikolov’s approach doesn’t require any of that. Instead, it relies on the data to tell the story, using sample data to “train” the algorithm (in the Twitter example, they sampled one month’s worth of tweets, covering 250 trending and 250 non-trending topics, as training data). Real-world events can then be compared to the training data to determine the likelihood of an outcome. In their example, the key signal was the rate at which trending and non-trending topics were being tweeted about. Given the simplicity of the approach and the relatively small amount of training data, their success rate is quite remarkable.
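To give a feel for the idea, here is a minimal sketch of that “compare to labeled examples” style of prediction. This is my simplification, not their actual algorithm: I assume each labeled example series votes with a weight that decays exponentially with its distance from the observed activity series, so closer examples count more. The function name, the Euclidean distance, and the `gamma` parameter are all illustrative choices.

```python
import math

def predict_trending(observed, trending_refs, nontrending_refs, gamma=1.0):
    """Score an observed activity series against labeled reference series.

    Each reference series "votes" with weight exp(-gamma * distance),
    so closer examples count more. Returns a score in (0, 1): how
    strongly the observed series resembles the trending examples.
    (A simplified sketch, not Shah and Nikolov's actual method.)
    """
    def dist(a, b):
        # Euclidean distance over the overlapping window (a simplification).
        n = min(len(a), len(b))
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(n)))

    w_trend = sum(math.exp(-gamma * dist(observed, r)) for r in trending_refs)
    w_non = sum(math.exp(-gamma * dist(observed, r)) for r in nontrending_refs)
    return w_trend / (w_trend + w_non)

# Toy usage: sharply rising vs. flat reference series (made-up data).
trending = [[1, 2, 4, 8, 16], [1, 3, 6, 12, 20]]
nontrending = [[2, 2, 3, 2, 2], [1, 1, 2, 1, 1]]
score = predict_trending([1, 2, 5, 9, 18], trending, nontrending)
print(score > 0.5)  # the rising series sits closer to the trending examples
```

Note that no model of *why* topics trend appears anywhere: the labeled examples carry all of that information, which is exactly the appeal of the data-driven approach.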
While their method certainly could be valuable to Twitter itself (in fact, Nikolov is currently an intern at Twitter), it could also be applied to any number of other problems involving the recurrence of events over time. Given enough training data and computing power (distributed computing systems are key to making the data processing feasible), it could be used, for example, to predict jumps in stock prices, anomalies in clinical drug trials and traffic jams.
How confident is Shah in this method? “I would bet my life on this,” he said at a recent seminar on this work that I attended.
Fascinating stuff. Now, if you’ll excuse me, I’m going to take a crack at implementing their methodology to predict the likelihood that either (or both) of my daughters will clean her room anytime in the near future.