MIT algorithm can predict which Twitter topics will trend
MIT researchers have developed a new machine learning algorithm to predict the likelihood of events occurring over time
If you're interested in predicting what topics will be trending on Twitter, there are several options to consider:
- Call a psychic hotline
- Bribe someone at Twitter who can give you a heads up when a topic is heading in that direction
- Get a couple of MIT researchers to write a machine learning algorithm to do just that
The first two options may not really work. Most of us (I hope) wouldn’t trust the first. As for the second, even if you knew the right someone at Twitter, they still might not be able to give you much advance notice before a topic starts to trend.
The good news is, the third option has already been done. Devavrat Shah, a professor in MIT's Department of Electrical Engineering and Computer Science, and Stanislav Nikolov, a graduate student, have come up with a new statistical method that can be used to predict the likelihood of events occurring over time. As a proof of concept, they applied their new method to predicting which topics will become trending on Twitter.
The results? Using real tweets, they were able to correctly predict trending topics with 95% accuracy. 79% of the time they predicted trending topics before they became trending on Twitter, with an average lead time of almost 90 minutes.
While these results are impressive, what’s more interesting is their methodology, which is a non-parametric approach to statistical modeling. "Awesome," you say, followed by, "What does that mean?"
The traditional approach in statistical modeling would be to first develop a theoretical model for the activity you’re studying. Such a model would require many assumptions, such as, in the case of trending Twitter topics, who’s doing the tweeting about a topic (number of followers, Klout score, etc.), the way in which topics can become trending (e.g., a sharp increase in tweets about the topic vs. a smoother growth rate), and the underlying statistical distribution of the model parameters.
Shah and Nikolov’s approach doesn’t require any of that. Instead, it relies on the data to tell the story, by using sample data to “train” the algorithm (in the Twitter example, they sampled one month’s worth of tweets, covering 250 trending and 250 non-trending topics, as training data). Real-world events can then be compared to the training data to determine the likelihood of an outcome. In their example, the key metric was the rate at which trending and non-trending topics were being tweeted about. Given the simplicity of the approach and the relatively small amount of training data, their success rate is quite remarkable.
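To make the idea concrete, here is a minimal sketch of this kind of non-parametric, example-based classification: score a new tweet-rate series by how closely it resembles each labeled reference series, weighting nearer examples more heavily. The function name, the distance-weighting scheme, and the toy data below are my own illustration, not the researchers' actual implementation.

```python
import numpy as np

def predict_trending(observed, trending_refs, nontrending_refs, gamma=1.0):
    """Score an observed tweet-rate series against labeled reference series.

    Each reference series "votes" with weight exp(-gamma * distance), so
    references that look like the observation dominate the score. Returns
    a value in (0, 1); above 0.5 suggests the topic resembles the
    trending examples more than the non-trending ones.
    """
    n = len(observed)
    obs = np.asarray(observed, dtype=float)

    def total_weight(refs):
        # Compare against the most recent n points of each reference,
        # using plain Euclidean distance on the rate signal.
        return sum(
            np.exp(-gamma * np.linalg.norm(np.asarray(ref[-n:], dtype=float) - obs))
            for ref in refs
        )

    w_pos = total_weight(trending_refs)
    w_neg = total_weight(nontrending_refs)
    return w_pos / (w_pos + w_neg)

# Toy (made-up) training data: trending topics ramp up sharply,
# non-trending topics stay roughly flat.
trending = [[1, 2, 4, 8, 16], [1, 3, 6, 12, 20]]
flat = [[2, 2, 2, 2, 2], [1, 1, 2, 1, 1]]

score = predict_trending([1, 2, 5, 9, 15], trending, flat)
```

A ramping observation like `[1, 2, 5, 9, 15]` sits much closer to the trending examples, so its score lands well above 0.5, while a flat series scores near zero. The real method also has to handle series of different lengths, noise, and enormous data volumes, which is where the distributed computing mentioned below comes in.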
While their method certainly could be valuable to Twitter itself (in fact, Nikolov is currently an intern at Twitter), it could also be applied to any number of other problems involving the recurrence of events over time. Given enough training data and computing power (distributed computing systems are key to making the data processing feasible), it could be used, for example, to predict jumps in stock prices, anomalies in clinical drug trials and traffic jams.
How confident is Shah in this method? “I would bet my life on this,” Shah said at a recent seminar on the work, which I attended.
Fascinating stuff. Now, if you’ll excuse me, I’m going to take a crack at implementing their methodology to predict the likelihood that either (or both) of my daughters will clean her room anytime in the near future.