Sadilek, Kautz and Silenzio conducted their work using the Twitter Search API, which allowed them to collect a sample of public tweets from the New York metropolitan area. They collected the tweets for a month beginning on May 18, 2010. They used a Python script to periodically query Twitter for all recent tweets within 100 kilometers of the city center, and they distributed the queries over a number of machines with different IP addresses that asynchronously queried the server to avoid exceeding Twitter's query rate limits. They merged the results and then concentrated on the 6,237 users who posted more than 100 GPS-tagged tweets during the month.
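The filtering described above (keep GPS-tagged tweets within 100 kilometers of the city center, then keep users with more than 100 such tweets) can be sketched in Python. This is a minimal illustration, not the researchers' script: the city-center coordinates, the `(user_id, lat, lon)` layout of a tweet, and the function names are all assumptions made for the example, and no Twitter API calls are shown.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0
# Assumed city-center point (roughly lower Manhattan); the article does not give one.
CITY_CENTER = (40.7128, -74.0060)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def frequent_geo_users(tweets, center=CITY_CENTER, radius_km=100.0, min_tweets=100):
    """Return the users with more than `min_tweets` GPS-tagged tweets inside the radius.

    `tweets` is an iterable of (user_id, lat, lon) tuples (a hypothetical layout);
    lat and lon are None when the tweet carries no GPS tag.
    """
    counts = Counter()
    for user_id, lat, lon in tweets:
        if lat is None or lon is None:
            continue  # skip tweets without coordinates
        if haversine_km(center[0], center[1], lat, lon) <= radius_km:
            counts[user_id] += 1
    return {user for user, n in counts.items() if n > min_tweets}
```

Passing the merged results from all query machines through `frequent_geo_users` would yield the set of users worth following geographically; the strict `>` comparison mirrors the "more than 100" threshold in the text.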
Once they had narrowed the population to users they could reliably follow geographically, they still needed to deal with class imbalance: health-related tweets are scarce relative to other types of messages, which makes classifying them reliably tricky. To address this, they trained two binary SVM classifiers (the support vector machine, or SVM, is an established model in machine learning) to distinguish tweets indicating that the tweeter was sick from all other tweets. One SVM classifier was heavily penalized for producing a false positive (labeling a normal tweet as one about sickness), while the other was heavily penalized for producing a false negative (labeling a tweet about sickness as a normal tweet).
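One common way to make a classifier averse to a particular kind of error is to scale the SVM's hinge loss by a per-class cost. The sketch below illustrates that idea only; it is not taken from the researchers' code, and the cost values and names are hypothetical.

```python
def weighted_hinge_loss(score, label, cost_pos, cost_neg):
    """Hinge loss for one example, scaled by a per-class misclassification cost.

    label is +1 for a "sick" tweet and -1 for any other tweet.
    cost_pos scales errors on sick tweets (false negatives);
    cost_neg scales errors on other tweets (false positives).
    """
    cost = cost_pos if label == 1 else cost_neg
    return cost * max(0.0, 1.0 - label * score)


# Hypothetical cost settings, not the values used in the study.
FP_AVERSE = {"cost_pos": 1.0, "cost_neg": 10.0}  # punishes false positives
FN_AVERSE = {"cost_pos": 10.0, "cost_neg": 1.0}  # punishes false negatives

# A normal tweet (label -1) that a model scores as mildly "sick" (+0.5):
loss_under_fp_averse = weighted_hinge_loss(0.5, -1, **FP_AVERSE)
loss_under_fn_averse = weighted_hinge_loss(0.5, -1, **FN_AVERSE)
```

With these assumed costs, the same mistake on a normal tweet costs ten times as much under the false-positive-averse setting, so training against that loss pushes the decision boundary toward labeling fewer tweets as "sick"; the other setting pushes it the opposite way.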
Part of that process involved weighting "features" (essentially keywords) to help the SVMs distinguish between "sick" and normal tweets. For instance, the feature "sick" in a message received a positive weight of 0.9579, whereas the feature "sick of" received a negative weight of -0.4005, indicating a lower likelihood that the tweeter was ill.
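To see how such weights interact, consider a linear score that sums the weights of the features found in a message. The sketch below uses only the two weights quoted above. Everything else is an assumption made for illustration: that features are single words and two-word phrases, that tokens are split on whitespace, and that no bias term is applied.

```python
# The two feature weights quoted in the article; a real model would have many more.
FEATURE_WEIGHTS = {"sick": 0.9579, "sick of": -0.4005}


def extract_features(text):
    """Return the words and adjacent word pairs of a lowercased message.

    Whitespace tokenization is a simplification: punctuation stays attached
    to words, so "sick." would not match the feature "sick".
    """
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def linear_score(text, weights=FEATURE_WEIGHTS):
    """Sum the weights of every feature occurrence; unknown features count as 0."""
    return sum(weights.get(feature, 0.0) for feature in extract_features(text))
```

Under these assumptions, "I feel sick today" matches only "sick" and scores 0.9579, while "so sick of homework" matches both "sick" and "sick of", so the negative weight pulls its score down to 0.9579 - 0.4005 = 0.5574.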
At the other end of this process, they were able to extract more than 700,000 "sick" messages. The researchers then studied the movements of the users who posted these messages, and used their Twitter friendships to gain deeper insight into how the contagion spread:
"To quantify the effect of social ties on disease transmission, we leverage users' Twitter friendships," they wrote. "Clearly, there are complex events and interactions that take place 'behind the scenes', which are not directly recorded in online social media. However, this work posits that these latent events often exhibit themselves in the activity of the sample of people we can observe. For instance, as we will see, having social ties to infected people significantly increases your chances of becoming ill in the near future."