Big data, metadata, and traffic analysis: What the NSA is really doing
The NSA doesn't have to intercept and read all your messages to know what you're doing -- and neither do many Internet businesses.
What I find most remarkable about all the hubbub about the National Security Agency's Prism program is how little new "news" there is to Edward Snowden's "revelations."
After all, the NSA's mission has been to intercept communications and break codes ever since it was founded in 1952. Combine that with the Patriot Act, and anyone who's bothered to read the books of NSA expert James Bamford over the last few years won't find anything in the least bit surprising about Prism.
It's possible, of course, that the NSA is doing something technically interesting, like intercepting and breaking SSL-protected Internet communications. But the NSA doesn't have to bother with deciphering your PGP-protected love notes to your sweetie to know what you're up to. No, they can combine their age-old techniques of working with metadata and traffic analysis with 21st-century big data analysis to have a darn good idea of what you, along with everyone else, are doing.
It's not just the NSA, though. Big Internet businesses have been using the same techniques to deliver customized Web experiences to you for almost twenty years.
Metadata
It's metadata that gives anyone with access to your data, not just the NSA, the ability to work out what you're up to even if your data is locked up and encrypted.
Unless you're a serious photo, video, or music collector, you may not know about metadata. It's "data about data" -- or, more properly in this context, it's data about content. When you look at a Web page, a photo, or an e-mail message, what you see is the human-readable content. Hiding underneath that picture of a kitten, the ITworld Web page, or a note from your mom are all kinds of data about what you see.
With a digital photograph, there can be dozens of data fields. There are multiple formats for this data. The most popular are Exchangeable Image File Format (Exif), International Press Telecommunications Council (IPTC), and Adobe's Extensible Metadata Platform (XMP).
A photograph's metadata can record the camera that was used to take it, and the date and time it was taken -- along with the location, if the camera has a GPS. If you edit your photograph, the metadata can also be used to record what software and operating system you used. And with the right software, or even a Website like exifdata, you can read any image's metadata.
Web pages are the same way. You probably know about cookies and your Web browser history, but there's far more data available out there about your Web interactions than you might think.
For example, when you use Twitter, a host of metadata about each of your tweets is preserved in JavaScript Object Notation (JSON). This data can, in turn, be used by others, including companies such as Gnip, which specializes in analyzing social-networking metadata for enterprises. How much data? There's the stuff that's obvious, such as your Twitter ID and the time and date you sent the tweet, but there's also additional metadata, such as your location and the program and device you used to send the tweet. So it is that Gnip and MapBox can create maps of smartphone users for any given location.
Welcome to Manhattan, which, unlike most places, still has some BlackBerry users contending with many iPhone tweeters and Android doing well in the suburbs. Is that you in the upper right corner?
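To make that concrete, here is a minimal Python sketch of how little work it takes to pull that metadata out of a tweet's JSON. The sample tweet is invented, and the field names (created_at, source, coordinates, user.screen_name) are modeled on the general shape of Twitter's JSON; treat them as assumptions rather than a guaranteed schema.

```python
import json

# Hypothetical tweet payload: the values are made up for illustration and
# the field names are assumed to resemble Twitter's JSON layout.
SAMPLE_TWEET = """
{
  "created_at": "Mon Jun 10 14:03:27 +0000 2013",
  "text": "Lunch in midtown",
  "source": "Twitter for iPhone",
  "user": {"screen_name": "example_user"},
  "coordinates": {"type": "Point", "coordinates": [-73.9857, 40.7484]}
}
"""


def tweet_metadata(raw_json):
    """Return the metadata fields, ignoring the tweet text itself."""
    tweet = json.loads(raw_json)
    coords = tweet.get("coordinates") or {}
    lon_lat = coords.get("coordinates")  # GeoJSON order: [longitude, latitude]
    return {
        "user": tweet["user"]["screen_name"],
        "sent_at": tweet["created_at"],
        "client": tweet.get("source"),
        "longitude": lon_lat[0] if lon_lat else None,
        "latitude": lon_lat[1] if lon_lat else None,
    }


if __name__ == "__main__":
    print(tweet_metadata(SAMPLE_TWEET))
```

Note that the function never touches the text of the tweet. Who, when, where, and on what device all come from the surrounding fields.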
If you think that's bad, consider all the information that the MIT Media Lab Immersion program can pull up about you from just the From:, To:, CC:, and Timestamp fields of the messages in your Gmail account. Stunning, isn't it? When you take a closer look at the traffic analysis, you'll see it's actually far more revealing than it looks at a casual glance.
You don't need to be a traffic analysis expert to figure out who I interact with -- you just need four fields from my Gmail messages. That small square of blue dots at the lower left is ITworld.
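How might a tool build a picture like that from headers alone? The following is a rough Python sketch, not Immersion's actual method: it uses only the standard library's e-mail parser to count who writes to whom from the From, To, and Cc headers. The function name contact_pairs and the sample addresses are hypothetical.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def contact_pairs(raw_messages):
    """Count (sender, recipient) pairs using only From, To, and Cc headers."""
    pairs = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [
            addr.lower() for _, addr in getaddresses(msg.get_all("From", []))
        ]
        recipients = [
            addr.lower()
            for _, addr in getaddresses(
                msg.get_all("To", []) + msg.get_all("Cc", [])
            )
        ]
        # The message body is never read; headers are enough.
        for sender in senders:
            for recipient in recipients:
                pairs[(sender, recipient)] += 1
    return pairs


if __name__ == "__main__":
    # Invented messages for illustration only.
    sample = [
        "From: steven@example.com\nTo: esther@example.com\n"
        "Cc: jodie@example.com\n\n(body ignored)",
        "From: steven@example.com\nTo: esther@example.com\n\n(body ignored)",
    ]
    for pair, count in contact_pairs(sample).most_common():
        print(pair, count)
```

Feed the resulting pair counts into any graph-drawing tool and the clusters start to appear.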
Not worried? Think you can dodge around e-mail tracking with a few simple tricks? That's what former CIA director David Petraeus thought -- and he was wrong, wrong, wrong. Petraeus and his mistress Paula Broadwell used Gmail to communicate, but never actually sent messages to each other. Instead, they used anonymous email accounts to leave drafts of messages for the other to read. Safe? Anything but.
While they did avoid the common mistake of using their home Internet accounts, Broadwell, at least, logged into the various mail accounts from public hotel Wi-Fi networks. From there, it was simply a matter of collating guest lists from various hotels with IP login records and, eventually, it appears, gaining access to the actual drafts.
So it was that the head of the CIA itself was brought down by an FBI investigation of anonymous e-mails. Do you think you can do better? I doubt it.
Traffic analysis
With traffic analysis, you're not looking at metadata so much as at communication patterns. Take that Gmail diagram Immersion created. What does it tell you? Well, even at a casual glance you can see there are clusters of people I communicate with a lot. Would it surprise you to know that, as a technology journalist, there's a direct relationship between each cluster and a particular publication? It shouldn't. The square made up of blue dots at the lower left, for example, is ITworld.
It's not just who you talk to and who speaks to you that can provide hints about what's going on. The pattern of communications matters as well. For example, if you have one person in an organization who frequently sends messages to a large number of people, but who receives relatively few messages back, chances are they're a leader. Are several people in a group who haven't previously been e-mailing or IMing each other suddenly talking to each other? It's a good bet they've been assigned to a joint project or team.
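The "probable leader" rule of thumb is simple enough to sketch in a few lines of Python. This is only an illustration of the heuristic described above: the function name likely_broadcasters and both thresholds (min_recipients, max_reply_ratio) are assumptions picked for the example, not values any agency is known to use.

```python
from collections import defaultdict


def likely_broadcasters(edges, min_recipients=3, max_reply_ratio=0.5):
    """Flag people who write to many others but get few messages back.

    edges: iterable of (sender, recipient) pairs, one per delivered message.
    The thresholds are hypothetical and would need tuning on real traffic.
    """
    sent = defaultdict(int)
    received = defaultdict(int)
    audience = defaultdict(set)
    for sender, recipient in edges:
        sent[sender] += 1
        received[recipient] += 1
        audience[sender].add(recipient)

    flagged = []
    for person, recipients in audience.items():
        wide_reach = len(recipients) >= min_recipients
        few_replies = received[person] <= max_reply_ratio * sent[person]
        if wide_reach and few_replies:
            flagged.append(person)
    return sorted(flagged)


if __name__ == "__main__":
    # Invented traffic log for illustration only.
    log = [
        ("boss", "a"), ("boss", "b"), ("boss", "c"), ("boss", "a"),
        ("a", "boss"), ("a", "b"), ("b", "a"),
    ]
    print(likely_broadcasters(log))
```

Again, nothing here requires knowing what any message says, only who sent it to whom.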
Think about it. Say you know only that Steven tends to e-mail or IM Esther, Jodie, and Amy between 9 a.m. and 5 p.m. from a single IP address, Monday to Friday. What do you think the odds are that he works with them and that he's contacting them from his office? Pretty high, wouldn't you think?
And, of course, with the IP address, thanks to Internet geolocation services like IPlocation, anyone can work out, generally speaking, where Steven's office is. With just two data fields -- time of messages sent and IP address -- you can work out someone's work hours and where their office is.
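Here is a hedged Python sketch of that two-field inference. Given nothing but timestamps and source IP addresses, it finds the most-used address and reports the hours and weekdays on which it was active. The function name work_pattern, the sample events, and the documentation-range IP addresses are all made up; mapping the address to a place would be a separate lookup against a geolocation service, which is not shown.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def work_pattern(events):
    """Summarize when the most frequently used IP address is active.

    events: iterable of (iso_timestamp, ip_address) pairs for one sender.
    """
    events = list(events)
    ip_counts = Counter(ip for _, ip in events)
    main_ip, _ = ip_counts.most_common(1)[0]
    times = [datetime.fromisoformat(stamp) for stamp, ip in events if ip == main_ip]
    return {
        "ip": main_ip,
        "earliest_hour": min(t.hour for t in times),
        "latest_hour": max(t.hour for t in times),
        "days": [DAY_NAMES[d] for d in sorted({t.weekday() for t in times})],
    }


if __name__ == "__main__":
    # Invented message log for illustration only.
    sample = [
        ("2013-06-10T09:05:00", "192.0.2.10"),
        ("2013-06-11T13:30:00", "192.0.2.10"),
        ("2013-06-14T16:45:00", "192.0.2.10"),
        ("2013-06-15T21:10:00", "198.51.100.7"),
    ]
    print(work_pattern(sample))
```

With a few weeks of real traffic instead of four rows, the office hours and the office address would both stand out clearly.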
This is a trivial example. Every day, as noted above when we discussed metadata, you're providing your ISP and favorite Websites -- and, oh yes, the NSA -- with far more data.
Back in the mid-2000s, the NSA was using Narus Semantic Traffic Analyzer, a Linux-based software program, to surveil American Internet traffic. With this deep packet inspection tool, the NSA was able to track who was sending what kind of traffic to whom at a rate of 10 gigabits of IP packets or 2.5 gigabits of Web traffic or email, per second.
That was eight years ago. Think about it.
Big data
Take all that traffic, take all that metadata, and what do you have? You have exabytes of data. Google's Eric Schmidt said in 2010, "There was 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days, and the pace is increasing ... People aren't ready for the technology revolution that's going to happen to them."
Well, people in general may not be, but the NSA has been working hard on it. In the obscure Utah town of Bluffdale, the NSA is building the blandly-named Utah Data Center. In this million-square-foot data storehouse, the NSA will be keeping its -- and your -- data.
It takes more than massive amounts of storage, though, to make big data usable. It takes software, and thanks to technologies such as Hadoop, Hive, NoSQL databases, and Scala, we're getting there.
Hadoop, for example, is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. As IBM puts it, Hadoop is:
- Scalable: New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
- Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
- Flexible: Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
- Fault tolerant: When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
What all this means is that the NSA can use Hadoop, or a similar program, to take in huge amounts of data of all kinds and sorts, store it cheaply, and immediately get to work on it. With sufficient computing power, real-world data could be analyzed in close to real time.
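The programming model behind this kind of distributed processing is usually described as map, shuffle, and reduce. The toy Python sketch below runs all three steps in a single process to count messages per sender. It does not use Hadoop's API, and every name in it (map_record, shuffle, reduce_group, run_job) is hypothetical; on a real cluster, each partition would sit on a different server and the map step would run where the data lives.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit a (sender, 1) pair for one traffic record."""
    sender, _recipient = record
    yield sender, 1


def shuffle(mapped_pairs):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse one key's values into a total."""
    return key, sum(values)


def run_job(partitions):
    """Run map, shuffle, and reduce over a list of data partitions."""
    mapped = (
        pair
        for partition in partitions
        for record in partition
        for pair in map_record(record)
    )
    return dict(reduce_group(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Two invented partitions standing in for data on two servers.
    partitions = [
        [("steven", "esther"), ("steven", "jodie")],
        [("amy", "steven"), ("steven", "amy")],
    ]
    print(run_job(partitions))
```

Because each record is mapped independently, adding servers simply means adding partitions, which is the scalability point made in the list above.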
Maybe the NSA has supercomputers for big data mining, but with programs like Apache Drill and ordinary servers, you too can hunt down answers hidden in petabytes of data in seconds.
Source: Apache.org
The NSA isn't the only one working on this kind of speedy processing of massive data sets. Apache Drill is a relatively new open-source project that's building a distributed system for interactive analysis of large-scale datasets. Inspired by Google's Dremel research, Drill is designed to scale to 10,000 servers and query petabytes of data in seconds. Companies like IBM, HP, and Teradata are already making hundreds of millions of dollars helping customers like GE, Walmart, and Wells Fargo extract useful business information from petabytes of what once seemed like unrelated, even irrelevant data.
Still other companies, like Facebook, Google and Microsoft, use every bit of your data that comes their way from your use of their search engines and services to present you with customized ads. They've been using the triad of big data, traffic analysis, and metadata to make our Web experience more engaging for over a decade.
Good-bye privacy
Put it all together and what do you get? You get a world where even if the NSA isn't actually looking at your Internet messages' content or listening to your phone calls, they can already find out a vast amount about you, whenever they want.
In the meantime, all the major Web companies are already doing the same things. We traded our privacy for the convenience of a customized Web experience years ago.
It's not just that the NSA has long been looking into our affairs while we remained oblivious; it's that we gave up our privacy to businesses ages ago as well.
Ready or not, like it or not, welcome to the 21st century and the death of privacy.