Data snatchers! The booming market for your online identity
A huge, mostly hidden industry is raking in billions by collecting, analyzing, and sharing the information you put on the Web. Should you be worried?
Make no mistake, your personal data isn't your own. When you update your Facebook page, "Like" something on a website, apply for a credit card, click on an ad, listen to an MP3, or comment on a YouTube video, you are feeding a huge and growing beast with an insatiable appetite for your personal data, a beast that always craves more. Virtually every piece of personal information that you provide online (and much that you provide offline) will end up being bought and sold, segmented, packaged, analyzed, repackaged, and sold again.
The "personal data economy" comprises a menagerie of advertisers, marketers, ad networks, data brokers, website publishers, social networks, and online tracking and targeting companies, for all of which the main currency--what they buy, sell, and trade--is personal data.
Their databases pull user information from a long list of sources--everything from birth certificates to browsing history to Facebook "Likes"--and they're becoming better at finding patterns in the data that predict what you might do or buy in the future. A child born in 2012 will leave a data footprint detailed enough to assemble a day-by-day, even a minute-by-minute, account of his or her entire life, online and offline, from birth until death.
And the databases that collect this information are increasingly hyperconnected--they can trade data about you in milliseconds.
Facebook, to many, is the face of the personal data economy. Its entire business is aggregating the personal data that its users give at the site. Today, Facebook uses that mountain of personal data to help advertisers target ads on the Facebook site. However, as many observers have said, Facebook's investors are likely to pressure the now-public company to look for new ways to "monetize" its personal data.
"We're accepting more privacy intrusions each day, sometimes because we don't realize what we're giving out, other times because we don't feel we have a choice, other times because the harm of this isolated transaction seems so remote," says privacy attorney Sarah Downey, who works for personal data security products company Abine.
She adds, "Once collected, our data ends up in unexpected--and unwanted--places, and spam emails, inclusion in harmful information databases, and even identity theft can follow."
In the following pages I'll try to add to the personal data economy story by describing some of the latest trends in personal data collection and analysis--the combination of online and offline data, hyperconnectivity and real-time ad targeting, browser fingerprinting and tracking, and finally the new methods of analyzing huge databases of consumer information.
Combining Online and Offline Data
Personal data has become far easier to access and aggregate than it used to be. Long before we started cataloging our lives on the Internet, much of the information about us lived in hard-copy public records documents at the city hall or the county courthouse. Those public records, which include birth data, real estate records, criminal records, political affiliation and voting records, and more, have in recent years been scanned, digitized, and otherwise fed into databases. That data is now being combined with our online personal data.
A whole industry of public records data companies has sprung up to aggregate public records data from every city, county, and state in the union, and to make the data easily available online (for a price). Some of these firms, like Intelius.com and Spokeo, are combining public records data (originally created offline, in the physical world) with online data (information that we give out via the Internet), such as personal data from social networks.
Spokeo aggregates data taken from social media and networking sites, and it augments user profiles with public records data, the company's chief strategy officer, Emanuel Pleitez, tells me.
Intelius Inc., which owns Intelius.com and other "people search" sites, has begun augmenting its core public records data product by adding social network data to its user profiles. "It's an area we're moving in now," says Jim Adler, chief privacy officer and general manager of data systems at Intelius.
He adds, "Our job is to pull data together from whatever sources are available. If it's publicly available, we'll use it."
Today Intelius is capturing only the most basic information from Facebook, Twitter, and other social networks--names, ages, and where a person has lived. But many aggregators are just beginning to explore the uses of social networking data.
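To make the combination concrete, here is a minimal Python sketch of how an aggregator might attach social network fields to a public records entry. The record layouts, the field names, the sample people, and the matching key (normalized name plus city) are all assumptions made for illustration; neither Spokeo nor Intelius described its actual matching logic for this story.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(text.lower().split())


def match_key(record):
    # Hypothetical matching key: normalized name plus normalized city.
    return (normalize(record["name"]), normalize(record["city"]))


def merge_profiles(public_records, social_profiles):
    """Attach social fields to each public record that shares a matching key."""
    social_by_key = {match_key(p): p for p in social_profiles}
    merged = []
    for rec in public_records:
        profile = dict(rec)  # copy so the input records are not modified
        social = social_by_key.get(match_key(rec))
        if social is not None:
            # Keep only the fields the public record does not already carry.
            profile["social"] = {
                k: v for k, v in social.items() if k not in ("name", "city")
            }
        merged.append(profile)
    return merged


# Made-up sample data, not real people or real networks.
public_records = [
    {"name": "Pat Example", "city": "Springfield", "age": 41},
    {"name": "Lee Sample", "city": "Riverton", "age": 29},
]
social_profiles = [
    {"name": "pat  example", "city": "SPRINGFIELD",
     "network": "ExampleNet", "handle": "@pat"},
]

profiles = merge_profiles(public_records, social_profiles)
```

A key this crude would also merge two different people who happen to share a name and a city, which is one reason aggregated profiles can be inaccurate.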
Data Combination Could Pose New Privacy Threats
This mashup of public records and social networking data may have a dark side: Public records sites such as Intelius, Spokeo, and PeopleFinders.com distribute the kind of data that landlords, insurers, employers, or creditors could easily use to screen applicants--but the sites insist that their content is not intended for such uses.
"The use of our service to screen potential employees, tenants, or for any other purpose that's restricted by the Fair Credit Reporting Act is in violation of our Terms & Conditions," Intelius's Adler wrote in an email to PCWorld.
But many people suspect that personal data offered at public records sites is being used for exactly such purposes. As FTC Commissioner Julie Brill has commented: "I have long been concerned about data that [is] used in place of traditional credit reports to make predictions that become a part of the basis for making determinations regarding a consumer's credit [and] his or her ability to secure housing, gainful employment, or various types of insurance."
And in truth, the public records sites would have no way of knowing if this happened--and may not want to know.
Add social networking info, and an employer or landlord could get a more nuanced (but potentially misleading) picture of a person. Here, data from two parts of a person's life is being accessed--public records, formal and open, and social networking data, informal and intended for "friends." An applicant for a job, a housing rental, or insurance would probably have no inkling of his or her social network data being accessed.
Combining Data for Political Targeting
High-tech targeting isn't just for selling products anymore. It's now being used to sell candidates and ideas.
Political campaigns are combining online and offline data to form a detailed picture of prospective voters, and looking for clues that a voter might be swung by a well-targeted ad. Campaigns from both major political parties are hiring political advertising and consulting firms like Aristotle, CampaignGrid, RapLeaf, and TargetedVictory, all of which have amassed personal and political data on millions of people.
This data is gleaned from voting information in the public record--party affiliation, and how often the person has voted over the years.
Firms can combine that offline political data with other offline data such as real estate records, and then combine that with a subject's online activities, such as social network profiles, online shopping histories, contributions to charities and political causes, and articles read (the types of articles you read say a lot about your political leanings--whether you're pro-guns or pro-choice, say).
Political campaigns will spend more than ever before on advertising in the 2012 elections, according to a report from Borrell Associates, and they will spend far more on online advertising than in the past. Campaigns will spend a total of $9.8 billion (much of it Super PAC money) in 2012, up from $7 billion in 2008, and online advertising spending will rise to $160 million in 2012 from $22 million in 2008.
Still, online political advertising remains in its infancy. While TV will get 57 cents of each advertising dollar spent on 2012 campaigns, online advertising will get only 1.4 cents, the Borrell Associates report says.
Online Advertising 3.0
The online advertising business is powered by personal information. In fact, the industry is being defined by an arms race to develop both new ways to collect more (and more accurate) personal data and better methods to track and analyze people's online choices and behaviors.
Virtually every player in the Web advertising business is sitting on a big database of personal data. Those databases contain the demographic, preference, and social data of millions and millions of Web consumers, and those databases are growing larger and larger all the time. Those databases have also become hyperconnected; that is, various players in the ad delivery chain can share the personal data in those databases in just milliseconds.
In the Web's early days, advertisers were content to place ads in front of people they knew little about in hopes that two or three in 100 would click on them. That "blind" ad-serving model is giving way to "smart" ad serving, where advertisers and their agencies work with intermediaries, and a lot of targeting data, to place ads in front of users likely to click on them.
They judge that likelihood by identifying a person visiting a website, and then evaluating the person's profile in a database, which might contain the person's browsing history, online buying habits, demographics, and even the likes and dislikes of their Facebook friends. After the target prospect has been identified, advertisers want to use the data in an effort to serve up a highly personalized ad to the target.
In short, advertisers are moving away from buying clicks and toward buying "audiences." The audiences are defined by commonalities in their personal data, gathered from many different sources, both online and off.
"[A]n arms race [is] going on in the data economy right now," says Shane Green, CEO of Personal.com, which offers a personal data management tool for consumers. "Everybody is working hard to find differentiated data, and differentiated analytics."
The quality and variety of what's in those databases makes all the difference in the success of an advertising campaign. "The companies that are able to use their data to best identify and serve ads to site visitors in real time will win," one advertising executive who chose to remain anonymous told me.
Real-Time Ad Targeting
Here is a radically simplified explanation of how an advertiser would place an advertisement on a website today:
When someone visits a website, that site has an opportunity to deliver a targeted ad on behalf of one or more of its advertisers. To do this in real time, the website posts the availability of an advertising opportunity on an "exchange"--a Web-based open market where advertisers can bid to deliver targeted ads.
But before the advertiser buys the opportunity to show its ad, it wants to know a lot more about the person who will view it. So it looks for a small bit of identifying code (an HTTP cookie) that it has installed on the visitor's computer in the past. The advertiser then determines whether the cookie ID matches an audience profile in either its own database or that of one of its technology partners.
The profile databases within which the advertiser looks can contain information from hundreds of sources of offline and online data, and can be augmented with information bought from large data brokers such as Acxiom or Experian, or from specialty data brokers like 33Across and Media6Degrees, which sell profiles based on people's social networking data.
If the advertiser finds a match, it then determines how much to pay for the impression based on factors that may include demographics, time of day, or even how recently the visitor last saw one of its ads.
The advertiser might then work with another technology partner to adjust the content of the ad (anything from the messaging to the color of the product) to match the likely interests and tastes of the site visitor.
All of this happens in milliseconds.
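The bidding decision described above can be sketched in a few lines of Python. This is an illustration only: the profile store, the cookie ID, the base bid, and every multiplier are hypothetical values chosen for the example, not figures from any exchange or advertiser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BidRequest:
    site: str
    cookie_id: Optional[str]  # None if the visitor carries no cookie from this advertiser
    hour: int                 # visitor's local hour of day, 0-23


# Hypothetical audience profiles keyed by cookie ID.
PROFILES = {
    "abc123": {"age_band": "25-34", "hours_since_last_ad": 30},
}

BASE_BID = 1.00  # hypothetical starting price for one impression


def decide_bid(request, profiles=PROFILES):
    """Return a bid amount, or None to pass on the impression."""
    profile = profiles.get(request.cookie_id)
    if profile is None:
        return None  # no cookie match, so the advertiser knows nothing about the visitor

    bid = BASE_BID
    if profile["age_band"] == "25-34":       # demographics
        bid *= 1.5
    if 18 <= request.hour <= 22:             # time of day
        bid *= 1.2
    if profile["hours_since_last_ad"] < 24:  # how recently the visitor saw an ad
        bid *= 0.5
    return round(bid, 2)


evening_bid = decide_bid(BidRequest("news.example", "abc123", 20))
no_match = decide_bid(BidRequest("news.example", "zzz999", 20))
```

In a real exchange this lookup and pricing would have to finish within the millisecond budget mentioned above, which is why the sketch uses a simple in-memory dictionary rather than a slower query.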
Fingerprinting Tech: Data Aggregators' BFF
Using cookies to recognize people online and sync up data about them isn't ideal, however. A cookie stored in the browser on a shared PC might reflect the browsing histories of several people in the household. And cookies may not last very long in the browser: Security software is often set to delete cookies once a week. People in the online advertising industry call such deletions "cookie erosion."
Naturally, companies are springing up with technologies that resolve these issues. New "fingerprinting" technologies rely on some highly sophisticated means to verify that the personal data collected at different sites at different times, and for different reasons, are all from the same consumer.
BlueCava, based in Irvine, California, has developed a "device ID" technology that identifies site visitors based on the unique combination of settings in their Web browser. The company then buys demographics, preference, and Web tracking data from site publishers all over the Web, and matches and adds that data to the identified users' profiles in its database. It can then sell all that profile data to advertisers and marketers. BlueCava CEO David Norris says that his company's technology can identify devices with 99.7 percent accuracy, and that it has already identified roughly 10 percent of the 10 billion Internet-connected devices in the world.
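The general idea of deriving an identifier from browser settings can be sketched as follows. This is not BlueCava's method, which the company did not detail for this story; the choice of signals (user agent, screen size, time zone, fonts) and the use of a SHA-256 hash are assumptions made purely for illustration.

```python
import hashlib
import json


def device_id(settings):
    """Derive a stable identifier from a dict of browser settings."""
    # Serialize with sorted keys so the same settings always give the same string.
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Made-up settings reported by the same browser on two different sites.
visit_a = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "timezone": "UTC-8",
    "fonts": ["Arial", "Georgia", "Verdana"],
}
visit_b = {
    "timezone": "UTC-8",
    "fonts": ["Arial", "Georgia", "Verdana"],
    "screen": "1920x1080",
    "user_agent": "ExampleBrowser/1.0",
}
# A different machine with a different screen size.
visit_c = dict(visit_a, screen="1366x768")

same_device = device_id(visit_a) == device_id(visit_b)
other_device = device_id(visit_a) == device_id(visit_c)
```

No cookie is stored on the visitor's machine in this scheme; the identifier is recomputed from whatever the browser reveals. An exact hash like this one breaks as soon as any setting changes, so a production system would presumably need fuzzier matching to keep recognizing a device over time.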
Fingerprinting Challenges Anonymity Online
Fingerprinting technologies like BlueCava's give some in the privacy community serious pause. "I think device ID is really unethical," says Kaliya Hamlin of the Personal Data Ecosystem Consortium. "It's one thing to put cookies in your browser, because you can throw them out; but a device ID is permanent, and takes away your means of defining context in your digital life."
Hamlin believes that device ID degrades privacy by taking away our ability to use alternate identities online to keep assorted aspects of our digital lives separate.
In the physical world, Hamlin points out, we can use physical distance and time to separate the various contexts in which we operate. We can get in the car and drive to our kids' school for a teacher conference, then drive across town to an AA meeting, and maybe participate in a hobby on the weekends. The info we give out in each of these contexts stays separate because we give it to different people at different places at different times.
But online, Hamlin notes, those firewalls just don't exist. Instead, to stay anonymous, people rely on various nicknames and avatars at the sites they frequent. But device ID defeats this practice. Device ID concerns itself with the device and the browser people use to access websites, not the identities they set up there. It ties all those identities together into one big profile.
"Device ID is almost like the police putting GPS trackers on cars, which the Supreme Court just ruled illegal [in United States v. Jones]," Hamlin says. The one difference is that a driver can remove a GPS tracker, but a device ID is established far away, so a computer user can't easily remove it.
BlueCava's Norris counters that his company will remove a device ID from its system if a consumer requests it at the company's website. Norris argues that this accommodation protects privacy better than Do Not Track for cookies because Do Not Track cookies can easily be deleted in the browser (by the user or by antivirus software), whereas the deletion of a device ID is permanent.
The problem, however, is that most people will never even know that a device ID exists for them.
'Big Data' Analysis Infers a Lot From a Little
So-called Big Data is one of the few big concepts that will define technology and culture in the first part of the 21st century. The term refers to the capture, storage, and analysis of large amounts of data. This can mean any kind of data, but the term often refers to the collection and analysis of personal data.
Running deep analysis of terabytes of data was perhaps pioneered by Google, but Big Data practices are now in place at all kinds of organizations, from law enforcement to dating sites to UPS to Major League Baseball. IDC (owned by the same parent company as PCWorld) says that the $3.2 billion that companies spent on Big Data in 2010 will grow to $16.9 billion in 2015.
Among people involved in the personal data economy in one way or another, one anecdote comes up over and over again, and beautifully demonstrates both the possibilities and the dangers of Big Data.
A story by Charles Duhigg in the New York Times Magazine in February described how analysts in the predictive data department of Target developed a way to use the company's customer data to predict the pregnancies (and future baby product needs) of its female customers, sometimes even before the woman's family knew she was pregnant.
This was an extremely important discovery for Target because it allowed the company to show the women ads for various baby products timed to each phase of the pregnancy. There was an even bigger bonus. During the stressful months of pregnancy, future moms' and dads' normal buying habits frequently go out the window, and they look for the most convenient place to buy everything. If Target could get the women into its stores to buy baby products, it might become their go-to source for all sorts of products.
The Target analysts got their breakthrough by looking at the buying histories of women who had signed up for new baby registries at Target. The analysts noticed that pregnant women often bought large amounts of unscented lotion around the start of their second trimester, and that sometime during the first 20 weeks of their pregnancies they bought lots of supplements like calcium, magnesium, and zinc.
The analysts then searched for these same "markers" in all females of childbearing age, found the likely moms-to-be, and sent them offers and coupons for baby products carefully timed to the various stages of pregnancy. Ka-ching.
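The marker search can be pictured with a short Python sketch. The marker list comes from the purchases named above; the customer IDs, the shopping baskets, and the threshold of three markers are hypothetical, and Target's real model is not public in this detail.

```python
# Markers named in the reporting on Target's analysis.
MARKERS = {"unscented lotion", "calcium", "magnesium", "zinc"}


def marker_hits(purchases):
    """Return the set of markers that appear in a purchase history."""
    return MARKERS & set(purchases)


def likely_expecting(purchases, min_hits=3):
    # min_hits is a hypothetical threshold chosen for this example.
    return len(marker_hits(purchases)) >= min_hits


# Made-up purchase histories.
customers = {
    "cust-001": ["unscented lotion", "calcium", "zinc", "bread"],
    "cust-002": ["zinc", "coffee"],
}

flagged = [cid for cid, items in customers.items() if likely_expecting(items)]
```

Note that nothing in the sketch needs a customer's name; an ID and a purchase history are enough to produce the list of people who will receive the offers.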
This is a relatively simple example, and one that happened to be reported in the media. But, as the Duhigg article points out, most large companies in America now have "predictive analysis" departments and are learning to look for the kind of markers that Target discovered hidden in its data.
Big Data Puts Privacy in a New Light
In the Target case, future parents were served with highly relevant ads and offers, and the retailer found a new way to reach its customers and pump up sales. No problem, right?
Wrong, say privacy advocates. The warehousing and analysis of so much data, and so many types of data, might lead the curators of the databases to infer things about us that we never intended to share with anybody. The data might even predict our future behaviors--what even we don't yet know that we're going to do.
The "predictive analysis" of Big Data is often called "inductive analysis" in academic and research circles because it induces large meanings from small sets of facts or markers.
"Inductive analysis concerns itself with singular things that can seem to be innocuous, but that when combined with other innocuous data points--like your favorite soda--can create meaningful predictors of behaviors," says Solon Barocas, a New York University graduate student who is working on a dissertation about inductive analysis.
Target, for instance, didn't even need to know the names of the women it ended up sending pregnancy ads to. It simply delivered a targeted ad to a group of addresses with the right demographics and a common pattern of past purchases. A process so totally cold and machinelike being used to predict something so human, so personal, like pregnancy, is creepy.
In the next ten years, marketers and advertisers will spend more and more on Big Data science, focusing on finding analysts who can discern patterns in large pools of data. Big Data analysis positions are the new hot jobs, and the people who will fill them are a new breed, with new skills. "These people need traditional statistics and computer-science backgrounds, but also some coding and basic hacking skills," Barocas says.
Big Data analysts don't just help target ads for products. A political campaign might do a survey of 10,000 people to learn about their demographics and political choices. It might buy more data about those people from one of the large data sellers, like Acxiom or Experian, then search for unique markers in the data that would predict future political leanings.
But those predictors may bear no obvious relation to what they predict, Barocas says. "For instance, the analysts might find that something odd--like what fashion-magazine subscription people hold--is a strong predictor of the kind of candidate they're likely to vote for."
In future elections and ballot initiatives, billions will be spent on making inferences about voters, and about the issues, candidates, and political ad content that they might be sympathetic to. The campaign with the best personal data and the best analysts may win. That seems like a very undemocratic way to choose our policies and leaders.
Experts say that in the future, predictive analysis will advance to the point where it can tease out information about people's lives and preferences using far more, and far more subtle, data points than were used in the Target case. The inductive models that some companies already use are huge, containing up to 10,000 different variables--each with an assigned weight based on its ability to predict.
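A model of weighted variables can be sketched in miniature. The example below assumes a logistic scoring function, which is one common way to turn weighted variables into a probability; the experts quoted here did not say which technique the companies use. The three variable names, their weights, and the bias term are invented for illustration, and a real model might carry thousands of such entries.

```python
import math

# Hypothetical variables and weights; a larger weight means a stronger predictor.
WEIGHTS = {
    "buys_unscented_lotion": 1.2,
    "buys_supplements": 0.9,
    "buys_cotton_balls": 0.4,
}
BIAS = -2.0  # hypothetical baseline, so that no signals means a low probability


def predict_probability(features, weights=WEIGHTS, bias=BIAS):
    """Combine weighted variables into a probability between 0 and 1."""
    score = bias + sum(
        weights.get(name, 0.0) * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-score))  # logistic function


no_signals = predict_probability({})
all_signals = predict_probability({
    "buys_unscented_lotion": 1,
    "buys_supplements": 1,
    "buys_cotton_balls": 1,
})
```

With these made-up numbers, a shopper showing none of the signals scores about 0.12 and a shopper showing all three scores about 0.62. No single purchase decides the outcome; the prediction comes from the accumulated weights.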
But Big Data analysis may have a built-in public relations problem, because its way of predicting human behavior seems to have little to do with human behavior. Unlike traditional analysis, which seeks to predict future preferences or behaviors based on past ones, the field's inductive analysis concerns itself only with patterns in the numbers.
After Target "targeted" baby ads at women it thought were pregnant, the women and their families criticized the company's tactics. They were creeped out by the ads because Target's inference about them could not be mapped to any piece of data that they had already provided. Even though Target was correct in its inferences, it was simply not intuitive that the purchase of cotton balls and lotion would predict that the buyer was pregnant and would soon be buying diapers.
More than anything else, this new, mathematical method of analysis may force us to look at our privacy and the way we manage our personal data in a whole new light. After all, it's unsettling to know that hundreds of unrelated bits of our data can be pulled together from a hundred different sources (perhaps verified by fingerprinting technology like BlueCava's) and analyzed to reveal numeric patterns in our behavior and preferences.
"Even the smallest, most trivial piece of information might be strung together with other pieces of information in a pattern that is sufficient enough to infer something about you, and that's a challenging world to live in because it upsets our basic intuitions about discretion," Barocas says.
Transparency, Inclusion Might Help Everyone
When Target realized its baby-products ads were getting a negative response, it didn't pull the ads; instead, it elected to hide them among unrelated and less-targeted ads when showing them to pregnant women. Rather than asking female customers if they were interested in special offers for baby products, the company chose to infer the answer in secret.
And that lack of transparency may be the single biggest objection to consumer tracking and targeting today. Advertisers are spending millions to combine, transmit, and analyze personal data to help them infer things about consumers that they would not ask directly. Their practices with regard to personal data remain hidden, and they're acceptable only because people don't know about them.
Such tracking and targeting also feels arrogant. Consumers may not mind being marketed to, but they don't want to be treated as if they were faceless numbers to be manipulated by uncaring marketers. Even the term "targeting" betrays a not-so-friendly attitude toward consumers.
Ironically, advertisers might be far more successful if they pulled back the curtain and included consumers in the process. It's well known that the personal data in the databases of marketers and advertisers is far from completely accurate.
Maybe, as several people I talked to for this story pointed out, the best way to collect accurate data about consumers is to just ask them. And if an advertiser is hesitant to ask for a certain piece of personal data, the advertiser shouldn't infer it.
"What our organization is trying to work out is whether or not there's a way to [collect personal data] where the user knows what's happening and companies [get] their data not by stalking [users] but by asking them," says the Personal Data Ecosystem Consortium's Hamlin.
It might sound something like this, Hamlin says: "You tell us your income and your age and some of your interests, and we promise to use this information to present you with relevant content, [such as] an ad that matches your interests."
Internet Needs to Grow Up
Still, many people--on both the privacy and advertising sides of the fence--believe there is room both for consumer privacy and for Web advertisements and content targeting using personal data. But the veil of secrecy around the use of personal data would have to be lifted.
For that to happen, many believe, everybody in the personal data economy must be more realistic about the economics of the Internet. Advertising, in one form or another, pays the bill for all things free online. Everything that website publishers, content creators, and app developers give away online is paid for with advertising--advertising that is targeted by using consumers' personal data.
Consumers are complicit in the growth of the personal data economy because we have come to expect lots of free services online. From the Internet's earliest days, we've always expected a level of anonymity--but the more free services we use, the more personal data we must give away, and the less privacy and control over our data we have. It's up to us to find our own comfort zone between those two ideals, but we need information and transparency to make that choice.
The online advertising industry needs to become much more transparent about the ways it collects and uses our personal data. If it did so, we might be more inclined to believe its claim that carefully targeted ads actually help us by making Web content more relevant and less spammy.
If a website publisher or social network is offering a "free" service in exchange for the user's personal data, the site should be very clear about that exchange. The online advertising industry should give people options--a choice between "free and tracked" or "paid and not tracked," for instance. That idea is nothing new; it's very similar to the free, ad-based services that also offer an ad-free premium service.
It's not a zero-sum game, where either privacy or targeting wins outright. Advertisers won't stop using personal data to target ads. And few consumers will quit using Facebook or other sites that collect personal data after they read this article. We can't expect complete privacy and anonymity online, but advertisers and marketers must understand where we expect privacy.
The challenge now is for everyone involved--consumers, advertisers, Internet companies, and regulators--to understand how the personal data economy really works.
Only then can we start developing some rules of the road that balance the business needs of advertisers with the privacy needs of consumers.