What Google knows about you

"Google knows more about you than your mother."

Kevin Bankston, senior staff attorney at the Electronic Frontier Foundation, recently made that statement to this reporter. A few years ago, it might have sounded far-fetched. But if you're one of the growing number of people who are using more and more products in Google's ever-expanding stable (at last count, I was using a dozen), you might wonder if Bankston isn't onto something.

It's easy to understand why privacy advocates and policymakers are sounding alarms about online privacy in general -- and singling out Google in particular. If you use Google's search engine, Google knows what you searched for as well as your activity on partner Web sites that use its ad services. If you use the Chrome browser, it may know every Web site you've typed into the address bar, or "Omnibox."

It may have all of your e-mail (Gmail), your appointments (Google Calendar) and even your last known location (Google Latitude). It may know what you're watching (YouTube) and whom you are calling. It may have transcripts of your telephone messages (Google Voice).

It may hold your photos in Picasa Web Albums, which includes face-recognition technology that can automatically identify you and your friends in new photos. And through Google Books, it may know what books you've read, what you annotated and how long you spent reading.

Technically, of course, Google doesn't know anything about you. But it stores tremendous amounts of data about you and your activities on its servers, from the content you create to the searches you perform, the Web sites you visit and the ads you click.

Google, says Bankston, "is expecting consumers to trust it with the closest thing to a printout of their brain that has ever existed."

How Google uses personal information is guided by three "bedrock principles," says Peter Fleischer, the company's global privacy counsel. "We don't sell it. We don't collect it without permission. We don't use it to serve ads without permission." But what constitutes "personal information" has not been universally agreed upon.

Google isn't the only company to follow this business model. "Online tools really aren't free. We pay for them with micropayments of personal information," says Greg Conti, a professor at the U.S. Military Academy at West Point and author of the book Googling Security: How Much Does Google Know About You? But Google may have the biggest collection of data about individuals, the content they create and what they do online.

It is the breathtaking scope of data under Google's control, generated by an expanding list of products and services, that has put the company at the center of the online privacy debate. According to Pam Dixon, executive director at the World Privacy Forum, "No company has ever had this much consumer data" -- an assertion that Google disputes.

Opacity vs. transparency

Critics say Google has been too vague in explaining how it uses the data it collects, how it shares information among its services and with its advertisers, how it protects that data from litigators and government investigators, and how long it retains that data before deleting or "anonymizing" it so that it can't be tracked back to individual users.

"Because of Google's opacity as to how it is using that data, and a lack of fundamental information rights [that] users have, [privacy] becomes a very thorny question," says Dixon.

Privacy policy opacity isn't limited to Google. It's so prevalent, in fact, that the Federal Trade Commission warned the industry in February that online businesses will face increased regulation unless they produce privacy statements that explain in a "clear, concise, consumer-friendly and prominent" way what data the companies collect, how they use it and how users can opt out.

Google, however, contends that the concerns about opacity and the scope of data it collects are overblown. "I do push back on this notion that what we have is a greater privacy risk to users," says Mike Yang, product counsel in Google's legal department. Google, he says, gives users plenty of transparency and control. "There's this notion that an account has a lot more information than is visible to you, but that tends not to be the case. In most of the products, the information we have about you is visible to you within the service."

In fact, though, the data Google stores about you falls into two buckets: user-generated content, which you control and which is associated with your account; and server log data, which is associated with one or more browser cookie IDs stored on your computer. Server log data is not visible to you and is not considered to be personally identifiable information.

These logs contain details of how you interact with Google's various services. They include Web page requests (the date, the time and what was requested), query history, IP address, one or more cookie IDs that uniquely identify your browser, and other metadata. Google declined to provide more detail on its server log architecture, other than to say that the company does not maintain a single, unified set of server logs for all of its services.

Google says it won't provide visibility into search query logs and other server log data because that data is always associated with a physical computer's browser or IP address, not the individual or his Google account name. Google contends that opening that data up would create more privacy issues than it would solve. "If we made that transparent, you would be able to see your wife's searches. It's always difficult to strike that right balance," Yang says.

You do have more control than ever before. Google says it removes user-generated content within 14 days for many products, but that period can be longer (it's 60 days for Gmail). For retention policies that fall "outside of reasonable user expectations or industry practice," Google says it posts notices either in its privacy policy or in the individual products themselves.

You can control the ads that are served up, either by adding or removing interest categories stored in Google's Ads Preferences Manager or by opting out of Google's Doubleclick cookie, which links the data Google has stored about you to your browser in order to deliver targeted advertising. For more information, see "6 ways to protect your privacy on Google."

Shuman Ghosemajumder, business product manager for trust and safety at Google, says users have nothing to worry about. All of Google's applications run on separate servers and are not federated in any way. "They exist in individual repositories, except for our raw logs," he says. But some information is shared in certain circumstances, and Google's privacy policies are designed to leave the company plenty of wiggle room to innovate.

Yang points to Google Health as an example. If you are exchanging messages with your doctor, you might want those messages to appear in Gmail or have an appointment automatically appear in Google Calendar, he says.

Google is hoping that what it lacks in privacy policy clarity, it can make up for in the transparency of its services.

But Dixon, who follows medical privacy issues, contends that they aren't transparent enough. Medical records, once transferred to Google Health, aren't protected by HIPAA or by the rules of doctor-patient confidentiality. Google states that it has no plans to use Google Health for advertising. But by sharing data across services, the company is blurring the lines, Dixon says.

If you have a health problem and you use Google Health, research the disease using Google's search engine, use Gmail to communicate with your doctor, and link appointment details to Google Calendar, and your last location in Latitude was a medical clinic, Dixon asks, "What does the advertiser get to know about you? What about law enforcement? Or a civil litigant? Where are the facts? I don't have them, and that bothers me."

Change in behavior

Google's recent decision to change gears and mine what it knows about you to better target advertisements has also raised concerns.

Until recently, Google placed ads based on "contextual targeting" -- derived from the subject of a search or a keyword in a Gmail message you were reading, for example. To avoid creeping people out with ads targeting sensitive subjects, it avoids the topics of race, religion, sexual orientation, health, political or trade union affiliation, and some sensitive financial categories.

With the information at its disposal, Google could pull together in-depth profiles of its users and launch highly targeted ads based on who you are (your user profile) and your activity history on the Web. The latter is a controversial practice known as behavioral advertising. Until recently, Google rejected the technique.

Then, on March 11, Susan Wojcicki, vice president of product management, announced in a post on Google's official blog that the company was taking a step in that direction. With the launch of "interest-based advertising," Google is beginning to target ads based not just on context but on the Web pages you previously viewed.

That Web page history will come from a log associated with the cookie ID. However, since that ID links not to a unique user but to a unique browser, you may end up viewing ads for Web pages visited by your spouse or others who share your machine. In a bizarre Catch-22, advertisers will be able to target ads at you based on logs that Google says it cannot make available to you -- for privacy reasons.
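To see why a browser-level ID blurs the line between people, consider a rough sketch in Python. Everything in it is hypothetical: the `LogEntry` fields simply mirror the items the article says server logs contain (cookie ID, IP address, date and time, what was requested), and the cookie names, addresses, timestamps and URLs are invented for illustration. Google has not disclosed its actual log format, so this is not a description of it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical log record; field names are assumptions, not Google's schema."""
    cookie_id: str    # identifies a browser, not a person
    ip_address: str
    timestamp: str
    requested: str    # the page or query that was requested


def history_by_cookie(entries):
    """Group requested pages by cookie ID, preserving log order."""
    histories = defaultdict(list)
    for entry in entries:
        histories[entry.cookie_id].append(entry.requested)
    return dict(histories)


# Invented sample data: two people share the browser behind "cookie-A".
entries = [
    LogEntry("cookie-A", "192.0.2.10", "2009-05-01T09:00", "example.com/golf-clubs"),
    LogEntry("cookie-A", "192.0.2.10", "2009-05-01T20:00", "example.com/baby-strollers"),
    LogEntry("cookie-B", "192.0.2.77", "2009-05-01T21:00", "example.com/laptops"),
]

for cookie_id, pages in history_by_cookie(entries).items():
    print(cookie_id, pages)
```

Because the only key is the cookie ID, the golf and stroller pages end up in one history, regardless of which member of the household viewed them.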

Ghosemajumder acknowledges that the situation isn't perfect. "In some cases there is [transparency], and in some cases there isn't," he admits. But he says Google is "trying to come up with more ways to offer transparency."

Privacy advocates fear that interest-based advertising is just the first step toward more highly targeted advertising that draws upon everything Google knows about you. "This is a major issue, because Google has been collecting all of this information over time about people and they said they would not be using that data," says Nicole Ozer, technology and civil liberties policy director at ACLU of Northern California.

But privacy advocates say Google is also doing some things right, such as launching its online Privacy Center and providing additional controls for some of its services.

Google is not acting alone in moving toward behavioral advertising. It is simply joining many other companies that are pursuing this practice. Mike Zaneis, vice president of public policy at the Interactive Advertising Bureau, acknowledges that highly targeted advertising can be creepy. But, he says, "creepiness is not in and of itself a consumer harm."

The practice is unlikely to change unless users respond by abandoning services that use the techniques. But Zaneis argues that they won't, because highly targeted ads are of more interest to users than nontargeted "spam ads."

Concerns have also been raised about Google's ability to secure user content internally. Google has had a few small incidents, such as when it allowed some Google Docs users' documents to be shared with users who did not have permission to view them. But that incident, which affected fewer than 1% of users, pales in comparison to security fiascoes at Google's competitors, such as AOL's release of search log data from 650,000 users in 2006.

Ghosemajumder says the privacy of user data is tightly controlled. "We have all kinds of measures to ensure that third parties can't get access to users' private data, and we have internal controls to ensure that you can't get access to data in a given Google service if you're not part of the team," he says.

How anonymous?

Bowing to pressure, Google has made other concessions as well.

Google doesn't delete server log data, but it has agreed to anonymize it after a period of time so that the logs can't be associated with a specific cookie ID or IP address. After initially agreeing in 2007 to anonymize users' IP addresses and other data in its server logs after 18 months, it announced last September that it was shortening that period to nine months for all data except for cookies, which will still be anonymized after 18 months. "All of our services are subject to those anonymization policies," says Ghosemajumder.

Critics complain that Google doesn't go far enough in how it anonymizes personally identifiable data. For example, Google zeroes out the last 8 bits of the 32-bit IP address. That narrows your identity down to a group of 256 machines in a specific geographic area. Companies with their own block of IP addresses also may be concerned about this scheme, since activity can easily be associated with the organization's identity, if not with an individual. Even anonymized data can be personally identifiable when combined with other data, privacy advocates say.
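The arithmetic behind that criticism is simple enough to sketch in a few lines of Python. This is only an illustration of zeroing the last 8 bits of a 32-bit IPv4 address as described above, not Google's actual code; the function names and the sample address (taken from a range reserved for documentation) are assumptions.

```python
import ipaddress


def anonymize_ipv4(address: str, zeroed_bits: int = 8) -> str:
    """Zero out the lowest `zeroed_bits` bits of an IPv4 address."""
    ip = ipaddress.IPv4Address(address)
    # Build a 32-bit mask whose lowest `zeroed_bits` bits are 0.
    mask = (0xFFFFFFFF << zeroed_bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(int(ip) & mask))


def candidate_count(zeroed_bits: int = 8) -> int:
    """Number of distinct addresses that collapse to the same anonymized value."""
    return 2 ** zeroed_bits


print(anonymize_ipv4("192.0.2.123"))  # 192.0.2.0
print(candidate_count())              # 256
```

The first three octets survive untouched, which is why the anonymized address can still point to a particular organization or neighborhood even though it no longer singles out one machine.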

Sensing an opportunity, and facing similar criticisms, competitors have tried to go Google one better. Rather than anonymizing IP addresses, Microsoft deletes them after 18 months and has proposed that the industry anonymize all search logs after six months. Yahoo anonymizes search queries and other log data after three months, and the Ixquick search engine doesn't store users' IP addresses at all.
