Don't look now, but you've just been scraped

Data mining companies are scraping Web sites to gather your personal information and sell it to the highest bidder.

Everywhere you go on the Web, you leave breadcrumbs behind -- a comment here, a "like" there, a tweet, and so on.  Those tracks may one day come back to haunt you.

Today's reason to be paranoid: Wall Street Journal reporters Julia Angwin and Steve Stecklow's fascinating piece detailing the growth in "scraping" Web sites for information.

Widely known to Web-savvy types but obscure to the general public, "scraping" involves using software to hoover up data off Web sites -- typically information posted in public forums or social networks -- and tuck it away into a database, usually for the purpose of selling it to someone else.
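For the curious, here is roughly what that looks like in practice. This is a minimal sketch, not any particular company's tool: the forum markup, the class names "user" and "body", the sample posts, and the "posts" table are all made up for illustration, and a real scraper would fetch live pages over the network rather than read a canned string.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical forum page; invented markup and invented posts.
SAMPLE_PAGE = """
<div class="post"><span class="user">moody_blue</span><p class="body">Started a new medication last week.</p></div>
<div class="post"><span class="user">quiet_storm</span><p class="body">Does anyone else have trouble sleeping?</p></div>
"""


class PostParser(HTMLParser):
    """Collects (user, body) pairs from the hypothetical markup above."""

    def __init__(self):
        super().__init__()
        self.posts = []      # scraped (user, body) tuples
        self._field = None   # which field the parser is currently inside
        self._user = None    # most recent user name seen

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("user", "body"):
            self._field = css_class

    def handle_data(self, data):
        if self._field == "user":
            self._user = data.strip()
        elif self._field == "body":
            self.posts.append((self._user, data.strip()))

    def handle_endtag(self, tag):
        self._field = None


def scrape_to_db(html, conn):
    """Parse the page and tuck every post away into a database table."""
    parser = PostParser()
    parser.feed(html)
    conn.execute("CREATE TABLE IF NOT EXISTS posts (user TEXT, body TEXT)")
    conn.executemany("INSERT INTO posts VALUES (?, ?)", parser.posts)
    conn.commit()
    return len(parser.posts)


conn = sqlite3.connect(":memory:")   # throwaway in-memory database
count = scrape_to_db(SAMPLE_PAGE, conn)
rows = conn.execute("SELECT user, body FROM posts").fetchall()
```

A few dozen lines, in other words. Point something like this at thousands of pages and you have a sellable database.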


Companies scrape Web sites to find out what people are saying about their products, find people to sell products to, or figure out who to hire. Marginally legal, scraping is essentially stealing, even if the information is out in public for all to see. Worse, it can violate your privacy in a big way.

The WSJ zooms in on the case of a site called PatientsLikeMe, whose "mood" discussion boards were thoroughly scraped last May, violating the privacy of hundreds of users who posted information about their own personal struggles with mood disorders, including the medications they use.

The scraper in question? The Nielsen Company. Yes, that's right, the TV ratings people. They also operate several Net-centric data mining concerns, one of which pulls nasty sh** like this (or did, until its new CEO stopped the practice, shortly after PatientsLikeMe sent them a cease-and-desist nastygram).

Nielsen is hardly alone among the scrapers. Gleaning information from the InterWebs is becoming a big-ticket business. To wit:

Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.

And yes, the stuff you post on social networks like Facebook is also in the mix. The Journal quotes one low-rent scraping company that was hired to "scrape Facebook for a multi-level marketing company that wanted email addresses of users who 'like' the firm's page—as well as their friends—so they all could be pitched products."

Think about that the next time you decide to "Like" something on FB. Hope it's not an embarrassing personal hygiene product.

Most sites, including Facebook, actively try to thwart scraper bots, for the simple reason that they claim ownership over this information and want the option to sell it themselves to potential advertisers (usually in some aggregate or anonymized form). But they don't always succeed. The security geek who compiled profiles for 170 million Facebook users last July did it by creating software bots that scraped the site for user IDs.

Even if you post pseudonymously in most places (hard to do on Facebook and other social nets), you run the risk of someone connecting your alter ego with your actual identity -- like, say, through an email address or Web site associated with a comment. (In the PatientsLikeMe case, some of the site's users linked from their pseudo identity to blogs that identified them by name.) Scrape together enough information, and it's possible to put together a fairly robust profile of a person. It could be happening to you right now, as you read this.
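To see how little it takes to connect the dots, consider this toy sketch. Everything in it is invented for illustration: the handles, the blog URL, the name "Jane Doe", and the link_identities helper. It assumes a scraper already holds two data sets, one of forum posts and one of blogs with named owners.

```python
# Hypothetical scraped forum posts; some users link to a personal blog.
forum_posts = [
    {"handle": "moody_blue", "blog": "http://example.com/janes-blog"},
    {"handle": "quiet_storm", "blog": None},
]

# Hypothetical second data set: blogs whose owners sign their real names.
blog_owners = {"http://example.com/janes-blog": "Jane Doe"}


def link_identities(posts, owners):
    """Match each pseudonym to a real name via a shared blog URL."""
    linked = {}
    for post in posts:
        name = owners.get(post["blog"])
        if name is not None:
            linked[post["handle"]] = name
    return linked


unmasked = link_identities(forum_posts, blog_owners)
```

Here "moody_blue" is matched to a real name by a single shared link, while "quiet_storm", who linked to nothing, stays anonymous.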

What can you do? Be very careful about what you say online, especially in anger. If you use a pseudonym, make sure it can't be traced to your real identity. Create an email address on Gmail or Yahoo Mail that doesn't include any part of your name, and use that for your online communications. Don't link to your Web site or blog unless you feel OK with people connecting what you just said to who you are. Be very wary of revealing anything confidential or embarrassing on message boards or in comments, even anonymously. Most of these sites log your IP address, which can ultimately be traced back to your Internet account.

More important, though, the issue of scraping brings up the bigger question of who owns -- and controls -- the information you put out there on the Web. Data miners will always claim that they do. I think they're wrong. But without some kind of federal protection for our personal information, that's a fight we'll invariably lose.

ITworld TY4NS blogger Dan Tynan will never sell or share your personal information, but he might read it from time to time and snicker. Visit his snarky humor site eSarcasm (Geek Humor Gone Wild) or follow him on Twitter: @tynan_on_tech.
