Most web sites flooded by alien visitors

Study shows 'good' and 'bad' bots combine to make up 51% of typical web traffic

Most web-site owners or managers consider increases in web traffic to be a good thing, but few take into account how much of that traffic comes from non-human sources – or how much of that non-human traffic is outright malicious.

Fifty-one percent of all web traffic comes from bots, software agents and other non-human sources, according to a study released yesterday by Incapsula, a web-service company focused on security and performance management.

About 20 percent of a site's total traffic comes from "good" non-human sources such as search-engine spiders and other software agents that collect information for legitimate purposes and ultimately increase human traffic. The study analyzed traffic records from one thousand Incapsula customers, each of which received between 50,000 and 100,000 visitors per month.

The bad news for web-site owners, privacy advocates and security specialists is that almost a third – 31 percent – of all traffic to an average site comes from "bad" bots: spambots that insert junk messages into comment fields, scrapers that steal content to be reposted for profit on someone else's site, automated hacking tools and an array of spy tools designed to give a site's competitors a detailed view of its operations.

Malware-ish sources such as those are simple enough to block, but only for site managers who realize how high the volume of bad traffic is and who have the tools to filter it out, according to the report – which, it should be noted, comes from a company that sells a service designed to filter exactly that kind of traffic.

Incapsula's report claims most site owners don't know how much of their traffic is non-human because Google Analytics, the most common traffic monitor for small- and mid-sized sites, doesn't distinguish between 'bot and human visitors.

The truth is a little more subtle.

Google Analytics collects visitor data using JavaScript, which most web crawlers and other "good" search-engine 'bots don't execute. As a result, Google Analytics typically doesn't record visits from Googlebot or other search-engine crawlers at all, according to web-performance service Yottaa.
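Because Google Analytics only sees visitors who execute its JavaScript tag, crawler traffic generally shows up only in the raw server logs. Below is a minimal sketch of counting those crawler hits from an access log, assuming the common Apache/Nginx "combined" log format; the user-agent markers are illustrative examples, not an exhaustive list of search-engine 'bots.

```python
# Rough sketch: count crawler hits that a JavaScript-based tracker would never see.
# Assumes an access log in the "combined" format, where the user agent is the
# last quoted field on each line. The marker list below is illustrative only.
import re
from collections import Counter

GOOD_BOT_MARKERS = ("Googlebot", "bingbot", "Slurp", "DuckDuckBot", "Baiduspider")

# The user agent is the final quoted string on a combined-format log line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def count_crawler_hits(log_path):
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = UA_PATTERN.search(line)
            if not match:
                continue
            user_agent = match.group(1)
            for marker in GOOD_BOT_MARKERS:
                if marker in user_agent:
                    counts[marker] += 1
                    break
    return counts

if __name__ == "__main__":
    for bot, hits in count_crawler_hits("access.log").most_common():
        print(f"{bot}: {hits} requests")
```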

Plenty of "bad" 'bots are designed to run JavaScript on and accept cookies from the pages they visit, making them harder to identify and allowing them to be counted under Google Analytics' default settings.

It is possible to filter out traffic from bad bots using the scripting and filtering tools built into Google Analytics – the same tools routinely used to separate traffic from a site's own developers and staff from external, "real" traffic.

Filtering bots is tricky because of the variation among them, not because the filters themselves are hard to create. Yottaa, Google and other web-management-advice sites offer plenty of tips and guidelines.
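Most of those guides boil down to the same idea: look for request patterns no human visitor would produce. The snippet below is a minimal server-side sketch of that idea – not Google Analytics' own filters – flagging clients by user-agent string or by an implausibly high request rate. The patterns and the 60-requests-per-minute threshold are placeholder assumptions, not recommendations from Google, Yottaa or Incapsula.

```python
# Flag traffic as suspect when the user agent matches an obvious automation
# pattern or when a single client makes too many requests in a one-minute window.
import re
from collections import defaultdict

BAD_UA_PATTERN = re.compile(r"(scrapy|curl|python-requests|httpclient|libwww)", re.I)
RATE_LIMIT_PER_MINUTE = 60  # placeholder threshold

def suspect_clients(requests):
    """requests: iterable of (client_ip, user_agent, unix_timestamp) tuples."""
    flagged = set()
    per_minute = defaultdict(int)

    for client_ip, user_agent, timestamp in requests:
        # Obvious automation in the user-agent string.
        if BAD_UA_PATTERN.search(user_agent or ""):
            flagged.add(client_ip)
            continue
        # Request rate no human browsing session is likely to sustain.
        bucket = (client_ip, int(timestamp) // 60)
        per_minute[bucket] += 1
        if per_minute[bucket] > RATE_LIMIT_PER_MINUTE:
            flagged.add(client_ip)

    return flagged
```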

Keeping ahead of the 'bots poses the same problem as keeping ahead of spammers, who change their packages and payloads to avoid filters just as quickly as new filters are created to stop them.

There are plenty of black-hat SEO tools, service companies, bogus-traffic generators and hacking tools available that let even relatively non-technical competitors create malicious 'bot traffic if they choose.

Depending on the severity of the problem and the value you put on the time it takes to stop it, you can segment your traffic, create honeypots to trap bogus visitors and add other security hoops through which bots have to jump before they are either counted as legitimate traffic or given access to potentially sensitive data.
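As one example of the honeypot idea, a page can carry a link that no human visitor sees and that robots.txt tells well-behaved crawlers to skip; anything that requests it anyway is almost certainly an impolite 'bot. The sketch below uses Flask, and the route name, hidden link and in-memory blocklist are illustrative assumptions rather than a production design.

```python
# Minimal honeypot sketch. The /trap URL is never linked visibly and would be
# disallowed in robots.txt, so polite crawlers and human visitors should never
# request it; anything that does gets flagged and refused on later requests.
from flask import Flask, abort, request

app = Flask(__name__)
flagged_ips = set()  # in production this would live in a shared store, not memory

@app.before_request
def refuse_flagged_clients():
    if request.remote_addr in flagged_ips:
        abort(403)

@app.route("/trap")
def honeypot():
    # Only clients that ignore robots.txt and follow hidden links end up here.
    flagged_ips.add(request.remote_addr)
    abort(403)

@app.route("/")
def index():
    # The honeypot link is hidden from human visitors via CSS.
    return '<a href="/trap" style="display:none">do not follow</a><p>Welcome.</p>'

if __name__ == "__main__":
    app.run()
```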

Before you go to the trouble, evaluate the risk involved and the cost to fix it. The value of perfect 'bot filtration is up for debate even among the black-hats who use or create the tools.

It's entirely plausible that half of a typical site's web traffic does come from non-human sources, but it's also possible that sites whose security managers are unfamiliar with either Google Analytics filters or 'bot signatures will put aggressive 'bot blockers in place that end up stopping human visitors while letting the non-humans through.

The key is not to conceal a site from 'bots or block every HTTP request that hasn't proven it comes from a human. The key is to know the difference between 'bot traffic and human traffic, and to know how much effort is justified in blocking one or the other.

Read more of Kevin Fogarty's CoreIT blog and follow the latest IT news at ITworld. Follow Kevin on Twitter at @KevinFogarty. For the latest IT news, analysis and how-tos, follow ITworld on Twitter and Facebook.
