Spider repellent

By Mark Gibbs, Network World | Networking

It’s great to know that you have risen high enough above the background noise of the ‘Net to be routinely spidered by the big search engines. On the other hand, if you have lots of content being browsed by lots of clients, the extra load of being scanned might be the straw that breaks the load balancer’s back -- so to speak.

There’s also the issue that you might not want all of your content indexed, just the parts that really describe and outline who you are and what you do.

If any of this applies to your site, you should be aware of the Robot Exclusion Protocol (REP).

REP is a simple mechanism that tells a visiting spider what should and should not be examined. Note that I wrote "should." There’s no way that a publicly accessible site can enforce robot exclusion, short of keeping a list of all visiting IP addresses that run spiders and explicitly blocking them at the Web server or at a firewall -- methods that would not only be laborious but also inaccurate.

To use REP, all you need to do is create a file named robots.txt (it must be in lower case) in the root of your Web site -- for example, http://www.yoursite.com/robots.txt. This is the only valid place for the file to exist, and the only place spiders will look for it.

The contents of this file consist of a list of directives. The first is "User-agent:", followed on the same line by the name of the spider you want to restrict. This directive also marks the start of a "record," a group of directives that applies to a specific spider definition. Note that blank lines can’t be used in the middle of a record.

So, to specify all spiders:

User-agent: *

On the next line, or lines, are the partial URLs that should be omitted from the spider’s scan, thus:

Disallow: /cgi-bin/
Disallow: /private/
Disallow: /~mgibbs/

If you want to specify that all URLs are to be excluded, then:

Disallow: /

To add comments, precede the comment with a "#", either on the same line as a directive or at the start of a line. So:

# robots.txt created 03-26-01
# let ‘em all see this
User-agent: *
Disallow: /cgi-bin/
# the following is a just-in-case
# they should never get in there!
Disallow: /private/
Disallow: /~stuff/ # nothing public here
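If you want to sanity-check a robots.txt before publishing it, one quick way -- a rough sketch using Python’s standard urllib.robotparser module, with a made-up www.example.com host -- is to feed the rules to the parser and ask which URLs a well-behaved spider would be allowed to fetch:

    from urllib.robotparser import RobotFileParser

    # The example rules from above; parse() expects a list of lines.
    rules = """\
    # robots.txt created 03-26-01
    # let 'em all see this
    User-agent: *
    Disallow: /cgi-bin/
    # the following is a just-in-case
    # they should never get in there!
    Disallow: /private/
    Disallow: /~stuff/ # nothing public here
    """.splitlines()

    parser = RobotFileParser()
    parser.parse(rules)

    # A compliant spider may fetch the home page...
    print(parser.can_fetch("*", "http://www.example.com/index.html"))      # True
    # ...but nothing under the disallowed paths.
    print(parser.can_fetch("*", "http://www.example.com/private/x.html"))  # False
    print(parser.can_fetch("*", "http://www.example.com/cgi-bin/search"))  # False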

Of course, you might want to block certain spiders from everything on your site while allowing specific spiders a degree of access. To do this you add more records. For example, to keep everyone out by default but give Google’s spider (which identifies itself as Googlebot) the run of everything except the usual off-limits areas:

# robots.txt created 03-22-01
# keep everyone out by default
User-agent: *
Disallow: /

# but let Google’s spider see everything
# except the usual off-limits areas
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /private/
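The same kind of quick check -- again just a sketch with Python’s urllib.robotparser, with the host and the second spider’s name invented for the example -- confirms that a spider named in its own record follows that record, while everything else falls under the catch-all:

    from urllib.robotparser import RobotFileParser

    rules = """\
    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Disallow: /cgi-bin/
    Disallow: /private/
    """.splitlines()

    parser = RobotFileParser()
    parser.parse(rules)

    # Googlebot matches its own record, so the blanket Disallow: / doesn't apply to it.
    print(parser.can_fetch("Googlebot", "http://www.example.com/products.html"))    # True
    print(parser.can_fetch("Googlebot", "http://www.example.com/private/x.html"))   # False

    # Any other spider falls through to the "*" record and is shut out entirely.
    print(parser.can_fetch("SomeOtherBot", "http://www.example.com/products.html")) # False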
