Spider repellent

April 4, 2001, 02:11 PM —  Network World — 

It’s great to know that you have risen high enough above the background noise of the ‘Net to be routinely spidered by the big search engines. On the other hand, if you have lots of content being browsed by lots of clients, the extra load of being scanned might be the straw that breaks the load balancer’s back -- so to speak.

There’s also the issue that you might not want all of your content indexed, just the parts that really describe and outline who you are and what you do.

If any of this applies to your site, you should be aware of the Robot Exclusion Protocol (REP).

REP is a simple mechanism that tells a visiting spider what should and should not be examined. Note that I wrote "should." There’s no way that a publicly accessible site can enforce robot exclusion, short of keeping a list of all visiting IP addresses that run spiders and explicitly blocking them at the Web server or at a firewall -- methods that would not only be laborious but also inaccurate.

To use REP, all you need to do is create a file named robots.txt (it must be in lower case) in the root of your Web site (this is the only valid place for the file to exist).
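As a rough illustration of the "root only" rule, here is a minimal Python sketch of how a spider might work out where to look for the file. The function name robots_url and the example.com address are made up for this example; only urlsplit and urlunsplit come from Python's standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper: the path, query and fragment of the page are
    discarded because the file is only valid at the root of the site.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Whatever page the spider starts from, it checks the same single file.
print(robots_url("http://www.example.com/private/report.html"))
# prints http://www.example.com/robots.txt
```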

The contents of this file consist of a list of directives. The first specification is "User-agent:" followed on the same line by the name of the spider you want to restrict. This directive is also treated as the start of a "record," a group of directives that applies to a specific spider definition. Note that blank lines can’t be used in the middle of a record.

So, to specify all spiders:

User-agent: *

On the next line, or lines, are the partial URLs that should be omitted from the spider’s scan, thus:

Disallow: /cgi-bin/
Disallow: /private/
Disallow: /~mgibbs/

If you want to exclude every URL on the site, then:

Disallow: /
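To see how a well-behaved spider might interpret these directives, here is a minimal sketch using the urllib.robotparser module from Python's standard library. The agent name example-spider and the paths being tested are invented for the example, and the module's matching is one implementation's reading of REP, not the only possible one.

```python
from urllib.robotparser import RobotFileParser

# The three partial URLs from the column, in a record for all spiders.
partial_rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /~mgibbs/
"""

# A record that excludes every URL on the site.
block_all_rules = """\
User-agent: *
Disallow: /
"""


def load(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


partial = load(partial_rules)
block_all = load(block_all_rules)

# "example-spider" is a hypothetical agent name.
print(partial.can_fetch("example-spider", "/cgi-bin/search"))   # False
print(partial.can_fetch("example-spider", "/privately.html"))   # True
print(partial.can_fetch("example-spider", "/index.html"))       # True
print(block_all.can_fetch("example-spider", "/index.html"))     # False
```

Note that /privately.html is still allowed: each Disallow value is compared against the start of the requested path, so the trailing slash in /private/ matters.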

To add comments you precede the comment with a "#" either on the same line as a directive or at the start of a line. So:

# robots.txt created 03-26-01
# let ’em all see this
User-agent: *
Disallow: /cgi-bin/
# the following is a just-in-case
# they should never get in there!
Disallow: /private/
Disallow: /~stuff/ # nothing public here
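The comment rule is simple enough to sketch in a few lines of Python. The function below is a hypothetical helper written for this example, not part of any library. It is also a simplification: it skips blank lines entirely, so it does not track where one record ends and the next begins.

```python
def directives(text: str):
    """Yield (field, value) pairs from robots.txt text, ignoring comments."""
    for raw in text.splitlines():
        # Everything from the first "#" to the end of the line is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only or empty line
        field, _, value = line.partition(":")
        yield field.strip().lower(), value.strip()


sample = """\
# robots.txt created 03-26-01
# let 'em all see this
User-agent: *
Disallow: /cgi-bin/
# the following is a just-in-case
# they should never get in there!
Disallow: /private/
Disallow: /~stuff/ # nothing public here
"""

for field, value in directives(sample):
    print(field, value)
```

Only the four directives survive; the comment lines and the trailing comment after /~stuff/ are dropped.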

Of course, you might want to block certain spiders from everything on your site while allowing specific spiders a degree of access. To do this you need to add more records, each starting with its own "User-agent:" line, thus:

# robots.txt created 03-22-01
# let ’em all see this
User-agent: *
Disallow: /cgi-bin/
# the following is a just-in-case
# they should never get in there!
Disallow: /private/
Disallow: /~stuff/ # nothing public here

User-agent: nother-spider
Disallow: # no URL means no exclusions

User-agent: some-spider
Disallow: /
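The effect of the three records can be checked with Python's standard urllib.robotparser module, as in the sketch below. The agent name example-bot stands in for any spider not named in the file and is an assumption of this example, as are the paths being tested. The comment-only lines are left out of the test data for brevity.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /~stuff/ # nothing public here

User-agent: nother-spider
Disallow: # no URL means no exclusions

User-agent: some-spider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A spider that is not named in the file falls back to the "*" record.
print(parser.can_fetch("example-bot", "/private/plans.html"))    # False
print(parser.can_fetch("example-bot", "/index.html"))            # True

# nother-spider has an empty Disallow, so nothing is excluded for it.
print(parser.can_fetch("nother-spider", "/private/plans.html"))  # True

# some-spider is shut out of the whole site.
print(parser.can_fetch("some-spider", "/index.html"))            # False
```

A named record takes precedence over the "*" record for the spider it names, which is why nother-spider can reach /private/ even though the general record excludes it.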

You can also put restrictions on what can be indexed on specific pages -- but for that, you’ll have to wait for the next installment of the Web Applications Newsletter.

» posted by ITworld staff
