Spider repellent
It's great to know that you have risen high enough above the background noise of the Net to be routinely spidered by the big search engines. On the other hand, if you have lots of content being browsed by lots of clients, the extra load of being scanned might be the straw that breaks the load balancer's back -- so to speak.
There's also the issue that you might not want all of your content indexed, just the parts that really describe and outline who you are and what you do.
If any of this applies to your site, you should be aware of the Robot Exclusion Protocol (REP).
REP is a simple mechanism that tells a visiting spider what should and should not be examined. Note that I wrote "should." There's no way that a publicly accessible site can enforce robot exclusion, short of keeping a list of all visiting IP addresses that run spiders and explicitly blocking them at the Web server or at a firewall -- methods that would be not only laborious but also inaccurate.
To use REP, all you need to do is create a file named robots.txt (it must be in lower case) in the root of your Web site (this is the only valid place for the file to exist).
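To make the root-only rule concrete, here is a minimal Python sketch of how a spider might work out where to look for the file, given any page on a site. The function name robots_url and the example.com address are invented for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt location for the site that serves page_url.

    Hypothetical helper: it keeps the scheme and host, and replaces the
    rest of the URL with the fixed, lower-case path /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/docs/intro/page.html"))
```

Whatever directory the page lives in, the result always points at the root of the site, which is why a robots.txt placed anywhere else is never consulted.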
The contents of this file consist of a list of directives. The first is "User-agent:" followed on the same line by the name of the spider you want to restrict. This directive is also treated as the start of a "record," a group of directives that applies to a specific spider definition. Note that blank lines can't be used in the middle of a record, because a blank line marks the end of one.
So, to specify all spiders:
User-agent: *
On the next line, or lines, are the partial URLs that should be omitted from the spider's scan, thus:
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /~mgibbs/
If you want to exclude every URL on the site:
Disallow: /
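The partial URLs work as prefixes: a spider that honors REP skips any path that begins with one of the listed values. The Python sketch below shows that comparison in its simplest form. The function name is_excluded is made up for this example, and the sketch is a simplification: a real spider may also normalize escaped characters before comparing.

```python
def is_excluded(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow value.

    Hypothetical helper showing plain prefix matching. An empty
    Disallow value excludes nothing, so it is skipped.
    """
    for rule in disallow_rules:
        if rule and path.startswith(rule):
            return True
    return False


rules = ["/cgi-bin/", "/private/", "/~mgibbs/"]
print(is_excluded("/cgi-bin/search.pl", rules))  # path begins with /cgi-bin/
print(is_excluded("/index.html", rules))         # no rule is a prefix of this path
print(is_excluded("/index.html", ["/"]))         # every path begins with /
```

This is why a lone "/" shuts a spider out of the whole site: every path on the server starts with it.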
To add a comment, precede it with a "#", either at the start of a line or on the same line as a directive. So:
# robots.txt created 03-26-01
# let 'em all see this
User-agent: *
Disallow: /cgi-bin/
# the following is a just-in-case
# they should never get in there!
Disallow: /private/
Disallow: /~stuff/ # nothing public here
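To show how comments and records fit together, here is a rough Python sketch of how a spider might read such a file. It is an illustration under simple assumptions, not a full parser: the function name parse_records and the dictionary layout are invented here, and only the User-agent and Disallow directives are handled.

```python
def parse_records(text):
    """Group robots.txt directives into records, ignoring comments.

    Hypothetical helper: returns a list of dicts, each holding the
    spider names ("agents") and the Disallow values of one record.
    """
    records = []
    current = None
    for raw in text.splitlines():
        # Everything from "#" to the end of the line is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            # A truly blank line ends the record; a comment-only line does not.
            if not raw.strip():
                current = None
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            # Start a new record unless we are still reading agent names.
            if current is None or current["disallow"]:
                current = {"agents": [], "disallow": []}
                records.append(current)
            current["agents"].append(value)
        elif field == "disallow" and current is not None:
            current["disallow"].append(value)
    return records


sample = """\
# robots.txt created 03-26-01
User-agent: *
Disallow: /cgi-bin/
# the following is a just-in-case
Disallow: /private/
Disallow: /~stuff/ # nothing public here
"""
print(parse_records(sample))
```

Run on the sample, the sketch yields a single record for "*" with three Disallow values; the comments, whether on their own line or trailing a directive, leave no trace.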
Of course, you might want to block certain spiders from everything on your site while allowing specific spiders a degree of access. To do this you need to add more records, one per spider, separated by blank lines, thus:
# robots.txt created 03-22-01
# let 'em all see this
User-agent: *
Disallow: /cgi-bin/
# the following is a just-in-case
# they should never get in there!
Disallow: /private/
Disallow: /~stuff/ # nothing public here

User-agent: nother-spider
Disallow: # no URL means no exclusions

User-agent: some-spider
Disallow: /
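If you would like to check how a well-behaved spider is likely to read a file like this before you publish it, Python's standard library includes a REP parser, urllib.robotparser. The sketch below feeds it the three records above, with the comments left out for brevity, and asks what each spider may fetch. The spider name any-bot and the test paths are invented for the example, and other spiders may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /~stuff/

User-agent: nother-spider
Disallow:

User-agent: some-spider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# (spider name, path) pairs to try; any-bot stands in for "everyone else".
checks = [
    ("any-bot", "/index.html"),
    ("any-bot", "/cgi-bin/search.pl"),
    ("nother-spider", "/private/notes.html"),
    ("some-spider", "/index.html"),
]
for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, path) else "excluded"
    print(f"{agent:15} {path:25} {verdict}")
```

Remember that this only predicts what a cooperative spider will do. Nothing here stops a spider that chooses to ignore the file.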
You can also put restrictions on what can be indexed on specific pages -- but for that, you'll have to wait for the next installment of the Web Applications Newsletter.
» posted by ITworld staff
Network World