SiteProNews: 08/29/03 Feature Article


  Article printed from SiteProNews: http://www.sitepronews.com
  HTML version available at: http://www.sitepronews.com/archives.html
Stopping & Directing Spiders
by Michael Bloch ©Copyright 2003

Not all agents (otherwise known as crawlers, bots, robots and 
spiders) that visit your site will be of benefit. Even the 
"good" spiders, such as the ones Google sends out to index your 
site, may visit places that you don't wish them to.
  
Malicious spiders or web strippers can cause you a great deal of 
grief by taking up server resources and increasing your bandwidth 
usage - this can result in excess bandwidth fees. People use web 
stripper applications (also known as offline browsers) to download 
your entire site. Sometimes their goal is fairly innocent - to go 
through your site while offline using a locally stored copy. In 
other circumstances, there may be a much more devious goal - 
plagiarism or hacking.

How To Ban Spiders

If you notice entries such as Teleport Pro and WebStripper in 
your traffic reporting, there is something you can do about it 
- either via a robots.txt file or through meta-tags.
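
If your host gives you access to raw log files as well as traffic 
reports, a short script can tally which agents are visiting. The 
following Python sketch assumes the common "combined" log format, 
in which the user agent is the last quoted field on each line. The 
sample log lines and the function name count_agents are made up 
for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the
# last double-quoted field on each line.
AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')

def count_agents(log_lines):
    """Return a Counter mapping each user-agent string to its hit count."""
    counts = Counter()
    for line in log_lines:
        match = AGENT_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Made-up sample lines for illustration only.
sample_log = [
    '10.0.0.1 - - [29/Aug/2003:10:00:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Teleport Pro/1.29"',
    '10.0.0.1 - - [29/Aug/2003:10:00:01 +0000] "GET /a.html HTTP/1.0" 200 734 "-" "Teleport Pro/1.29"',
    '10.0.0.2 - - [29/Aug/2003:10:05:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
]

# Print the agents, busiest first.
for agent, hits in count_agents(sample_log).most_common():
    print(f"{hits:5d}  {agent}")
```

An agent that requests a large number of pages in a short time is 
a likely candidate for the measures described below.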

Robots Meta-Tag

The robots exclusion tag is very simple to implement, but it's 
mainly of benefit in keeping search engine spiders out of 
sensitive areas. Unfortunately, most web stripping applications 
ignore it.

The following META tags can be used and should be placed between 
your <head> and </head> tags:

<META NAME="ROBOTS" CONTENT="NOINDEX">

This tells spiders not to index the page. Most search engine 
spiders and some web strippers will honor it. Note that a spider 
must still fetch the page in order to read the tag.

Another method: 

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

The page will still be indexed, but any hyperlinks in that 
page will not be followed by the spider. 

The best method is to combine the two:

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

The page will not be indexed and no links will be followed.
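
To illustrate how a well-behaved spider might read these tags, 
here is a minimal Python sketch using the standard library's 
html.parser module. The class name RobotsMetaParser, the helper 
robots_directives and the sample page are assumptions made for 
this example; real spiders have their own implementations.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())

def robots_directives(html):
    """Return the set of robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

# Hypothetical page carrying the combined tag shown above.
page = ('<html><head><META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">'
        '</head><body></body></html>')
found = robots_directives(page)
print("may index page:  ", "noindex" not in found)
print("may follow links:", "nofollow" not in found)
```

The sketch also shows why the tag only works on agents that choose 
to honor it: the page has already been downloaded by the time the 
tag is read.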

Robots.txt File

The robots.txt file is a more powerful strategy. It is a text 
file that tells agents and spiders which parts of your site they 
may and may not visit. These rules are called the Robots 
Exclusion Standard.

ThinkHost doesn't place a default robots.txt file in your web 
when you open an account, so you'll need to create one in Notepad 
and upload it via FTP to your docs directory. If you are using 
Microsoft FrontPage, save the file to the root directory of your 
disk-based web and then upload via FrontPage's standard 
HTTP:// publishing function.

Never use a blank robots.txt file as some search engines may see 
this as an indication that you don't want your site spidered at 
all! Have at least one entry in the file and remember to skip a 
line between entries. Also ensure that the spider/agent that you 
are banning doesn't turn out to be a legitimate software browser.

To prevent specific agents and spiders from having any access 
to your site, put these lines into the robots.txt file:

User-agent: NameOfAgent
Disallow: /

You must record the name of the agent exactly as it appeared in 
your traffic reports, for example WebZip/4.0:

User-agent: WebZip/4.0
Disallow: /

Skip a line between entries. You could do the same to exclude 
search engine spiders such as Googlebot. The "/" disallows 
access to every directory on your site. 
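
Before uploading a robots.txt file, you can check how a compliant 
agent would interpret it. The Python sketch below uses the 
standard library's urllib.robotparser module. The rules, the 
helper name is_allowed and the paths are hypothetical. The sample 
rule uses a bare agent name without a version number because this 
particular parser compares only the part of the visiting agent's 
name that comes before any "/"; other agents may match names 
differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one agent outright, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: WebStripper
Disallow: /
"""

def is_allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# The banned agent is refused; an agent with no entry is allowed.
print("WebStripper:", is_allowed(ROBOTS_TXT, "WebStripper", "/index.html"))
print("Googlebot:  ", is_allowed(ROBOTS_TXT, "Googlebot", "/index.html"))
```

This only models agents that obey the file. It says nothing about 
an agent that ignores robots.txt altogether.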

You can also prevent access to specific folders:

User-agent: *
Disallow: /cgi-bin/

In this example, the * means "all agents". Please note that the 
wildcard (*) cannot be used on the Disallow line; use "/" 
instead.
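
A folder rule can be checked in the same way. The following Python 
sketch, again using urllib.robotparser, lists which of a few 
sample paths a compliant agent would be refused. The function name 
blocked_paths, the agent name AnyBot and the paths are invented 
for the example.

```python
from urllib.robotparser import RobotFileParser

# The folder rule from the example above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""

def blocked_paths(robots_txt, agent, paths):
    """Return the paths that `agent` may not fetch under the rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]

# Hypothetical paths on the site.
sample_paths = ["/index.html", "/cgi-bin/form.cgi", "/images/logo.gif"]
print(blocked_paths(RULES, "AnyBot", sample_paths))
```

Only the path inside /cgi-bin/ should be reported as blocked; the 
rest of the site stays open to all agents.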

Example robots.txt file

If you would like some sort of guide and further examples of a 
robots.txt file, you can take a look at the one we use on the 
ThinkHost site. View it here:

http://www.thinkhost.com/robots.txt

Our file is by no means complete, but it does contain a number 
of "idiot" bots that repeatedly attempt to strip our main site. 
Please be aware that robots.txt will not stop all web stripping 
activity as many strippers can fake agent names, but it will 
help you save on bandwidth.

Good Spiders

If you would like to be able to identify the "good" spiders that 
may visit your site, you can view a listing of the most popular 
search engines' robots in our tutorial, "Understanding your web 
site traffic":

http://www.thinkhost.com/services/kb/interpreting-statistics.shtml

================================================================
Article by Michael Bloch of Team ThinkHost 
(http://www.thinkhost.com) Thinking Hosting? ThinkHost!
================================================================

Copyright © 2003 Jayde Online, Inc.  All Rights Reserved.

SiteProNews is a registered service mark of Jayde Online, Inc.