SiteProNews: May 16, 2005 Feature Article


  Article printed from SiteProNews: http://www.sitepronews.com
  HTML version available at: http://www.sitepronews.com/archives.html
How To Control Search Engine Robots
By Michael Rock (c) 2005

Wouldn't it be nice to be able to leave some code in your web
site to tell the search engine spider crawlers to make your site
number one? Unfortunately a robots.txt file or robots meta tag
won't do that, but they can help the crawlers to index your site
better and block out the unwanted ones.

First, a few definitions:

Search Engine Spiders or Crawlers - A web crawler (also known as
a web spider) is a program which browses the World Wide Web in a
methodical, automated manner. Web crawlers are mainly used to
create a copy of all the visited pages for later processing by a
search engine, which will index the downloaded pages to provide
fast searches.

A web crawler is one type of bot, or software agent. In general,
it starts with a list of URLs to visit. As it visits these URLs,
it identifies all the hyperlinks in the page and adds them to
the list of URLs to visit, recursively browsing the Web
according to a set of policies.
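
To make the crawling loop above concrete, here is a minimal
sketch in Python. It is not how any particular search engine
works. The three-page FAKE_WEB dictionary, the example.com
addresses, and the crawl function are all made up for
illustration, and pages are read from memory rather than from
the live Web.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical three-page site standing in for the live Web.
FAKE_WEB = {
    "http://www.example.com/":
        '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "http://www.example.com/about.html": '<a href="/">Home</a>',
    "http://www.example.com/news.html": '<a href="/about.html">About</a>',
}


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first, queueing each new link only once."""
    to_visit = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched, so skip it
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
    return visited


pages = crawl(["http://www.example.com/"], FAKE_WEB.get)
print(pages)
```

The "set of policies" in this sketch is very small: visit pages
in the order they were found, never queue the same address
twice, and stop after max_pages pages.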

Robots.txt - The robots exclusion standard or robots.txt
protocol is a convention to prevent well-behaved web spiders and
other web robots from accessing all or part of a website. The
parts that should not be accessed are specified in a file called
robots.txt in the top-level directory of the website.
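
Because the file always sits in the top-level directory, its
address can be worked out from the address of any page on the
site. A small Python sketch follows. The function name
robots_txt_url and the example.com address are made up for
illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; drop the page's path and query string.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/shop/item.html?id=7"))
# prints http://www.example.com/robots.txt
```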

The robots.txt protocol is purely advisory, and relies on the
cooperation of the web robot, so that marking an area of your
site out of bounds with robots.txt does not guarantee privacy.
Many web site administrators have been caught out trying to use
the robots file to make private parts of a website invisible to
the rest of the world. However the file is necessarily publicly
available and is easily checked by anyone with a web browser.

The robots.txt patterns are matched by simple prefix
comparisons: a path is blocked if it begins with the pattern.
Care should therefore be taken to make sure that patterns
matching directories have the final '/' character appended;
otherwise all files with names starting with that pattern will
match, rather than just those in the directory intended.
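
The effect of the final '/' is easy to see in a few lines of
Python. This is only a sketch of the matching rule described
above, not the code of any real crawler. The function name
is_blocked and the sample paths are made up for illustration.

```python
def is_blocked(path, disallow_pattern):
    """Return True when the path begins with the Disallow pattern."""
    # An empty Disallow value blocks nothing under the standard.
    return bool(disallow_pattern) and path.startswith(disallow_pattern)


paths = [
    "/duplicate/page.html",
    "/duplicates.html",
    "/duplicate-free.html",
    "/index.html",
]

for pattern in ("/duplicate", "/duplicate/"):
    blocked = [p for p in paths if is_blocked(p, pattern)]
    print(pattern, "blocks", blocked)
```

Without the trailing slash, the pattern blocks the first three
paths. With it, only /duplicate/page.html is blocked.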

Meta Tag - Meta tags are HTML tags that provide structured data
about a web page itself, that is, data about data.

In the early 2000s, search engines veered away from reliance on
Meta tags, as many web sites used inappropriate keywords, or
were keyword stuffing to obtain any and all traffic possible.

Some search engines, however, still take Meta tags into some
consideration when delivering results. In recent years, search
engines have become smarter, penalizing websites that are
cheating (by repeating the same keyword several times to get a
boost in the search ranking). Instead of going up rankings,
these websites will go down in rankings or, on some search
engines, will be kicked off of the search engine completely.

Index a site - The act of crawling your site and gathering
information.

How can the robots.txt file and meta tag help you?

In the robots.txt file you can tell harmful web crawlers to
leave your web site alone, and give helpful hints to the ones
you want to crawl your site. Here is an example of how to bar a
web crawler from your site:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

ia_archiver is the crawler name for the Wayback Machine, which
you may have heard of, and the / after Disallow tells
ia_archiver not to index any of your site. The # lets you write
comments to yourself so you can keep track of what you typed.

Type the above three lines into Notepad on your computer and
save the file to the root directory of your web site as
robots.txt. Web crawlers look for this document first at a web
site before doing anything else. This helps the crawler to do
its job, and helps the web site owner tell the spider what to
do. Say, for instance, you have some data that you don't want
the crawlers to see (like duplicate content for other browser
referrer pages).

You can deter crawlers from indexing the 'duplicate' directory
by typing this into your robots.txt file:

User-agent: *
Disallow: /duplicate/

If you would rather have the robots.txt file created for you,
visit http://www.rietta.com/robogen. To validate your robots.txt
file and make sure it works properly, you can visit
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

The * after User-agent says that this rule applies to all
crawlers, and /duplicate/ after Disallow tells all crawlers to
ignore this directory and not search it. Each group of
User-agent and Disallow lines must be separated from the next
group by a blank line in order for the file to function
correctly. So this is how you would combine the above two rules
in one robots.txt file:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
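
If you have Python installed, you can check how a well-behaved
crawler might read this file using the urllib.robotparser module
from the standard library. This is only a sketch. The
example.com addresses and the crawler name ExampleBot are made
up for illustration, and real crawlers may interpret the file
somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# The combined robots.txt file from the article.
ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# (crawler name, address it would like to fetch)
checks = [
    ("ia_archiver", "http://www.example.com/index.html"),
    ("ExampleBot", "http://www.example.com/index.html"),
    ("ExampleBot", "http://www.example.com/duplicate/page.html"),
]

results = [parser.can_fetch(agent, url) for agent, url in checks]
for (agent, url), allowed in zip(checks, results):
    print(agent, url, "allowed" if allowed else "blocked")
```

ia_archiver is blocked everywhere, while ExampleBot falls under
the * rule and is blocked only inside /duplicate/.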

One very important thing to note: anyone can access the
robots.txt file of a site. So if you have information that you
don't want anyone to see, don't list it in the robots.txt file.
If the directory that you don't want anyone to see is not linked
to from your web site, crawlers generally won't find it anyway.

An alternative way to block indexing is to put a meta tag into
the page itself. It looks like this:

<meta name="robots" content="noindex,nofollow">

You put this into the <head> section of your web page. This line
tells the robot crawlers not to index (search) the page and not
to follow any of the hyperlinks on the page. As another example,
<meta name="robots" content="noindex,follow"> tells the robot
crawlers not to index the page, but to follow the hyperlinks
on this page.
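
To show how a crawler might read this tag, here is a short
Python sketch built on the standard html.parser module. It is an
illustration only, not the code of any real search engine. The
class RobotsMetaParser and the function page_permissions are
made-up names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name == "robots":
            content = attributes.get("content") or ""
            for item in content.split(","):
                item = item.strip().lower()
                if item:
                    self.directives.add(item)


def page_permissions(html):
    """Return (may_index, may_follow) for a page's HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


page = ('<html><head><meta name="robots" content="noindex,follow">'
        '</head><body>Hello</body></html>')
print(page_permissions(page))  # prints (False, True)
```

A page with no robots meta tag at all comes back as
(True, True) in this sketch, meaning nothing is restricted.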

Did You Know That Google Has Its Own Meta Tag?

It looks like this:
<meta name="googlebot" content="noindex,nofollow,noarchive">
This tells the Google robot crawler not to index the page, not
to follow any of the links, and not to store cached versions of
your web site. You will want this if you update the content on
your site frequently. It prevents the web user from seeing
outdated content that lingers in the cache after the page has
changed.

You can use the meta tag to specifically talk to Google's robots
to avoid complications or if you are optimizing your site for
Google's search engine. This concludes this month's article.

Until the next article have a great day!

Copyright © Michael Rock Web development contractor (Web Design
and Hosting) Internet Presence http://www.TheInternetPresence.com

================================================================
The owner of this registered company has over twenty years'
experience with DOS, Windows business applications, numerous
programming languages, artistic development, and web design.
Other areas of interest include web marketing, web promoting,
and business marketing and development. After the persuasion of
those praising his work, he decided to go into business himself
and highly recommends that everyone else do the same.
================================================================

Copyright © 2005 Jayde Online, Inc.  All Rights Reserved.

SiteProNews is a registered service mark of Jayde Online, Inc.