SiteProNews: April 20, 2007 Feature Article
Article printed from SiteProNews: http://www.sitepronews.com HTML version available at: http://www.sitepronews.com/archives.html
Use Robots.txt, Save the World
By Sante J. Achille (c) 2007
Robots.txt Helps the Search Engines Learn All About Your Website
There is growing interest in a little-known file that every
website should have in its root directory: robots.txt.
It's a very simple text file that you can learn all about at the
robotstxt.org (http://www.robotstxt.org/) website.
Why should you use it? Here are some good reasons for you to
consider.
Controlled Access to Your Content
With a robots.txt file you can "ask" the search engines to "keep
out" of certain areas of your website. A typical area you might
like to exclude is your images folder: if you aren't a
photographer or painter and your images are for your website's
use only, chances are you don't want them indexed and showing up
on image search engines for people to download or hotlink.
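As a rough illustration of the images example, the sketch below uses
Python's standard urllib.robotparser module to show how a
well-behaved spider would read such a rule. The /images/ folder
name, the www.example.com host and the "ExampleBot" user agent are
assumptions made up for the example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every spider to stay out of /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, agent="ExampleBot"):
    """Return True if a compliant spider may fetch the given URL."""
    return parser.can_fetch(agent, url)


print(is_allowed("http://www.example.com/images/logo.png"))  # False
print(is_allowed("http://www.example.com/index.html"))       # True
```

Note that this only models spiders that choose to honour the file.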
Unfortunately, grabbers and similar software (such as email
harvesting applications) will not read your robots.txt file and
will disregard any instructions you provide in it. But that's
life, isn't it? There is always someone being disrespectful, to
say the least ...
You can keep search engines away from content you wish to keep
out of sight, but remember that your robots.txt file is public
and can also be read by hackers looking for sensitive targets
you might inadvertently list. You could end up keeping out the
robots while inviting the hackers, so keep this in mind.
The Growing Importance of Robots.txt
At SES New York a robots.txt summit was held where major search
engines (Ask, Google, Microsoft, Yahoo!) participated, sharing
interesting information on this file. Here are some numbers.
According to Keith Hogan from Ask:
i) Less than 35% of websites have a robots.txt file
ii) The majority of robots.txt files are copied from others
found online
iii) On many occasions robots.txt files are provided by your web
hosting service
It looks like the majority of webmasters aren't familiar with
this file. That is going to matter more as the web continues to
grow: spidering is a costly effort that search engines try to
optimize. Websites that demonstrate good command of robots.txt,
and so make spidering more efficient, will be rewarded.
During the summit, all the participating search engines
announced they will identify (or autodiscover) sitemaps via the
robots.txt file. In essence, search engines are now able to
discover your sitemap via a line in the following format:
Sitemap: <sitemap_location>
where <sitemap_location> is the complete URL of your Sitemap
Index File (or your sitemap file, if you don't have an index
file).
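To make the Sitemap format concrete, here is a minimal sketch of how
a spider might pick such lines out of a robots.txt file. The
find_sitemaps function name and the example.com URL are hypothetical,
and real spiders may parse the file differently.

```python
def find_sitemaps(robots_txt):
    """Return the URLs given on Sitemap: lines of a robots.txt body."""
    sitemaps = []
    for raw_line in robots_txt.splitlines():
        # Drop comments; this also cuts a URL at "#", which is
        # acceptable for a sketch but not for production use.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


# Hypothetical robots.txt: nothing excluded, one sitemap announced.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap_index.xml
"""

print(find_sitemaps(EXAMPLE_ROBOTS_TXT))
```

The field name is compared case-insensitively as a precaution, so
"sitemap:" and "Sitemap:" are treated alike in this sketch.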
Complying with Google's Webmaster Guidelines
Robots.txt can help prevent you from getting banned or
penalized by Google. In a move to keep search results pages out
of its index, because "web search results don't add value to
users", Google has recently added the following sentence to its
webmaster guidelines:
- Use robots.txt to prevent crawling of search results pages or
other auto-generated pages that don't add much value for users
coming from search engines.
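One way to act on this guideline is sketched below: a small helper
that writes the exclusion rules for you. The build_robots_txt name
and the /search and /tag/ paths are assumptions for illustration;
substitute whatever paths your own site uses for search results and
other auto-generated pages.

```python
def build_robots_txt(disallowed_prefixes, user_agent="*"):
    """Return robots.txt text asking user_agent to skip each prefix."""
    lines = [f"User-agent: {user_agent}"]
    if disallowed_prefixes:
        lines += [f"Disallow: {prefix}" for prefix in disallowed_prefixes]
    else:
        # An empty Disallow value means nothing is excluded.
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


# Hypothetical paths for internal search results and tag listings.
print(build_robots_txt(["/search", "/tag/"]))
```

Save the resulting text as robots.txt in the root directory of the
site.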
How to Implement a Robots.txt File
If your website doesn't have a sitemap and you do not have
any areas to exclude, include an empty robots.txt file in your
root directory. By doing so you explicitly allow full spidering
of your entire site.
Carefully review the robots exclusion protocol available at
robotstxt.org. If you must exclude numerous areas of your
website, build your file step by step and monitor spider
behaviour with a log analyser tool.
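If you have no log analyser to hand, a few lines of script can give
a first impression. The sketch below assumes your server writes the
common "combined" log format and that spiders can be recognised by
substrings such as "bot" or "spider" in the user agent; both are
assumptions, and the sample log lines are invented.

```python
import re
from collections import defaultdict

# Tail of a combined-format line: "request" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Substrings often seen in spider user agents (not an exhaustive list).
SPIDER_HINTS = ("bot", "spider", "slurp", "crawl")


def spider_traffic(log_lines):
    """Return {user_agent: [hits, bytes]} for spider-like requests."""
    totals = defaultdict(lambda: [0, 0])
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent")
        if not any(hint in agent.lower() for hint in SPIDER_HINTS):
            continue  # looks like an ordinary visitor
        size = match.group("size")
        totals[agent][0] += 1
        totals[agent][1] += 0 if size == "-" else int(size)
    return dict(totals)


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [20/Apr/2007:10:00:00 +0000] '
    '"GET /robots.txt HTTP/1.1" 200 0 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [20/Apr/2007:10:00:01 +0000] '
    '"GET /index.html HTTP/1.1" 200 2500 "-" "ExampleBot/1.0"',
    '192.0.2.7 - - [20/Apr/2007:10:00:05 +0000] '
    '"GET /index.html HTTP/1.1" 200 2500 "-" "Mozilla/5.0"',
]

for agent, (hits, total_bytes) in spider_traffic(SAMPLE_LOG).items():
    print(f"{agent}: {hits} hits, {total_bytes} bytes")
```

Running it before and after each change to robots.txt shows whether
the spiders have picked up the new rules.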
Test your robots.txt file with a few online tools and keep in
mind that every spider has its own behaviour and spidering
criteria.
Avoid Useless Spidering Traffic
When your website grows to a significant size and achieves good
visibility, spidering increases to hundreds (if not thousands)
of hits per day, putting your server and bandwidth to the
test.
Recently I was called on to examine a blog burdened by unusual
and extremely heavy spidering activity: the log file I examined
reported in excess of 8 Gbyte of invisible (spider) traffic
over a one-month period. Given the small number of daily
visitors (less than 200) and the small size of the blog (less
than 100 posts), there was something wrong in the architecture.
It took just a few minutes to identify the problem: There was no
robots.txt file.
Each request for robots.txt was redirected to the home page of
the blog, triggering a complete download of that page. Each
download of the home page was approximately 250 K, and there
were thousands of unnecessary hits on the home page.
This was causing a spidering frenzy that ceased when an empty
robots.txt file was created and uploaded to the server. Traffic
is now down from 8 Gbyte to 500 Mbyte.
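A back-of-the-envelope check of these figures is sketched below. It
assumes decimal units (1 Gbyte = 1000 Mbyte, 250 K = 0.25 Mbyte), a
30-day month, and that the whole drop in traffic came from
redirected robots.txt requests; none of these assumptions is
confirmed by the log itself, so treat the result as an estimate.

```python
MBYTE_PER_GBYTE = 1000  # assumption: decimal units


def redundant_downloads(before_mbyte, after_mbyte, page_mbyte):
    """Estimate how many page downloads account for the traffic drop."""
    return round((before_mbyte - after_mbyte) / page_mbyte)


before = 8 * MBYTE_PER_GBYTE  # about 8 Gbyte before the fix
after = 500                   # 500 Mbyte after the fix
home_page = 0.25              # about 250 K per home page download
days = 30                     # assumption: a 30-day month

total = redundant_downloads(before, after, home_page)
print(total)          # 30000
print(total // days)  # 1000
```

Under those assumptions the missing file cost roughly 30,000
needless home page downloads in the month, or about 1,000 a day.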
Keep the Spiders Informed, Help Save the World
The web is growing by leaps and bounds. Using a robots.txt file
helps the search engines allocate their resources effectively
and is a tangible sign of respect and courtesy. If you don't
have a robots.txt file on your website, set one up now.
Use it to inform the crawlers how your site is organized and
how often it changes. I think we should all do our part to
avoid wasting resources, save energy and help save the
world.
================================================================
Sante has an engineering degree and has worked the web since
1994. An accomplished speaker, Sante has appeared at many
European SES Conferences, including the first Italian SES held
in Milan in April 2006. Appointed as an ICT consultant to the
regional government in Abruzzo, Sante has also appeared at the
Reykjavik Iceland Internet Marketing Conference and will be
presenting at SES Milan
(http://www.searchenginestrategies.com/sew/italy07/agenda.html)
in late May.
================================================================
Copyright © 2007 Jayde Online, Inc. All Rights Reserved.
SiteProNews is a registered service mark of Jayde Online, Inc.