SiteProNews: September 20, 2006 Feature Article

  Article printed from SiteProNews: http://www.sitepronews.com
  HTML version available at: http://www.sitepronews.com/archives.html
The Google Goal Of Indexing 100 Billion Web Pages
By Danny Wirken (c) 2006

Google's Goal of Quality Search

In their paper 'The Anatomy of a Large-Scale Hypertextual Web
Search Engine', Sergey Brin and Lawrence Page make it evident
that Google's goal has always been to be one of the best search
engines in terms of the quality of its results. Brin and Page
knew, however, that to achieve this Google needed to store
information efficiently and cost-effectively and to have
excellent crawling, indexing, and sorting techniques. Google
aimed not only to give quality results but also to produce them
as fast as possible.

Google started as a high-quality search engine and continues to
be the best search engine today. It has stayed true to its
original intent: to be a search engine that not only crawls and
indexes the web efficiently but also produces more satisfying
results than other existing search engines.

To stay true to the goal of providing the best search results,
Google knew right from the start that it had to be designed so
that the search engine could keep up with the web's growth.
According to Brin and Page, "In designing Google, we have
considered both the rate of growth of the Web and technological
changes. Google is designed to scale well to extremely large
data sets. It makes efficient use of storage space to store the
index". They knew that they would need a great deal of space to
store an ever-growing index.

Google's index, which started out at 24 million web pages, was
large for its time and has since grown to around 25 billion web
pages, still keeping Google ahead of its competitors. However,
Google is a company that doesn't settle for just beating the
competitors. It aims to give its users the best service there
is, and for a search engine that means giving users access to
all, or at least most, of the quality information that is
available on the web.

Google's New System for Indexing More Pages

As mentioned earlier, Google aims to give access to even more
information and has been devoting much time and effort to
realizing this goal. It seems that the patent application
entitled 'Multiple Index Based Information Retrieval System',
filed by Google employee Anna Patterson, might be the answer to
the problem. The application, filed in January of 2005 and
published in May of 2006, shows that Google might actually be
aiming to expand its index size to as many as 100 billion web
pages or even more.

According to the patent, conventional information retrieval
systems, more commonly known as search engines, are able to
index only a small part of the documents available on the
Internet. According to estimates, the existing number of web
pages on the Internet as of last year was around 200 billion;
however, Patterson claimed that even the best search engine
(that is, Google) was able to index only 6 to 8 billion web
pages.
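
To put that gap in perspective, the short Python sketch below
works out what share of the web those figures represent. The
page counts are the estimates quoted above; the function and
variable names are made up for illustration only.

```python
# Figures as quoted in the article (estimates, not measurements).
ESTIMATED_WEB_PAGES = 200_000_000_000   # about 200 billion pages on the web
INDEXED_LOW = 6_000_000_000             # low end of the indexed-page estimate
INDEXED_HIGH = 8_000_000_000            # high end of the indexed-page estimate


def coverage_percent(indexed_pages, total_pages):
    """Return the share of the web that is indexed, as a percentage."""
    return 100.0 * indexed_pages / total_pages


low = coverage_percent(INDEXED_LOW, ESTIMATED_WEB_PAGES)
high = coverage_percent(INDEXED_HIGH, ESTIMATED_WEB_PAGES)
print(f"Indexed share of the web: {low:.0f}% to {high:.0f}%")
```

On these estimates, even the largest index covered only about
3 to 4 percent of the pages thought to exist.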

The disparity between the number of indexed pages and existing
pages clearly signaled a need for a new breed of information
retrieval system. Conventional information retrieval systems
simply could not index enough web pages to give users access to
a large enough percentage of the information available on the
web.

The Multiple Index Based Information Retrieval System, however,
is up to the challenge and is Google's answer to the problem.
Two characteristics of the new system make it stand out
compared to the conventional systems. One is that it has the
"capability to index an extremely large number of documents, on
the order of a hundred billion or more". And the other is its
capability to "index multiple versions or instances of documents
for archiving...enabling a user to search for documents within a
specific range of dates, and allowing date or version related
relevance information to be used in evaluating documents in
response to a search query and in organizing search results."
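
To make the second capability more concrete, here is a toy
Python sketch of an index that keeps several dated versions of
a document and lets a user search within a range of dates. It is
only an illustration of the idea and not the method described in
the patent; the class name, method names, and sample URLs are all
hypothetical.

```python
from collections import defaultdict
from datetime import date


class VersionedIndex:
    """Toy inverted index that keeps every dated version of a document."""

    def __init__(self):
        # term -> set of (url, version_date) pairs whose text contains the term
        self._postings = defaultdict(set)

    def add_version(self, url, version_date, text):
        """Index one dated version of the document at `url`."""
        for term in set(text.lower().split()):
            self._postings[term].add((url, version_date))

    def search(self, term, start, end):
        """Return (url, date) hits for `term` dated within [start, end]."""
        hits = [
            (url, version_date)
            for url, version_date in self._postings.get(term.lower(), ())
            if start <= version_date <= end
        ]
        # Newest versions first; ties are ordered by URL, descending.
        return sorted(hits, key=lambda hit: (hit[1], hit[0]), reverse=True)


index = VersionedIndex()
index.add_version("example.com/a", date(2005, 1, 10), "google index size")
index.add_version("example.com/a", date(2006, 5, 2), "google index patent")
index.add_version("example.com/b", date(2006, 3, 1), "search engine index")

# Only versions dated in 2006 are returned, newest first.
results = index.search("index", date(2006, 1, 1), date(2006, 12, 31))
for url, version_date in results:
    print(version_date.isoformat(), url)
```

A real system at this scale would need far more than a single
in-memory dictionary, but the sketch shows why storing the date
alongside each version makes date-restricted searches possible.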

With the new system developed by Patterson, Google now has the
ability to expand its index size to unbelievable proportions as
well as improve document analysis and processing, document
annotation, and even the process of ranking according to
contained and anchor phrases.

History of Google's Index Size

Google started out with an index size of around 24 million web
pages in 1996. By August of 2000, Google had grown its index
roughly fortyfold, to approximately one billion web pages. In
September of 2003, Google's front page boasted an index of
3.3 billion web pages. Microdoc, however, revealed that the
actual number of web pages Google had indexed during that time
was already more than five billion web pages. In their article
'Google Understates the Size of Its Database', they emphasized
that Google not only specialized in simplicity but also in
understating their power and complexity. Google was still
managing to stay ahead of its competitors and continued to
surprise everyone with what they had up their sleeves.

As Google's index continued to grow, the number on their front
page grew impressively large as well before it plateaued at eight
billion web pages. This was around the time that Patterson filed
the new patent. Then in 2005, with controversies in index size
growing, Google decided to stop counting in front of the public
and simply claimed that their index size was three times larger
than the nearest competitor's index size. Google also maintained
that it was not just the size of indexed pages that was
important but how relevant the results they returned were.

Then in September of 2005, as part of Google's 7th anniversary,
Anna Patterson, the same software engineer who filed the patent
on the Multiple Index Based Information Retrieval System, posted
an entry on Google's official blog claiming that the index size
was now 1,000 times larger than the original index. This pegged
their index size at around 24 billion web pages, about a fourth
of Google's goal of indexing 100 billion web pages. It seems
then that Google must have started using the new system in
mid-2005. With the new system in place, we can only wait and see
how fast Google will reach the goal of 100 billion web pages in
its index. It's most likely, though, that when Google has reached
that goal it will set an even higher goal to provide continuous
quality service.
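
For readers who want to check the arithmetic behind these
figures, the following Python sketch reproduces it. The numbers
are the ones quoted in this article; the variable names are
illustrative only.

```python
# Figures as quoted in the article.
ORIGINAL_INDEX_PAGES = 24_000_000      # Google's original index size
GROWTH_FACTOR = 1_000                  # "1,000 times larger" per the blog entry
GOAL_PAGES = 100_000_000_000           # the 100 billion page goal

current_pages = ORIGINAL_INDEX_PAGES * GROWTH_FACTOR
share_of_goal = current_pages / GOAL_PAGES
remaining_pages = GOAL_PAGES - current_pages

print(f"Implied index size: {current_pages:,} pages")
print(f"Share of the goal:  {share_of_goal:.0%}")
print(f"Still to index:     {remaining_pages:,} pages")
```

The implied index of 24 billion pages is 24 percent of the goal,
which leaves 76 billion pages still to be indexed.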
================================================================
Danny Wirken is co-owner of http://www.theinternetone.net,
an internet marketing website that primarily focuses on the many
aspects, methodologies and processes that are used in internet
marketing.
================================================================

Copyright © 2006 Jayde Online, Inc.  All Rights Reserved.

SiteProNews is a registered service mark of Jayde Online, Inc.