
January 26, 2011

How Search Engines Work: A Case Study of Google

Search engines, especially Google, have evolved technologically (among other respects) over the years. The computing power of the software and hardware the search giant now deploys is best appreciated by looking at the functions it performs and its wide reach.

How Search Engines Work

Broadly speaking, search engines’ functions can be divided into three:

* Crawling

This is the use of special software, commonly known as bots, crawlers or spiders, to access information on websites, principally through three means:

1. Links from other websites already in the search engine’s index or gathered while crawling

2. URLs/links submitted by webmasters

3. Sitemaps submitted by webmasters

One might visualize a bot as an object crawling rapidly all over the web, following links to reach different websites. In reality that is not the case. A bot operates from a particular physical location and is akin to your web browser: it sends requests to web servers and downloads (fetches) the pages they return. In doing so it learns of new web pages, updated web pages and dead links, all of which are used to update the engine’s index. As web pages are crawled, new links detected on them are added to the engine’s list of pages to crawl.
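
The crawl loop described above can be sketched in a few lines of Python. This is only an illustration, not how any real search engine is implemented: the `FAKE_WEB` dictionary, the example URLs and the `crawl` function are all hypothetical, and the dictionary lookup stands in for the HTTP requests a real bot would send.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page.
# A real crawler would fetch each URL over HTTP instead of reading a dict.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://gone.example/"],
    "http://c.example/": ["http://a.example/"],
}


def crawl(seed_urls, fetch_links):
    """Breadth-first crawl. Returns (pages crawled in order, dead links)."""
    frontier = deque(seed_urls)  # the engine's list of pages still to crawl
    seen = set(seed_urls)        # URLs already queued, so none is queued twice
    crawled, dead = [], []
    while frontier:
        url = frontier.popleft()
        links = fetch_links(url)  # None stands for a failed request
        if links is None:
            dead.append(url)
            continue
        crawled.append(url)
        for link in links:
            if link not in seen:  # a newly detected link joins the list
                seen.add(link)
                frontier.append(link)
    return crawled, dead


crawled, dead = crawl(["http://a.example/"], FAKE_WEB.get)
print(crawled)
print(dead)
```

Starting from a single seed URL, the loop reaches every page that is linked from a page it has already fetched, and records the link that leads nowhere as dead.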

Crawling involves a trade-off between minimizing the resources spent on crawling and maintaining an up-to-date index. The engine tries to avoid re-crawling and re-indexing unchanged web pages while still capturing every changed page, so that its index stays current.
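
One simple way to avoid re-indexing unchanged pages is to keep a fingerprint of each page's content and compare it on the next visit. The sketch below assumes that approach purely for illustration; the function names and sample pages are hypothetical, and the article does not say which change-detection method Google actually uses.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 digest that changes whenever the content changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def pages_to_reindex(stored_fingerprints, fetched_pages):
    """Return the URLs whose content is new or differs from the last crawl.

    stored_fingerprints: {url: digest} from earlier crawls (updated in place)
    fetched_pages:       {url: content} from the current crawl
    """
    changed = []
    for url, content in fetched_pages.items():
        digest = fingerprint(content)
        if stored_fingerprints.get(url) != digest:
            changed.append(url)
            stored_fingerprints[url] = digest
    return changed


store = {}
first = pages_to_reindex(store, {"/home": "welcome", "/about": "about us"})
second = pages_to_reindex(store, {"/home": "welcome", "/about": "about us, updated"})
print(first)   # both pages are new on the first crawl
print(second)  # only the page whose content changed
```

This saves indexing work but still requires downloading every page, so it addresses only part of the trade-off.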

* Indexing

The search engine stores the pages its crawlers retrieve from various websites in a massive index database. It sorts this information by search term and arranges it in alphabetical order, which enables rapid retrieval of documents from the index when search queries demand them. It processes the words in the web pages, noting where keywords occur within each page, e.g. in title tags or alt attributes. The engine processes many, but not all, content types. For example, it cannot process the content of some rich media files or dynamic pages.
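
An index organized by search term is commonly called an inverted index. The following is a minimal sketch of one that also records where each word occurs, assuming pages arrive as a dictionary of named fields. The `build_index` function, the field names and the sample pages are hypothetical, and a production index is far more elaborate.

```python
from collections import defaultdict


def build_index(pages):
    """Build {term: [(url, field, position), ...]} with terms in A-Z order.

    pages: {url: {field_name: text}}, e.g. fields "title" and "body".
    """
    index = defaultdict(list)
    for url, fields in pages.items():
        for field, text in fields.items():
            for position, word in enumerate(text.lower().split()):
                # Record which page, which part of the page, and where in it.
                index[word].append((url, field, position))
    # Arrange the terms alphabetically, as the article describes.
    return dict(sorted(index.items()))


pages = {
    "p1": {"title": "Search Engines", "body": "how search works"},
    "p2": {"title": "Cooking", "body": "search recipes"},
}
index = build_index(pages)
print(list(index))
print(index["search"])
```

Looking up a term is then a single dictionary access, which is what makes retrieval fast at query time.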

To improve search performance, the search engine ignores (does not index) common words called stop words, such as the, is, on, or, of, how and why, as well as certain single digits and single letters. These words are so common that they do little to narrow a search, and can therefore safely be ignored. To improve its performance, the indexer also ignores some punctuation and multiple spaces, and converts all letters to lowercase.
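
The normalization steps above might look like this in Python. The stop-word list is just the handful of examples named in the text, and dropping every single-character token is a simplifying assumption (the article says only "certain" single digits and letters are ignored); `normalize` is a hypothetical helper name.

```python
import re

# Only the example stop words given in the text; real lists are longer.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}


def normalize(text):
    """Lowercase the text, strip punctuation and extra spaces, drop stop words."""
    # findall keeps runs of letters and digits, so punctuation and
    # repeated whitespace disappear in the same step.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: every single-character token is dropped.
    return [w for w in words if w not in STOP_WORDS and len(w) > 1]


print(normalize("How   the Search-Engine  works, or why?"))
print(normalize("Page 2 of a Guide"))
```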

* Search Query Processor

This is the part most search users are familiar with, and it is often erroneously regarded as the “search engine” itself. It comprises several components, the most visible being the search box or interface through which the user interacts with the search engine and submits a search query for processing.

When a user submits a query through the interface, the query processor rapidly retrieves the most relevant documents from the index. Relevance is determined algorithmically from more than 200 ranking factors. A key factor among these is PageRank, a measure of the importance of a web page that is determined by the number and quality of links pointing to it. It is important to stress, however, that not all links are equal: links from highly ranked web pages are considered more powerful than links from low-ranked web pages.
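
The idea that a link from a highly ranked page counts for more can be illustrated with a simplified PageRank computation. This sketch follows the iterative formulation from the original published description of PageRank, with the commonly cited damping factor of 0.85. The tiny link graph and the `pagerank` function are hypothetical, and the sketch ignores the many other ranking factors the article mentions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    links: {page: [pages it links to]}; every linked page must also be a key.
    Returns {page: rank}, with ranks summing to 1.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its own rank evenly among the pages it links
                # to, so links from high-ranked pages pass on more.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: a, b and d all link to c; c links back to a only.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))
```

In this graph, page c ranks highest because three pages link to it. Page a has only one incoming link, but because that link comes from the highly ranked c, a still ends up far ahead of b and d, which have no incoming links at all.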

Why not visit => where many internet marketing newbies are attaining internet marketing success. Dele Ojewumi is an Internet Marketer, Chartered Accountant and Economist and the webmaster of =>