SiteProNews: February 25, 2008 Feature Article

To Print: Click here or Select File/ Print from your Browser Menu.


  Article printed from SiteProNews: http://www.sitepronews.com
  HTML version available at: http://www.sitepronews.com/archives.html

How Google Applies Science to Search (Part 1)
By Kalena Jordan (c) 2008

Dr. Craig Nevill-Manning is a New Zealander who joined Google in
2000 as a Senior Research Scientist to develop more precise
search techniques. Previously, Craig was an assistant professor
at the Computer Science Department of Rutgers University, where
he conducted research in data compression, information retrieval
and computational biology. Before that, he was a post-doctoral
fellow in the Biochemistry Department of Stanford University,
where he developed a software suite used by pharmaceutical
research laboratories to identify the role of particular
proteins within cells.

A scientist at heart, Craig is probably best known as the
developer of Froogle [http://www.google.com/products] (recently
re-named Google Product Search) and the founder of Google's
software engineering center in New York City.

This article is a summary of his presentation at Webstock 2008
(http://www.webstock.org.nz/).

Google's Spelling Bee

Craig started his presentation by talking about one of his first
challenges in his job at Google: the spelling correction tool.
As the popularity of the search engine grew, Google needed to be
able to spell-correct lots of obscure words. So his solution was
to take a sampling of content from the entire web. Craig's team
came up with a algorithmic model and ran it over the web. He
discovered that there were several correct answers to the same
question. For example, words like "kofee" could mean either the
searcher is seeking a cup of java or information about
Kofee/Kofi Anan.

To combat this, Craig came up with an interesting solution: the
"Did you mean?" alternative spelling option, based on
predictive examples of searcher spelling patterns. You can see
this in action if you type in "kofee anan" in Google. Above
the search results is a line that reads: "Did you mean: kofi
annan" and links to the search results for this spelling
variation too.

But the research went even further. Craig's team worked out how
to take into account the context of the search query by studying
the two or three other keywords surrounding the query, for
example "kofee cup" or "kofee anan". The research used the
science of bigrams (http://en.wikipedia.org/wiki/Bigram) and
trigrams (http://en.wikipedia.org/wiki/Trigram) to better
understand how people search. Bigrams are groups of two written
letters, two syllables, or two words, very commonly used as the
basis for simple statistical analysis of text. So Craig and his
team applied this knowledge to Google's spelling correction
system and now, Google's algorithm can determine the searcher's
intent with much more accuracy, based on the context of the
search query.

As an example of the spelling challenges that Google faces, Craig
showed the audience the huge number of ways "Britney Spears"
is misspelled (http://labs.google.com/britney.html) on the web.
He said it's encouraging to see that the most popular spelling
is also the most correct one. Scale is important!

Google Maps Lead to Apps

The Google team wrote the code for Google Maps
(http://maps.google.com/) many years ago but the code was
actually built into your browser. When Google maps first
launched, people took the dense data-script and worked out
how to reverse engineer it for their own use. Google engineers
decided to release an API key to make these mash-ups easier
after seeing so many people reverse engineer Google Maps without
Google's help. Now people can mash-up Google maps within minutes
to create their own applications.

To show how easy this is to do, Craig took the audience through
the steps to create an interactive application with Google Maps.
In the space of about two minutes, he signed up for an API key,
grabbed the HTML code and pasted it into his page. He then
hacked the map to show Wellington Town Hall (our location) and
made the point how easy it is to create really useful tools out
of technology that is already available.

As an example, Craig showed the audience Seattle Bus Monster
(http://www.busmonster.com/). This site uses an API key for
Google Maps to make Seattle bus data and tracking available 24/7.
Anyone who needs to catch a bus can look online and instantly
find their nearest bus location and run to the bus stop in time
to catch it. It's these types of interactive applications that
add value to both corporate and government sites.

Craig referenced Rodney Brooks from MIT whose provocative paper
"Fast, Cheap and Out of Control" offered new logic and a
completely different view of machines. The idea is that there is
no center of control among robots so you should make lots of
them; don't treat them so precious. Craig said developers
should use this logic to create lots of small apps that you can
replicate and tweak, rather than one big expensive app that can
go horribly wrong. Scale trumps smarts every time!

Experiments in Scale That Have Impacted Google's Operations

Precision vs. Recall

Back in the early 90's, information retrieval on the web was
limited to things like Lexus/Nexus. So at that stage, Google
would take queries and apply it to the broadest possible search.
This was great recall at the cost of precision. But Larry and
Sergey wanted something better so they decided to use Boolean
search. At the time it was heresy because everything was focused
on recall. But the Google founders knew that things had to be
super relevant so they developed an algorithm - the core
algorithm. It was very simple and relied on Boolean search to
determine relevancy.

Genomic Sequencing

In the mid 90's a large project - the Human Genome Project
(http://en.wikipedia.org/wiki/Human_Genome_Project) - was
underway. The race was on to sequence the genome. Scientists
decided to feed this out to a bunch of different people. They
chopped up the genome for researchers everywhere and allowed it
to replicate. The researchers mapped each chunk with genetic
markers and computed a tiling path of tiny fragments.

Sequencing was very expensive, so the data was computed based on
a minute number of chunks - very labor intensive. The sequencing
took forever and reassembling was a long way off. But then a
company came along that said they could do it faster. Sequencing
becomes cheaper by automating the job using machines rather than
individual people so this company used a clever computer
algorithm to conduct the sequencing. This reduced the cost and
the researchers were therefore able to reassemble more fragments
and achieve a rough draft of the genome in 2000. This sequencing
approach was the shotgun approach, where accuracy is lower, but
the larger scale allowed the impossible to become possible.

Web Definitions

Google used to do a terrible job of defining terms. Craig
noticed people were searching for "definition of...", or
"what is a...." etc so he wanted the search engine to provide
better results for these searches. He found lots of web pages
that contained glossaries and definitions, so he hacked up a
Perl script to get the glossary formats.

The first recall results were only 50 percent accurate. He
wanted to improve this rate, so he did some experiments with the
data. But he could never reach an accuracy level he was happy
with. It was later he realized that most of the questions people
actually needed answers to could be answered with his crappy
little Perl script. He concluded that 100 percent accuracy is
not important, that scale is much more important.

Now Google allows you to use the "definition:" query and the
question format to get definitions from around the web. Type in
"what is a blog?" and you'll get lots of results from
Craig's original script.

Protein Sequencing

In biology, Craig says, you're constantly producing proteins.
The proteins fold up with particular sequencing. Within
computing, you can use this knowledge to do amazing things. You
can conduct computations with this type of data but it's time
consuming. Somebody at Stanford University noticed that proteins
spend a lot of time moving about before folding into an alpha
helix. So it was suggested they start the computations with lots
of configurations. In this way you can parallelize the data by
scale and one will be magically close to a folded protein. So
they worked out a way to reduce the problem to a simple process
based on mass scale. This is why Google uses maximum scale to
conduct algorithmic computations.

Chess vs. Go

You can now compute the value of any potential move in chess.
Based on that information, you can compute your projected
probability of winning the game from any move. Chess grand
masters put a lot of time into this knowledge. But the opposite
is true for the game Go, because there is more randomness to the
game play.

(Stay tuned for Part 2)
================================================================
Article by Kalena Jordan, one of the first search engine
optimization experts in Australia, who is well known and
respected in the industry, particularly in the U.S. As well as
running a daily Search Engine Advice Column
(http://www.searchenginecollege.com/blog.htm), Kalena manages
Search Engine College (http://www.searchenginecollege.com/)- an
online training institution offering instructor-led short courses
and downloadable self-study courses in Search Engine Optimization
and other Search Engine Marketing subjects.
================================================================

Copyright © 2008 Jayde Online, Inc.  All Rights Reserved.

SiteProNews is a registered service mark of Jayde Online, Inc.