SiteProNews: July 10, 2006 Feature Article

  Article printed from SiteProNews: http://www.sitepronews.com
  HTML version available at: http://www.sitepronews.com/archives.html
Pushing Bad Data - Google's Latest Black Eye
By Eric Lester (c) 2006

Google stopped counting, or at least publicly displaying, the
number of pages it indexed in September 2005, after a
school-yard "measuring contest" with rival Yahoo. That count
topped out around 8 billion pages before it was removed from
the homepage. News broke recently through various SEO forums
that Google had suddenly, over the past few weeks, added
another few billion pages to the index. This might sound like a
reason for celebration, but this "accomplishment" would not
reflect well on the search engine that achieved it.

What had people buzzing was the nature of these fresh few
billion pages. They were blatant spam, containing Pay-Per-Click
(PPC) ads and scraped content, and in many cases they were
showing up well in the search results, pushing out far older,
more established sites in the process. A Google representative
responded to the issue via forums by calling it a "bad data
push," an explanation that was met with groans throughout the
SEO community.

How did someone manage to dupe Google into indexing so many
pages of spam in such a short period of time? I'll provide a
high-level overview of the process, but don't get too excited.
Like a diagram of a nuclear explosive, it isn't going to teach
you how to make the real thing; you won't be able to run off
and do it yourself after reading this article. Still, it makes
for an interesting tale, one that illustrates the ugly problems
cropping up with ever-increasing frequency in the world's most
popular search engine.

A Dark and Stormy Night

Our story begins deep in the heart of Moldova, sandwiched
scenically between Romania and Ukraine. In between fending
off local vampire attacks, an enterprising local had a
brilliant idea and ran with it, presumably away from the
vampires... His idea was to exploit how Google handled
subdomains, and not just a little bit, but in a big way.

The heart of the issue is that, currently, Google treats
subdomains much the same way it treats full domains: as
unique entities. This means it will add the homepage of a
subdomain to the index and return at some point later to do a
"deep crawl." A deep crawl is simply the spider following links
from the homepage deeper into the site until it finds
everything, or gives up and comes back later for more.
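
To make the "deep crawl" idea concrete, here is a minimal sketch
in Python. It is not Google's crawler: the function name
deep_crawl, the max_pages budget, and the toy link graph are all
assumptions for illustration. It walks outward from a homepage,
breadth first, until it runs out of links or hits its page
budget.

```python
from collections import deque

def deep_crawl(links, homepage, max_pages=1000):
    """Breadth-first walk of a link graph, starting at the homepage.

    `links` maps each page to the pages it links to (a hypothetical
    in-memory stand-in for fetching and parsing real pages).
    `max_pages` models the spider "giving up" for now.
    """
    seen = {homepage}
    queue = deque([homepage])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order

# Toy site: the homepage links to two pages, one of which links deeper.
toy_site = {
    "/": ["/about", "/products"],
    "/products": ["/products/a", "/products/b"],
    "/products/a": ["/"],
}
crawled = deep_crawl(toy_site, "/")
```

With a small max_pages the walk stops early, which is the
"comes back later for more" case.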

Briefly, a subdomain is a "third-level domain." You've probably
seen them before; they look something like this:
subdomain.domain.com. Wikipedia, for instance, uses them for
languages: the English version is "en.wikipedia.org" and the
Dutch version is "nl.wikipedia.org." Subdomains are one way to
organize large sites, as opposed to multiple directories or
even separate domain names altogether.
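
As a rough illustration of the naming, the Python sketch below
splits a URL's hostname into its subdomain part and its parent
domain. The helper name split_host is hypothetical, and the
"last two labels" rule is a simplifying assumption: it works for
names like en.wikipedia.org but is wrong for suffixes such as
.co.uk, where a real tool would consult the Public Suffix List.

```python
from urllib.parse import urlsplit

def split_host(url):
    """Return (subdomain, domain) for a URL.

    Naive rule (an assumption for this sketch): the last two
    dot-separated labels are the domain, anything before them
    is the subdomain.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    domain = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    return subdomain, domain

english = split_host("http://en.wikipedia.org/wiki/Main_Page")
dutch = split_host("http://nl.wikipedia.org/")
bare = split_host("http://wikipedia.org/")
```

Here english and dutch share the same domain and differ only in
the subdomain, while bare has an empty subdomain.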

So, we have a kind of page Google will index virtually "no
questions asked." It's a wonder no one exploited this situation
sooner. Some commentators believe the reason is that this
"quirk" was introduced only after the recent "Big Daddy"
update. Our Eastern European friend got together some servers,
content scrapers, spambots, PPC accounts, and some
all-important, very inspired scripts, and mixed them all
together thusly...

Five Billion Served - And Counting...

First, our hero crafted scripts for his servers that would,
when Googlebot dropped by, start generating an essentially
endless number of subdomains, each with a single page
containing keyword-rich scraped content, keyworded links, and
PPC ads for those keywords. Spambots were then sent out to put
Googlebot on the scent via referral and comment spam on tens of
thousands of blogs around the world. The spambots provided the
broad setup, and it doesn't take much to get the dominoes to
fall.

Googlebot finds the spammed links and, as is its purpose in
life, follows them into the network. Once Googlebot is inside
the web, the scripts running the servers simply keep
generating pages, page after page, each with a unique subdomain
and each with keywords, scraped content, and PPC ads. These
pages get indexed, and suddenly you've got yourself a Google
index 3-5 billion pages heavier in under 3 weeks.

Reports indicate that, at first, the PPC ads on these pages
were from AdSense, Google's own PPC service. The ultimate irony,
then, is that Google benefits financially from all the
impressions being charged to AdSense users as they appear
across these billions of spam pages. The AdSense revenues from
this endeavor were the point, after all: cram in so many pages
that, by sheer force of numbers, people would find and click on
the ads in those pages, making the spammer a nice profit in a
very short amount of time.

Billions or Millions? What is Broken?

Word of this achievement spread like wildfire from the
DigitalPoint forums, at least within the SEO community. The
"general public" is, as of yet, out of the loop, and will
probably remain so. A response by a Google engineer appeared on
a Threadwatch thread about the topic, calling it a "bad data
push." Basically, the company line was that they have not, in
fact, added 5 billion pages. Later claims include assurances
that the issue will be fixed algorithmically. Those following
the situation (by tracking the known domains the spammer was
using) see only that Google is removing them from the index
manually.

The tracking is accomplished using the "site:" command, which
theoretically displays the total number of indexed pages from
the site you specify after the colon. Google has already
admitted there are problems with this command, and "5 billion
pages," they seem to be claiming, is merely another symptom of
it. These problems extend beyond the site: command to the
displayed number of results for many queries, which some feel
is highly inaccurate and in some cases fluctuates wildly.
Google admits they have indexed some of these spammy
subdomains, but so far haven't provided any alternate numbers
to dispute the 3-5 billion shown initially via the site:
command.

Over the past week the number of spammy domains and
subdomains indexed has steadily dwindled as Google personnel
remove the listings manually. There's been no official
statement that the "loophole" is closed. This poses the obvious
problem that, since the way has been shown, there will be a
number of copycats rushing to cash in before the algorithm is
changed to deal with it.
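
What an algorithmic check might look like is anyone's guess.
Purely as an illustration, and not as a description of anything
Google does, the Python sketch below counts distinct subdomains
per parent domain in a list of indexed URLs and flags any domain
over a threshold. The function names, the threshold, and the
sample URLs on the reserved example.com domain are all
assumptions, and the "last two labels" rule for finding the
parent domain is the same simplification noted earlier.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def subdomain_counts(urls):
    """Map each parent domain to its number of distinct hostnames."""
    hosts_by_domain = defaultdict(set)
    for url in urls:
        host = urlsplit(url).hostname or ""
        labels = host.split(".")
        if len(labels) < 2:
            continue  # skip anything that is not a dotted hostname
        # Naive parent domain: last two labels (an assumption).
        hosts_by_domain[".".join(labels[-2:])].add(host)
    return {domain: len(hosts) for domain, hosts in hosts_by_domain.items()}

def flag_suspicious(urls, threshold=1000):
    """Return domains with more distinct hostnames than `threshold`."""
    counts = subdomain_counts(urls)
    return sorted(d for d, n in counts.items() if n > threshold)

# Made-up sample: five one-page subdomains on one domain, two on another.
sample = [f"http://kw{i}.example.com/" for i in range(5)]
sample += ["http://en.wikipedia.org/wiki/A", "http://nl.wikipedia.org/wiki/B"]
flagged = flag_suspicious(sample, threshold=3)
```

A raw count like this would also flag legitimate sites that
host many subdomains, so a real fix would need more signals
than this one.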

Conclusions

There are, at minimum, two things broken here: the site:
command and the obscure, tiny bit of the algorithm that allowed
billions (or at least millions) of spam subdomains into the
index. Google's current priority should probably be to close
the loophole before they're buried in copycat spammers. The
issues surrounding the use or misuse of AdSense are just as
troubling for those who might be seeing little return on their
advertising budget this month.

Do we "keep the faith" in Google in the face of these events?
Most likely, yes. It is not so much whether they deserve that
faith, but that most people will never know this happened. Days
after the story broke there's still very little mention in the
"mainstream" press. Some tech sites have mentioned it, but this
isn't the kind of story that will end up on the evening news,
mostly because the background knowledge required to understand
it goes beyond what the average citizen is able to muster. The
story will probably end up as an interesting footnote in that
most esoteric and neoteric of worlds, "SEO History."
================================================================
Mr. Lester worked in the IT industry for 5 years, acquiring
knowledge of hosting and website design, before serving for 5
years as the webmaster for Apollo Hosting
(http://www.apollohosting.com). Apollo Hosting provides website
hosting, ecommerce hosting, VPS hosting, and web design services
to a wide range of customers.
================================================================

Copyright © 2006 Jayde Online, Inc.  All Rights Reserved.

SiteProNews is a registered service mark of Jayde Online, Inc.