April 2, 2009

SMX Sydney: Unraveling URLs and Demystifying Domains

The following is a live-blogged account of the session Unraveling URLs and Demystifying Domains, presented by Greg Grothaus, Senior Software Engineer at Google.

Greg says he is a geek who has been a software engineer at Google for 4 years. He works closely with Matt Cutts and his crack anti-spam taskforce.

Greg starts his presentation by breaking down the anatomy of a URL. The first parts are the Domain and the Host, followed by:

/maps/mm: is the Path

q=sydney: is the Query

#SMX: is the Fragment

Google doesn’t index fragments.
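The breakdown above can be reproduced with Python's standard library. This is only a sketch: the URL below is hypothetical (the host `maps.example.com` is an assumption), chosen to match the path, query and fragment listed above.

```python
from urllib.parse import urlsplit, urldefrag

# Hypothetical URL; only the path, query and fragment come from the talk notes.
url = "http://maps.example.com/maps/mm?q=sydney#SMX"

parts = urlsplit(url)
print("Host:    ", parts.hostname)
print("Path:    ", parts.path)
print("Query:   ", parts.query)
print("Fragment:", parts.fragment)

# The fragment is not indexed, so the address that matters is the URL
# with the fragment removed.
without_fragment = urldefrag(url).url
print("Without fragment:", without_fragment)
```

Two URLs that differ only in their fragment reduce to the same address here, which is why a fragment cannot be used to distinguish pages for indexing.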

Google’s Guidelines for Site Structure:

General tips:

1) Use a tree-like organization

2) Use similar topics in the same part of the tree

3) Sub-domains vs sub-directories? Don’t fret, but directories are often easier

4) Multiple domains:

– won’t get tabbed over UI

– more results from

– harder to build reputation

– still can be the best option

5) Link everything together in an organized way

Google’s Guidelines for URLs

Keep URLs:

– organized

– sharable between users (each item reference-able among friends)

– linked within several hops (no orphan pages)

– largely unique in content per URL (avoid serving different languages on the same URL)

Duplicate Content

Some URL variations are treated as different pages, while others are seen as the same by Google. So how do you fix duplicate content issues?

Canonical means reduced to the simplest and most significant form possible without loss of generality. Therefore:

– pick one canonical URL for each page and ensure you link consistently within your site

– make all the non-canonical URLs return a permanent (301) redirect to the canonical URL

– in Google’s Webmaster Tools, specify your www vs. non-www preference in the console
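As a rough illustration of the first two points, the sketch below decides whether a requested URL should receive a 301 and where it should point. The host names, the choice of the www form as canonical, and the rule that `/index.html` is an alias of its directory are all assumptions made for the example, not something stated in the session.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"   # assumption: the www form is canonical
ALIAS_HOSTS = {"example.com"}        # assumption: other hosts serving the site


def redirect_for(url):
    """Return (301, canonical_url) for a non-canonical URL, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != CANONICAL_HOST and host not in ALIAS_HOSTS:
        return None  # not one of our hosts, nothing to do

    path = parts.path or "/"
    # Assumed alias rule: /dir/index.html is the same page as /dir/
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]

    canonical = urlunsplit((parts.scheme, CANONICAL_HOST, path, parts.query, ""))
    if canonical == url:
        return None  # already canonical, serve the page normally
    return 301, canonical


print(redirect_for("http://example.com/tents/index.html"))
print(redirect_for("http://www.example.com/tents/"))
```

Internal links would then always use the URL this function treats as canonical, so that visitors and crawlers rarely hit the redirect at all.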

Google’s Guidelines for Proper use of Response Codes

Use 301s for permanent redirects. A 301 signals to a search engine that it should transfer the old URL’s properties to the new one.

A typical duplicate content issue is the printer-friendly version of your page. Another arises on dynamic sites that show the navigation path in the URL and offer several ways to reach the same page, e.g. tents/bags/red/tent_bag.html vs bags/tents/red/tent_bag.html

New Option for Duplicate Content

Use the Canonical Link Element, added at page level:


<link rel="canonical" href="">


For more information, search Google for “specify your canonical” to find the details of the canonical tag.

Use this only within the same domain; it does work across sub-domains and hosts. You can use it instead of a 301 to resolve canonical issues, but it should only be used for pages that are identical or very similar.
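A minimal sketch of how a template helper might emit the element, assuming a hypothetical site at `www.example.com` where both navigation-path variants of the tent bag page declare the same canonical URL. The helper name `canonical_link` is made up for this example.

```python
from html import escape


def canonical_link(canonical_url):
    """Build the canonical link element for a page's <head>."""
    # Escape the URL so characters such as & are valid inside the attribute.
    return '<link rel="canonical" href="{}">'.format(escape(canonical_url, quote=True))


# Assumption: this variant was picked as the canonical one.
CANONICAL = "http://www.example.com/bags/tents/red/tent_bag.html"

# Both duplicate paths would serve a page containing the same element.
for path in ("tents/bags/red/tent_bag.html", "bags/tents/red/tent_bag.html"):
    print(path, "->", canonical_link(CANONICAL))
```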

Absolute vs Relative URLs

Google suggests that you use absolute URLs when structuring your site. They are better for Googlebot and leave less room for error. Google CAN follow a chain of canonicals, but don’t count on it: point directly to the final URL.
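To see why relative URLs leave more room for error, the sketch below resolves a few relative links the way a crawler has to, against the URL of the page they appear on. The page URL and link targets are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(page_url, hrefs):
    """Resolve each href against the URL of the page that contains it."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical page and links.
page_url = "http://www.example.com/bags/tents/red/tent_bag.html"
hrefs = ["../blue/tent_bag.html", "/contact.html", "tent_pegs.html"]

for href, absolute in zip(hrefs, absolutize(page_url, hrefs)):
    print(href, "->", absolute)
```

The same relative link resolves to a different address when the page is reached under a different path, whereas an absolute link always points to one place.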

Make your Site Accessible Without Forms

– most often, search engine crawlers don’t complete pull-down menus within forms

– very rarely can search engines follow JavaScript links

– don’t count on Flash content being indexed the way you want

Rich Media with HTML Content and Navigation

– use image tags

– use text descriptions

– consider sIFR for Flash

– Google can index Flash directly but it’s not perfect, therefore consider using sIFR

– JavaScript detects whether Flash is installed, but Google has trouble indexing JavaScript

For more information on sIFR, do a Google search for “best uses in Flash”.

Eliminate Soft 404s

– soft 404s confuse users

– soft 404s can cause duplicate content for search engines, and crawlers may not discover new pages

– tell Google which pages are real and which aren’t by returning a true 404 status for missing pages, so they don’t register as soft 404s. Use Google Webmaster Tools to find and eliminate them.
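A soft 404 is a "not found" page served with a 200 status code. The sketch below shows the intended behaviour for a toy site, plus a crude check for the problem. The page table and the "not found" text heuristic are assumptions for illustration, not how Google detects soft 404s.

```python
# Hypothetical site content, keyed by request path.
PAGES = {"/": "Home", "/tents/": "Tents"}


def respond(path):
    """Return (status_code, body) for a request path."""
    if path in PAGES:
        return 200, PAGES[path]
    # A soft 404 would return 200 here; a real 404 tells crawlers
    # that the page does not exist.
    return 404, "Page not found"


def looks_like_soft_404(status, body):
    """Crude heuristic: an error message delivered with a success status."""
    return status == 200 and "not found" in body.lower()


print(respond("/tents/"))
print(respond("/no-such-page"))
print(looks_like_soft_404(200, "Sorry, page not found"))
```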

Submit a General Web Sitemap

Sitemaps can influence Google’s understanding of your site.

Declare your sitemap URL in robots.txt. You can also upload several specific XML sitemaps for your rich media, e.g. news, video etc.
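As a sketch of what this involves, the code below builds a minimal XML sitemap in the sitemaps.org format and the matching `Sitemap:` line for robots.txt. The URLs and the sitemap file name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap document listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def robots_sitemap_line(sitemap_url):
    """Return the robots.txt line that declares where the sitemap lives."""
    return "Sitemap: " + sitemap_url


# Hypothetical site URLs.
sitemap_xml = build_sitemap(["http://www.example.com/", "http://www.example.com/tents/"])
print(sitemap_xml)
print(robots_sitemap_line("http://www.example.com/sitemap.xml"))
```

A real sitemap file would normally also start with an XML declaration, which this sketch leaves out.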

Indexing Stats Include Sitemap Submissions

Your sitemap statistics in Google Webmaster Tools can tell you a lot about your site’s health, the status of your sitemaps, and how they are being indexed by Google.