March 24, 2008
Duplicate content is a hotly debated issue when it comes to how it affects your web-site ranking. And it’s become an even bigger issue over time as spammers and other malicious Internet users have taken to the practice of content scraping, or scraping the content from a web site to use on their own with only minor changes to the appearance, not to the content itself.
Content scraping has become such a problem that search engines now look for duplicate copy, even when it’s hidden behind a link like the Similar Pages that Google uses for related content. If they find it, your site may be lowered in the rankings or even delisted completely.
Still, the duplicate-copy issue isn’t as simple as it may seem. Some people think there’s too much worry about it, whereas others insist the problem needs to be addressed. And both are right to some degree. Let me explain.
First, you need to understand that not all duplicate content is the same. There are four kinds, and the differences between them matter.
Reprints: This is duplicate content published on multiple sites with the permission of the copyright holder. These are the articles that you or others create and then distribute to create links back to your site or to sites that are relevant to the content of yours. Reprints are not bad duplicate content, but they can get your site thrown into the realm of Similar Pages, which means they’ll be buried behind other results.
Site Mirroring: This is the kind of duplication that can cause one or more of your sites to be delisted from a search engine. Site mirroring is literally keeping exact copies of your web site in two different places on the Internet. Web sites used to practice site mirroring all the time as a way to avoid downtime when one site crashed. These days, server capabilities are such that site mirroring isn’t as necessary as it once was, and search engines now “dis-include” mirrored content because of the spamming implications it can have. Spammers have been known to mirror sites to create a false Internet for the purpose of stealing user names, passwords, account numbers, and other personal information.
Content Scraping: Content scraping is taking the content from one site and reusing it on another site with nothing more than cosmetic changes. This is another tactic used by spammers, and it’s also often a source of copyright infringement.
Same Site Duplication: If you duplicate content across your own web site, you could also be penalized for duplicate content. This becomes especially troublesome with blogs, because there is often a full blog post on the main page and then an archived blog post on another page of your site. This type of duplication can be managed by simply using a partial post, called a snippet, that links to the full post in a single place on your web site.
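The snippet approach for same-site duplication can be sketched in a few lines of Python. This is only an illustration: the function name `make_snippet`, the 40-word default, the link text, and the example URL are all assumptions, and most blog platforms have their own built-in excerpt feature that does the same job.

```python
import html


def make_snippet(post_text, post_url, max_words=40):
    """Return a short teaser for post_text that links to the full post.

    Hypothetical helper: the main page shows only this teaser, so the
    full text of the post lives at a single URL on the site.
    """
    words = post_text.split()
    if len(words) <= max_words:
        teaser = " ".join(words)
    else:
        # Keep the first max_words words and mark the cut with an ellipsis.
        teaser = " ".join(words[:max_words]) + "..."
    # Escape both values so they are safe to place in an HTML page.
    safe_teaser = html.escape(teaser)
    safe_url = html.escape(post_url, quote=True)
    return f'{safe_teaser} <a href="{safe_url}">Read the full post</a>'


print(make_snippet("one two three four five", "/blog/my-post", max_words=3))
```

With this in place, the main page carries only the teaser and the link, while the archive page remains the one place where the complete post appears.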
Of these types of duplicate content, two are especially harmful to your site: site mirroring and content scraping. If you’re using site mirroring, you should consider using a different backup method for your web site. If you’re using content scraping you could be facing legal action for copyright infringement. Content scraping is a practice that’s best avoided completely.
Even though reprints and same-site duplication are not entirely harmful, they are also not helpful. And in fact they can be harmful if they’re handled in the wrong way. You won’t win any points with a search engine crawler if your site is full of content that’s used elsewhere on the Web. Reprints, especially those that are repeated often on the Web, will eventually make a search engine crawler begin to take notice.
Once it takes notice, the crawler will try to find the original location of the reprint. It does this by looking at where the content appeared first. It also looks at which copy of an article the most links point to and what versions of the article are the result of content scraping. Through a process of elimination, the crawler narrows the field until a determination can be made. Or if it’s still too difficult to tell where the content originated, the crawler will select from trusted domains.
Once the crawler has determined what content is the original, all of the other reprints fall into order beneath it or are eliminated from the index.
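The process of elimination described above can be modeled as a toy Python sketch. Search engines do not publish how they actually make this decision, so everything here is an assumption for illustration: the `PageCopy` fields, the order in which the signals are checked, and the fallback rules.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PageCopy:
    """One copy of an article (all fields are hypothetical signals)."""
    url: str
    first_seen: date
    inbound_links: int
    scraped: bool = False
    trusted_domain: bool = False


def pick_original(copies):
    """Guess which copy is the original, following the steps in the text."""
    # Step 1: eliminate copies identified as content scraping.
    candidates = [c for c in copies if not c.scraped]
    if not candidates:
        return None
    # Step 2: look at which copy appeared first and which has the most links.
    earliest = min(candidates, key=lambda c: c.first_seen)
    most_linked = max(candidates, key=lambda c: c.inbound_links)
    if earliest is most_linked:
        return earliest
    # Step 3: the signals disagree, so fall back to trusted domains.
    trusted = [c for c in candidates if c.trusted_domain]
    if trusted:
        return min(trusted, key=lambda c: c.first_seen)
    return earliest


copies = [
    PageCopy("http://example.com/a", date(2008, 1, 1), 5),
    PageCopy("http://example.org/b", date(2008, 2, 1), 50, trusted_domain=True),
    PageCopy("http://example.net/c", date(2007, 12, 1), 100, scraped=True),
]
print(pick_original(copies).url)
```

In this made-up data set the scraped copy is dropped first, the remaining two copies disagree on the date and link signals, and the trusted domain decides the outcome.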
If you must use content that’s not original, or if you must have multiple copies of content on your web site, there is a way to keep those duplications from adversely affecting your search rankings. By using a robots.txt file or a noindex meta tag, you can keep duplicated pages from being indexed by the search engine.
The noindex meta tag should be placed in the <head> section of the page that you don’t want to be indexed. It’s also a good idea to allow the crawler that finds the tag to follow links that might be on the page. To do that, your meta tag should look like this:
<meta name="robots" content="noindex,follow">
That tag tells the search engine not to index the page, but to follow the links on the page. This small snippet of code can help you quickly solve the problem of search engines reading your duplicate content.
So in conclusion, my advice would be to avoid any type of duplicate content if your main goal is to achieve high search engine rankings on your website. By providing fresh & unique content on your website, you are not only pleasing the search engine, but more importantly, you’re pleasing your user, which should be your ultimate goal as a webmaster.
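As a rough sketch of the robots.txt route, the Python below holds a small robots.txt file as a string and uses the standard library's `urllib.robotparser` to check which URLs a well-behaved crawler would be asked to skip. The paths, the domain, and the `ExampleBot` crawler name are made up for illustration; you would substitute the duplicated pages on your own site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask all crawlers to stay away from the
# printer-friendly copies and from one duplicated archive page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /archive/duplicate-post.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, agent="ExampleBot"):
    """Return True if the rules above allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


for url in (
    "http://www.example.com/blog/my-post.html",
    "http://www.example.com/print/my-post.html",
    "http://www.example.com/archive/duplicate-post.html",
):
    print(url, "->", "allowed" if is_crawlable(url) else "blocked")
```

Note that robots.txt is a request that crawlers choose to honor, and it applies to fetching pages by path, so it suits whole directories of duplicates better than one-off pages.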
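If you want to confirm that a page really carries the tag, a short Python check along these lines may help. It is a minimal sketch using the standard library's `html.parser`; the class and function names are my own, and the page is passed in as a string so that nothing depends on a live web site.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Split "noindex,follow" into individual lower-case directives.
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html_text):
    """Return the set of robots meta directives found in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


page = (
    "<html><head><title>Duplicate page</title>"
    '<meta name="robots" content="noindex,follow">'
    "</head><body><p>Reprinted article</p></body></html>"
)
found = robots_directives(page)
print("noindex" in found, "follow" in found)
```

Running a check like this against each duplicated page is a quick way to catch a tag that was left out or mistyped.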
Andy MacDonald is CEO of Swift Media UK, a website design & search marketing company. For daily tips on blogging, marketing, SEO & making money online, check out our SEO & Marketing Tips for Webmasters blog or subscribe by RSS.