Duplicate Content Causes SEO Problems in WordPress

By default, WordPress is not very search-engine friendly. It serves the same content at many different URLs, and Google treats that as duplicate content.

Modern search engines like Google penalize duplicate content. They do this because they want to credit the original author and keep spammers and copiers from gaining PageRank. They also try to separate a page’s actual content from navigation, sidebar, and other boilerplate text that carries less value. All of this helps Google provide better search results.

So what’s the big deal?

 
Here is a direct quote from Google’s FAQ: How can I create a Google-friendly site?

Don’t create multiple copies of a page under different URLs. Many sites offer text-only or printer-friendly versions of pages that contain the same content as the corresponding graphic-rich pages. To ensure that your preferred page is included in our search results, you’ll need to block duplicates from our spiders using a robots.txt file.

The problem is that, by default, your WordPress blog will not rank well in Google’s search results because of this duplication. Fortunately, Google and other search engines give us a way to tell their spiders to ignore specific content.

Let me give you a few examples of where WordPress serves the same content at multiple URLs.

1. www.yourblog.com shows the last 10 or so posts by default. The “original content” may live at this URL: http://www.yourblog.com/2008/04/title-of-story , but it also appears on your home page.

2. http://www.yourblog.com/2008/04/ – Your date archive pages contain the same posts as your main page, limited to a date range.

3. http://www.yourblog.com/category/category-name/ – The same goes for your category pages. Every post in the category-name category is duplicated at this URL.

4. http://www.yourblog.com/feed – All articles are duplicated in their entirety in all of the default WordPress feeds.

5. http://www.yourblog.com/search – Search result pages, of course, also duplicate your content.

As you can see, these URLs are all different but contain the same content. If you don’t want to be penalized by Google, you need to create a file called ‘robots.txt’ in the root directory of your blog. Search engine spiders look for this file, and it is where you specify the rules you want them to follow.

Robots.txt

 
The following robots.txt file keeps search engine spiders away from most of the duplication problems I can think of.

User-agent: *
Disallow: /wp-
Disallow: /search
Disallow: /feed
Disallow: /comments/feed
Disallow: /feed/$
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
Disallow: /*/*/*/trackback/$
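
To get a feel for what these rules catch, here is a minimal Python sketch that tests a few URL paths against a subset of the rules above. It assumes the wildcard behavior Google documents for robots.txt ('*' matches any run of characters, a trailing '$' marks the end of the URL, and every rule is otherwise a prefix match). The helper names rule_to_regex and is_blocked are my own, and the sample paths are made up for illustration. I wrote the matcher by hand because, as far as I know, Python's built-in urllib.robotparser does not understand these wildcards.

```python
import re

def rule_to_regex(rule):
    """Translate one robots.txt path rule into a compiled regex.

    Assumed semantics (Google-style): '*' matches any run of characters,
    a trailing '$' anchors the rule at the end of the URL path, and
    without '$' the rule is a simple prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

# A subset of the Disallow rules from the robots.txt above.
DISALLOW_RULES = [
    "/wp-",
    "/search",
    "/feed",
    "/comments/feed",
    "/*/feed/$",
    "/*/feed/rss/$",
    "/*/trackback/$",
]

def is_blocked(path, rules=DISALLOW_RULES):
    """Return True if any Disallow rule matches the given URL path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

if __name__ == "__main__":
    # Hypothetical paths, modeled on the example URLs earlier in the post.
    for path in [
        "/2008/04/title-of-story",
        "/2008/04/title-of-story/feed/",
        "/2008/04/title-of-story/trackback/",
        "/category/category-name/",
        "/category/category-name/feed/",
        "/search?s=seo",
        "/wp-admin/",
    ]:
        print(path, "blocked" if is_blocked(path) else "allowed")
```

Under these assumed semantics the '*' also matches slashes, so a single rule such as /*/feed/$ already covers feeds nested several directories deep. The longer /*/*/ variants in the file are then harmless extras, though a crawler that interprets wildcards differently may still need them.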

Category pages, archive pages, and certainly the home page need different treatment. We want search engine spiders to follow the links on these pages, yet we don’t want the content itself indexed because of the duplication penalties.

This code tells the bots to follow links but not index the content. Place this HTML in the <head> section of your archive and category pages.

<meta name="robots" content="noindex,follow" />
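
If you want to confirm that a page really carries the tag, something like the following Python sketch could help. It parses a page's HTML with the standard library and collects the directives from any robots meta tag. The class and function names and the sample page are my own inventions for illustration; in practice you would feed it the HTML you fetched from one of your own archive or category URLs.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)

def robots_directives(html):
    """Return the set of robots meta directives present in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives

if __name__ == "__main__":
    # Made-up archive page used only to demonstrate the check.
    sample_archive_page = (
        "<html><head><title>April 2008</title>"
        '<meta name="robots" content="noindex,follow" />'
        "</head><body></body></html>"
    )
    found = robots_directives(sample_archive_page)
    print("noindex" in found and "follow" in found)
```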

Another line of defense against duplication on these pages is the WordPress ‘more’ function. When you add it to a post, WordPress cuts the article off at that point on listing pages and offers the reader a “read more” link. Not only does this help with duplication, it also makes browsing through articles much easier for your users.

The more function:

<!--more-->
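
To see how much of a post would still show up on your home, archive, and category pages, here is a rough Python sketch that cuts a post at the marker. It only approximates the idea and is not WordPress's own code. The pattern is deliberately lenient about whitespace inside the comment, which is my assumption for illustration; WordPress itself may be stricter. The function name teaser and the sample post are hypothetical.

```python
import re

# Lenient illustrative pattern: an HTML comment whose text starts with "more".
MORE_MARKER = re.compile(r"<!--\s*more.*?-->")

def teaser(post_content):
    """Return the part of a post that comes before the 'more' marker.

    If the post has no marker, the whole post is returned, which means
    the whole post would be repeated on listing pages.
    """
    before_marker = MORE_MARKER.split(post_content, maxsplit=1)[0]
    return before_marker.strip()

if __name__ == "__main__":
    # Made-up post body for demonstration.
    sample_post = (
        "Duplicate content can hurt your rankings.\n"
        "<!--more-->\n"
        "The rest of the article only appears on the post's own URL."
    )
    excerpt = teaser(sample_post)
    print(excerpt)
    print(len(excerpt), "of", len(sample_post), "characters appear on listing pages")
```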

Of course, Google has many ways to determine PageRank and how pages are indexed. We can never be sure how Google weighs them or what other factors determine a page’s quality. So you may find a piece of content duplicated all over the internet while the original still ranks high because of link popularity or other factors.

Your entire website may also be a duplicate. Check out Duplicate Content www vs. non-www Canonical Problems.




6 Responses to "Duplicate Content Causes SEO Problems in WordPress"
  1. [...] private. I use robots.txt on this site to keep the search engines away from parts of the site that contain duplicate content. Often times sites exclude private data that they don’t want indexed. This is where the Robot [...]

  2. Squeaky on April 28th, 2009

    I haven’t seen this format being used before. Could you explain what it does and what the difference is?

    Disallow: /*/*/feed/$
    Disallow: /*/*/feed/rss/$
    Disallow: /*/*/trackback/$
    Disallow: /*/*/*/feed/$
    Disallow: /*/*/*/feed/rss/$
    Disallow: /*/*/*/trackback/$

    I have seen it used this way.

    Disallow: /*/feed/$
    Disallow: /*/feed/rss/$
    Disallow: /*/trackback/$
    Disallow: /*/feed/$
    Disallow: /*/feed/rss/$
    Disallow: /*/trackback/$

  3. Pierre F. Walter on August 31st, 2009

    This is the best WordPress-related advice I have seen from any developer or consultant. Thanks so much, it is really precious, and yet I want to tell folks this also. I have been on the web with ipublica since 1998 and to this day was never ranked properly in Google. Why? It took me years to find out it was because I was using Frontpage, and it’s deprecated by Google because of wrong code. Then a few months ago I started with WordPress and made a test site. Within days, two Google staff posted me appreciative comments on one page, and I saw that page ranked 15th place in Google under the keyword. That was not happening for all those years, so I think WordPress might not score so badly after all.

    Please tell me Mark, why do you put those /* in front of the commands? I have never seen that in robots files, only the format /foldername or / when it’s the home page. In script language, I have seen that for making explanative comments, or for invalidating a script command.

    Could you kindly explain this to me, please?
    Thanks in advance.
    Pierre

  4. Bill Bartmann on September 1st, 2009

    Excellent site, keep up the good work

  5. Mark Sanborn on September 2nd, 2009

    The asterisk (*) simply means wildcard. Kinda like poker. It stands in for any run of characters. So /*/feed can mean /blah/feed or /foo/feed. The dollar sign is a symbol borrowed from regular expression land (don’t go there unless you are uber geeky) and means the end of the URL.

  6. Coortpoosse on January 2nd, 2010

    Many of people talk about this issue but you wrote down some true words!