Duplicate Content: Why It May Not Matter
A concern among many SEO and SEM experts is that search engines may penalize duplicate content (web pages ripped from PLR or open publishing sites). After learning about the internals of search engine algorithms, I suspect duplicate content is not an issue for any search engine.
There are three generic information retrieval techniques: boolean, probabilistic, and vector space models. Vector space models appear to be the current hot topic, with latent semantic indexing (LSI) leading the charge. With vector space models you project each document into a set of co-ordinates and then use those co-ordinates to measure how close documents are to one another. Duplicate content is projected to the same co-ordinate. In general, similar content is projected close to similar content, and away from dissimilar content.
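To make the idea concrete, here is a minimal sketch of a vector space model in Python. It is only an illustration under my own assumptions: it uses raw word counts as co-ordinates and cosine similarity as the closeness measure, which is far simpler than LSI or anything a real search engine runs. The function names `term_vector` and `cosine_similarity` and the sample sentences are made up for this example.

```python
import math
from collections import Counter

def term_vector(text):
    """Project a document onto word-count co-ordinates (one axis per word)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sample documents, invented for illustration only.
original = term_vector("search engines rank pages by relevance and authority")
copied = term_vector("search engines rank pages by relevance and authority")
similar = term_vector("search engines rank pages by authority")
unrelated = term_vector("my cat sleeps on the warm windowsill")

print(original == copied)                      # duplicates share one co-ordinate
print(cosine_similarity(original, copied))     # as close as two documents can be
print(cosine_similarity(original, similar))    # close, but not identical
print(cosine_similarity(original, unrelated))  # no shared words at all
```

The duplicate lands on exactly the same point as the original, the near-copy lands nearby, and the unrelated text ends up far away.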
Given that duplicate content will be located at the same co-ordinate, a search engine would simply continue to rank the content by other criteria such as PageRank, domain authority, etc. So I guess if someone is ripping your content and they are considered more authoritative, you have a problem; but I suspect this would be a very unusual situation.
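Here is a rough sketch of what that tie-breaking could look like. Everything in it is an assumption on my part: the `authority` number is a stand-in for whatever signals a search engine really uses, the URLs are placeholders, and `fingerprint` and `rank_duplicates` are hypothetical helpers, not any search engine's actual method.

```python
from collections import Counter, defaultdict

def fingerprint(text):
    """Hashable form of a document's word-count co-ordinates."""
    return frozenset(Counter(text.lower().split()).items())

def rank_duplicates(pages):
    """Group pages that share a co-ordinate, then order each group
    by a (hypothetical) authority score, highest first."""
    groups = defaultdict(list)
    for url, text, authority in pages:
        groups[fingerprint(text)].append((authority, url))
    return [
        [url for _, url in sorted(members, key=lambda m: m[0], reverse=True)]
        for members in groups.values()
    ]

# Placeholder pages with invented authority scores.
pages = [
    ("original.example/post", "duplicate content may not matter", 0.4),
    ("scraper.example/copy", "duplicate content may not matter", 0.7),
    ("other.example/page", "a different article entirely", 0.5),
]

for group in rank_duplicates(pages):
    print(group)
```

In this made-up data the copy outranks the original because it carries the higher authority score, which is exactly the unusual situation described above. Nothing is penalized; the duplicates are just ordered within their group.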
[Also, I read on Matt Cutts's blog that you should not be concerned about duplicate content, though I can't find the blog post.]


June 18th, 2007 at 10:56 pm
Why do you say duplicates do not matter in semantic search? That is just silly. In fact, duplicates matter very much.
The problem is content is copied from blog to blog to website to one or more aggregators…. ‘nuf said.
June 19th, 2007 at 9:52 pm
My post was meant to illustrate why duplicate content may not matter for a web site using PLR and other copied material. There is a concern among some people that sites with duplicate content may be penalized. I tried to suggest in the post that search engine algorithms inherently identify duplicate content and can work around it without explicitly penalizing a site.