As we have discussed, duplicate content can be created in many ways, and in many cases the duplicate pages have no value to either users or search engines. If that is the case, try to eliminate the problem altogether by fixing the implementation so that each page is referred to by only one URL. Also, 301-redirect the old URLs to the surviving URLs to help the search engines discover the change as rapidly as possible, and to preserve any link authority the removed pages may have had. If that process proves impossible, fall back on a simpler approach.
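Such a 301 redirect is typically configured at the web server level. Here is a minimal sketch for Apache's .htaccess; both the old path and the destination URL are hypothetical examples:

```
# .htaccess (Apache): permanently redirect a retired duplicate URL
# to the surviving URL (both paths are hypothetical examples)
Redirect 301 /old-duplicate-page.html https://www.example.com/surviving-page/
```

Because the redirect returns a 301 (permanent) status code, search engines treat the surviving URL as the replacement and consolidate link signals onto it; a 302 (temporary) redirect would not do this as reliably.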
Here is a summary of the simplest solutions:
- Use robots.txt to block search engine spiders from crawling the duplicate versions of pages on your site.
- Use the rel="canonical" link element. This is the next best solution after eliminating the duplicate pages outright.
- Use <meta name="robots" content="noindex"> to tell the search engines not to index the duplicate pages.
Be aware, however, that if you use robots.txt to prevent a page from being crawled, then using noindex or nofollow on the page itself does not make sense—the spider can’t read the page, so it will never see the noindex or nofollow.
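To make the three options concrete, here is a sketch of each; the directory path and URLs shown are hypothetical:

```
# robots.txt (at the site root) — block crawling of a duplicate
# section, here a hypothetical /print/ directory:
User-agent: *
Disallow: /print/

<!-- rel="canonical", placed in the <head> of each duplicate page,
     pointing at the preferred URL (hypothetical): -->
<link rel="canonical" href="https://www.example.com/original-page/">

<!-- meta robots noindex, also placed in the <head> of a duplicate page: -->
<meta name="robots" content="noindex">
```

Note that robots.txt rules live in a single file at the site root, while the canonical link and the robots meta tag must be placed in the <head> of each individual duplicate page.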
Duplicate content in blogs and multiple archiving systems (e.g., pagination)
Blogs present some interesting duplicate content challenges. A blog post can appear on many different pages: the home page of the blog, the permalink page for the post, date archive pages, and category pages. Each instance of the post is a duplicate of the others. Few publishers attempt to address the post appearing on both the blog home page and its permalink page, and this pattern is common enough that the search engines likely deal with it reasonably well. However, it may make sense to show only excerpts of the post on the category and/or date archive pages.
How Search Engines Identify Duplicate Content
- Where Google first saw the content
- Trust in the domain
- Where most links point
- Whether links on the copies point back to an original
- Which version appears to have been through "scraping and repurposing"
- If it is a close call, PageRank
There are a few facts about duplicate content that bear mentioning, as they can trip up webmasters who are new to the duplicate content issue:
Location of the duplicate content
Is it duplicate content if it is all on my site? Yes; duplicate content can occur within a single site or across different sites.
Percentage of duplicate content
What percentage of a page has to be duplicated before I run into duplicate content filtering? Unfortunately, the search engines would never reveal this information because it would compromise their ability to prevent the problem. It is also a near certainty that the percentage at each engine fluctuates regularly and that more than one simple direct comparison goes into duplicate content detection. The bottom line is that pages do not need to be identical to be considered duplicates.
Ratio of code to text
What if your web page's code is huge and there are very few unique HTML elements on the page? Will Google think the pages are all duplicates of one another? No. The search engines do not care about your code; they are interested in the content on your page. Code size becomes a problem only when it becomes extreme.
Ratio of navigation elements to unique content
Every page on my site has a huge navigation bar, lots of header and footer items, but only a little bit of content; will Google think these pages are duplicates? No. Google and Bing factor out the common page elements, such as navigation, before evaluating whether a page is a duplicate. They are very familiar with the layout of websites and recognize that structures repeated on all (or many) of a site's pages are quite normal. Instead, they pay attention to the "unique" portions of each page and often largely ignore the rest. Note, however, that pages with so little unique content will almost certainly be considered thin content by the engines.
What should I do if I want to avoid duplicate content problems, but I have licensed content from other web sources to show my visitors? Use <meta name="robots" content="noindex, follow">. Place this in your page's <head> section and the search engines will know that the content isn't for them. This is a general best practice, because humans can still visit and link to the page, and the links on the page will still carry value. Another alternative is to make sure you have exclusive ownership and publication rights for that content.
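For a licensed-content page, this could look like the following; the surrounding markup is only a sketch and the page title is hypothetical:

```html
<!DOCTYPE html>
<html>
<head>
  <title>Licensed Article (hypothetical page)</title>
  <!-- Keep this page out of the index, but let crawlers follow
       its links so they continue to pass value -->
  <meta name="robots" content="noindex, follow">
</head>
<body>
  <!-- licensed content shown to visitors goes here -->
</body>
</html>
```

The "follow" directive is actually the default, but stating it explicitly makes the intent clear: exclude the page from the index without cutting off the links on it.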
An actual penalty situation
Websites that take content from across the Web can be at risk; you might see the site actually penalized. If you find yourself in this situation, the only fix is to reduce the number of duplicate pages accessible to the search engine crawler. You can accomplish this by deleting them, using rel="canonical" on the duplicates, noindexing the pages themselves, or adding a substantial amount of unique content.
For affiliate websites
The problem arises when a merchant has thousands of affiliates promoting its products using the same descriptive content, and the search engines have observed user data suggesting that, from a searcher's perspective, these sites add little value to their indexes. Thus, the search engines attempt to filter out this type of site, or even ban it from the index. Plenty of sites operate affiliate models but also provide rich new content, and these generally have no problem; it is when duplicated content and a lack of unique, value-adding material come together on a domain that the engines may take action.