A canonical link is an HTML element that helps webmasters prevent duplicate-content problems by specifying the preferred (canonical) version of a web page's URL, as part of that page's search engine optimization. Search engines have difficulty determining the original source of a document that is available through multiple URLs.
Content duplication can transpire in a number of ways:
In short, duplicate-content problems occur when the same content can be accessed from more than one URL. Search engines use canonical link definitions as a filter on search results: if several URLs with the same content appear in the results, the canonical URL definitions are taken into consideration to determine which URL is the original source of the content.
Canonical link elements are beneficial, but Google prefers a 301 redirect where one is possible. This is because Google treats the canonical link element as a hint rather than a directive, and its crawlers may disregard it if they judge another URL to be the better choice.
Canonical URLs can be used to consolidate link and ranking signals for content that is available through more than one URL structure or through syndication. In online shopping systems and content management systems, it is not unusual for one piece of content to be accessible through more than one URL. Content syndication also makes it easy for content to be distributed to different URLs and even to entirely different domains.
A few examples are as follows:
While all of these things make it easier to develop and then distribute content, they do cause some challenges when individuals use search engines to reach a page.
This can happen when:
To address these issues, it is recommended that you define the canonical URL for any content that is available via more than one URL. There are several methods for defining a canonical URL.
You will have to tell Google which version of the website's URL you prefer to be used for the domain. This can be something like https://www.example.com or https://example.com. If the preferred domain is set as the latter, Google will treat links to the former in the same way.
Once the preferred domain has been set, Google will use that information for any future crawls of the website as well as any index refreshes. Google will also take your preference into consideration when displaying a URL. If you have not specified a preferred domain, Google may treat the www and non-www versions of the same domain as separate references to separate pages. The change may not be reflected in the index immediately: pages that currently appear in the index under the non-preferred version of the URL will stay that way until the index is refreshed.
To specify a preferred domain, go to the Search Console Home page and click on the site you want to edit. Click on the gear icon, and then Site Settings. Within the Preferred Domain section, select the option that you want.
It is possible that you will need to verify that you own both versions (www and non-www) of the domain. Setting a preferred domain affects crawling and indexing, which is why verification is needed for both versions. In general, both versions point to the same physical location, but this is not always the case. Once one version of the domain has been verified, Google is able to verify the other using the original verification code. However, if you have removed the meta tag, file, or DNS record used for verification, the verification process will need to be repeated from the start.
Even when one URL has been chosen as the preferred URL, a variety of other URLs may still lead to the same content. The preferred URL can be indicated to the search engine by doing the following:
This indicates which URL is preferred for accessing the content, making it more likely that search results will show that URL structure.
To avoid errors, use an absolute URL rather than a relative path in the rel="canonical" element.
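As a rough illustration, the check below pulls the canonical link out of a page's HTML and reports whether it is an absolute URL. It is a minimal sketch that uses only the Python standard library; the class and function names, the sample markup, and the example.com URLs are assumptions made for this example and are not part of any Google tooling.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonical(html):
    """Return all canonical hrefs found in the given HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


def is_absolute(url):
    """True when the URL carries both an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical pages: one with an absolute canonical URL, one with a relative path.
PAGE = """<html><head>
<title>Green dress</title>
<link rel="canonical" href="https://www.example.com/dresses/green-dress">
</head><body></body></html>"""

RELATIVE_PAGE = '<head><link rel="canonical" href="/dresses/green-dress"></head>'

for html in (PAGE, RELATIVE_PAGE):
    for href in find_canonical(html):
        print(href, "absolute" if is_absolute(href) else "relative")
```

A page that yields more than one canonical href, or a relative one, is worth reviewing by hand.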
Pick the preferred (canonical) URL for each of your web pages, and communicate your preferences by submitting the canonical URLs in a sitemap. Google does not guarantee that it will use the URLs submitted through a sitemap, but this is one of the better ways to tell Google which pages within a website are considered the most important.
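A sitemap that lists only canonical URLs can be generated with a few lines of code. The sketch below assumes the standard sitemaps.org XML format and uses hypothetical example.com URLs; the function name is made up for this example, and a real sitemap would usually be written to a file and may carry optional fields that are left out here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical_urls):
    """Return a sitemap XML string with one <url><loc> entry per canonical URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in canonical_urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical canonical URLs; non-preferred variants are simply not listed.
sitemap_xml = build_sitemap([
    "https://www.example.com/",
    "https://www.example.com/dresses/green-dress",
])
print(sitemap_xml)
```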
Imagine that your web page can be accessed from more than one URL structure. It would be wise to choose one of the URLs as the canonical (preferred) destination, and then employ 301 redirects that send users from the other URLs to the preferred URL. A server-side 301 redirect is the best way to ensure that search engines and users are directed to the correct page. A 301 status code indicates that a web page has been permanently moved to a different location.
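The redirect decision itself is simple to express. The following sketch assumes a hypothetical site whose preferred form is https://www.example.com and whose bare domain serves the same content; the constants and the function name are illustrative, and in practice this logic would normally live in the web server or framework configuration rather than in a standalone function.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.example.com"  # hypothetical preferred host
ALTERNATE_HOSTS = {"example.com"}   # hypothetical hosts serving the same content


def redirect_for(url):
    """Return (301, headers) if the URL should redirect, or None if no redirect applies."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != CANONICAL_HOST and host not in ALTERNATE_HOSTS:
        return None  # not one of this site's hosts
    # Rebuild the URL on the preferred scheme and host, keeping path and query.
    target = urlunsplit(
        (CANONICAL_SCHEME, CANONICAL_HOST, parts.path or "/", parts.query, "")
    )
    if target == url:
        return None  # already canonical
    return 301, {"Location": target}


print(redirect_for("http://example.com/shoes?id=3"))
print(redirect_for("https://www.example.com/shoes"))
```

The first call yields a 301 with a Location header pointing at the preferred URL; the second returns None because the URL is already in its preferred form.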
You can use Parameter Handling to tell Google which URL parameters you would like it to ignore. Ignoring certain parameters can reduce the amount of duplicate content in Google's index, which ultimately makes the website easier to crawl. For example, you can indicate that a session ID in a URL should be ignored.
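The effect of ignoring a parameter can be pictured as removing it from the URL before comparing pages. The sketch below is only an illustration of that idea, not a description of how Google implements it; the parameter names in IGNORED_PARAMS and the function name are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that do not change the page content.
IGNORED_PARAMS = {"sessionid", "sid"}


def strip_ignored_params(url):
    """Return the URL with every ignored query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_ignored_params("https://www.example.com/shoes?color=red&sessionid=abc123"))
```

Two URLs that differ only in a session ID reduce to the same string, which is why such parameters are good candidates for being ignored.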
When Google detects duplicate content, an algorithm groups the duplicate URLs into a cluster and then selects what it considers the best URL to represent the cluster in the search results. Google then tries to consolidate what is known about the URLs in the cluster, such as link popularity, onto that one representative URL, which helps to improve the accuracy of the page's ranking within Google's search results.
When Google is not able to find all of the URLs within a cluster or can't choose the representative URL that is preferred, you can utilize the URL parameters tool from Google in order to share information about how it should handle a URL that contains specific parameters.
NOTE: use caution when employing the URL Parameters tool. If you mistakenly mark a parameter as one that should not be crawled, Google could end up unintentionally excluding content from being crawled.
If you are able to configure your server, you can use rel="canonical" HTTP headers to specify the canonical URL for HTML documents as well as other files such as PDFs. For example, if a website makes the same PDF available through several different URLs, you can use the rel="canonical" header to tell Google what the canonical URL is for the PDF file. These link header elements are currently supported only for web search.
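In practice, this header is an HTTP Link header whose value names the canonical URL in angle brackets followed by the rel parameter. The sketch below builds such a header and reads it back; the PDF URL and the function names are hypothetical, and the regular expression handles only the simple form shown here rather than every syntax the Link header allows.

```python
import re

# Matches <URL>; rel="canonical" in its simplest form (a deliberate simplification).
_CANONICAL_LINK = re.compile(r'<([^>]*)>\s*;\s*rel\s*=\s*"?canonical"?', re.IGNORECASE)


def canonical_link_header(canonical_url):
    """Return the (name, value) pair for a rel="canonical" Link header."""
    return "Link", f'<{canonical_url}>; rel="canonical"'


def parse_canonical(link_value):
    """Return the canonical URL named in a Link header value, or None."""
    match = _CANONICAL_LINK.search(link_value)
    return match.group(1) if match else None


name, value = canonical_link_header("https://www.example.com/downloads/white-paper.pdf")
print(f"{name}: {value}")
print(parse_canonical(value))
```

The server would send this header with every URL under which the PDF is available, each time naming the same canonical URL.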
HTTP header fields are the part of the header section of HTTP request and response messages that defines the operating parameters of an HTTP transaction. The fields are transmitted after the request or response line, which is the first line of the message. Each field is a name-value pair in clear-text string format, with the name and value separated by a colon, and each field ends with a carriage return and line feed sequence. An empty line indicates the end of the header section.
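The layout described above can be made concrete by splitting a raw message by hand. The following is a minimal sketch for illustration only: the sample response is invented, the function name is an assumption, and a repeated field simply overwrites the earlier value, which a production-grade HTTP parser would not do.

```python
def parse_header_section(raw):
    """Split a raw HTTP message into its first line and a dict of header fields."""
    # The empty line (CRLF CRLF) marks the end of the header section.
    head, _, _body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line, field_lines = lines[0], lines[1:]
    fields = {}
    for line in field_lines:
        # Split at the first colon only; values such as URLs contain colons too.
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return start_line, fields


# Hypothetical response for a PDF that declares its canonical URL.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/pdf\r\n"
    'Link: <https://www.example.com/downloads/white-paper.pdf>; rel="canonical"\r\n'
    "\r\n"
)

start_line, fields = parse_header_section(RAW_RESPONSE)
print(start_line)
for name, value in fields.items():
    print(f"{name} = {value}")
```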
Google prefers to use HTTPS pages over HTTP pages as far as canonical links go, except when signals are conflicting. A few examples of this are:
Though Google's system automatically prefers HTTPS pages over HTTP pages, you can ensure this behavior by doing one of the following:
To keep Google from wrongly making the HTTP page the preferred link, you should avoid the following practices. Avoid bad SSL certificates and HTTPS-to-HTTP redirects, which may cause Google to choose the HTTP page over the HTTPS page; even HSTS will not override this preference. Avoid including the HTTP page, rather than the HTTPS version, in your sitemap or hreflang entries. Lastly, avoid implementing your SSL or TLS certificate for the incorrect host-variant, for example sample.com serving the certificate for www.sample.com. The certificate has to match the complete website URL, or be a wildcard certificate that can be employed for more than one subdomain within a domain.
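The host-variant rule can be checked mechanically: a certificate name either equals the hostname exactly or is a wildcard name of the form *.sample.com, which covers one subdomain level of that domain. The sketch below is a simplified illustration of that comparison, assuming a wildcard stands for exactly one leading label; the function name is hypothetical, and real certificate validation involves far more than this name check.

```python
def cert_name_matches(cert_name, hostname):
    """Simplified check of a certificate name against the hostname being served."""
    cert_name, hostname = cert_name.lower(), hostname.lower()
    if cert_name.startswith("*."):
        # A wildcard stands in for exactly one leading label of the hostname.
        _, _, parent = hostname.partition(".")
        return bool(parent) and parent == cert_name[2:]
    return cert_name == hostname


# The mismatch from the text: a sample.com certificate served on www.sample.com.
print(cert_name_matches("sample.com", "www.sample.com"))
# A wildcard certificate covers the www subdomain, but not the bare domain.
print(cert_name_matches("*.sample.com", "www.sample.com"))
print(cert_name_matches("*.sample.com", "sample.com"))
```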