All About Canonicals

Canonical Link Elements

A canonical link element is an HTML element that helps webmasters prevent duplicate-content problems by specifying the preferred (canonical) URL of a web page, as part of that page's search engine optimization. Search engines have real difficulty identifying the original source of a document that is available through multiple URLs.

Content duplication can arise in a number of ways:

Print versions of websites
Availability on multiple hosts or protocols
Multiple URLs due to content management systems
GET parameters

In short, duplicate-content problems occur when the same content can be accessed from more than one URL. Search engines aim to use canonical link definitions as an output filter for search results: if several URLs with the same content appear in the results, the canonical URL definitions are taken into account to determine the original source of the content.

Canonical link elements are beneficial, but Google prefers a 301 redirect where one is possible. The reason is that Google treats a canonical link element as a hint rather than a directive, so its crawlers may disregard the element if other signals point to a different URL.

Canonical URLs can be used to improve link and ranking signals for content that is available through more than one URL structure or through syndication. In online shopping systems and content management systems, it is not unusual for one piece of content to be reachable through more than one URL. Content syndication also makes it easy for content to spread across different URLs and even entire domains.

A few examples are as follows:

The product page for one item can have dynamic URLs because of the user's session and/or search preferences.
Blog systems can automatically save multiple URLs for the same post, as they are filed under different sections.
Your server is configured to deliver the same content over both the http and https protocols, or with and without the www subdomain.
Content that a blog provides for syndication to other websites is replicated in full or in part on those domains.

While all of these things make it easier to develop and distribute content, they cause some challenges when people use search engines to reach a page.

Defining a canonical URL for content that is available through multiple URLs helps in the following ways:

Link signals for duplicate or similar content are consolidated. This helps search engines consolidate the information they have for the individual URLs into one single, preferred URL.
Metrics are tracked for a single product or topic. With a variety of URLs, it is harder to get consolidated metrics for a particular piece of content.
You determine the URL that you would like people to see. This could mean that people see a simpler version of the link that makes more sense when read.
Syndicated content is managed. When you syndicate your content for publication on other domains, you will want to consolidate page ranking to the preferred URL.

To address these issues, define a canonical URL for any content that is available via more than one URL. There are several methods for defining a canonical URL.

Tell Google Your Preferred Domain

You will have to tell Google which version of the website's URL you prefer for the domain. This can be something like https://www.example.com or https://example.com. If you set the latter as the preferred domain, Google will treat links to the former as if they pointed to the preferred version.

Once the preferred domain has been set, Google uses that information for future crawls of the website and for index refreshes. Google also takes your preference into account when displaying a URL. If you have not specified a preferred domain, Google may treat the www and non-www versions of the same domain as separate references to separate pages. The change may not be reflected in the index immediately: pages that currently appear in the index under the non-preferred version of the URL will stay that way until the index is refreshed.

To specify a preferred domain, go to the Search Console home page and click on the site you want to edit. Click on the gear icon, and then on Site Settings. In the Preferred Domain section, select the option that you want.

You may need to verify that you own both versions (www and non-www) of the domain. Setting a preferred domain affects crawling and indexing, which is why verification is needed for both versions. In general, both versions point to the same physical location, but this is not always the case. Once one version of the domain has been verified, Google can usually verify the other using the original verification method. However, if you have removed the verification meta tag, file, or DNS record, you will need to repeat the verification process from the start.

Indicating the Preferred URL with a Specific Link Element

Even when one URL is the preferred one, a variety of other URLs may still lead to the same content. You can indicate the preferred URL to search engines by doing the following:

Add a <link> element with the rel="canonical" attribute to the <head> section of each non-canonical version of the page.

This indicates the preferred URL for the content, making it more likely that search results will show this URL structure.

To avoid errors, use an absolute URL rather than a relative path in the rel="canonical" link element.
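As a rough illustration of the two points above, the following Python sketch extracts rel="canonical" links from a page and checks that each href is an absolute URL. The sample page, the example.com URL, and the helper names (CanonicalFinder, find_canonical, is_absolute) are assumptions made for this example only, and the parser does not check that the element actually sits inside <head>.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> element it sees."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens, so split before testing
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonical(html):
    """Return a list of canonical hrefs found in the given HTML text."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


def is_absolute(url):
    """True when the URL carries both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


# Hypothetical page used only for this demonstration
PAGE = """<html><head>
<title>Green dresses</title>
<link rel="canonical" href="https://www.example.com/dresses/green">
</head><body>...</body></html>"""

for href in find_canonical(PAGE):
    print(href, "absolute" if is_absolute(href) else "relative (avoid)")
```

A check like this could run as part of a site audit to flag pages whose canonical link is missing, duplicated, or relative.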

Using a Sitemap to Set Preferred URLs for the Same Piece of Content

Pick the preferred (canonical) URL for each of your web pages, and communicate your preferences by submitting the canonical URLs in a sitemap. Google does not guarantee that it will use the URLs submitted through a sitemap, but this is one of the better ways to tell Google which pages of a website you consider most important.
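A minimal sketch of such a sitemap, generated with Python's standard library, might look like the following. The URLs and the build_sitemap helper are hypothetical; the namespace is the one defined by the sitemaps.org protocol. Only canonical URLs are listed, and optional fields such as lastmod are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical_urls):
    """Return sitemap XML (as a string) listing one <url><loc> per canonical URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in canonical_urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical canonical URLs; duplicates such as session-ID variants are left out
sitemap_xml = build_sitemap([
    "https://www.example.com/dresses/green",
    "https://www.example.com/dresses/red",
])
print(sitemap_xml)
```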

301 Redirects for Non-canonical URLs

Imagine that your web page can be accessed through more than one URL structure. It would be wise to choose one of the URLs as the canonical (preferred) destination, and then use 301 redirects to send users from the other URLs to the preferred URL. A server-side 301 redirect is the best way to ensure that both search engines and users are directed to the correct page. A 301 status code indicates that a web page has moved permanently to a different location.
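The redirect decision itself can be sketched as a small pure function that a server or middleware could call for each request. This is only an illustration under stated assumptions: the preferred scheme and host (https and www.example.com) and the name redirect_for are made up for the example, and a real deployment would normally configure the redirect in the web server itself.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferred scheme and host for this example
CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.example.com"


def redirect_for(url):
    """Return (301, target_url) when url is on a non-canonical scheme or host,
    or None when the URL is already canonical and can be served directly."""
    parts = urlsplit(url)
    if parts.scheme == CANONICAL_SCHEME and parts.netloc.lower() == CANONICAL_HOST:
        return None
    target = urlunsplit(
        (CANONICAL_SCHEME, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
    )
    return 301, target


print(redirect_for("http://example.com/shoes?color=red"))
print(redirect_for("https://www.example.com/shoes"))
```

The path and query string are preserved so that each non-canonical URL lands on its direct canonical equivalent rather than on the home page.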

Indicating How to Handle Dynamic Parameters

You can use Parameter Handling to tell Google about any parameters that you would like it to ignore. Ignoring certain parameters can reduce the amount of duplicate content in Google's index, which ultimately makes the website easier to crawl. For example, you can indicate that a session ID in a link should be ignored.
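To make the effect of ignoring a parameter concrete, here is a small sketch that removes session-style parameters from a URL so that variants collapse to one address. The parameter names in IGNORED_PARAMS and the function name are assumptions for illustration; they are not a list that Google publishes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change the page content
IGNORED_PARAMS = {"sessionid", "sid"}


def strip_ignored_params(url):
    """Return url with every ignored query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_ignored_params("https://www.example.com/shoes?color=red&sessionid=abc123"))
```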

When Google detects duplicate content, an algorithm groups the duplicate URLs into a cluster and selects what it considers the best URL to represent the group in search results. Google then tries to consolidate what it knows about the URLs in the cluster, such as link popularity, onto the representative URL, which helps improve the accuracy of page ranking in Google's search results.
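The clustering idea can be illustrated with a toy example. This is not Google's algorithm, whose details are not public; it is a sketch that assumes duplicates are byte-identical pages and that the representative is chosen by a made-up heuristic (prefer https, then the shortest URL). All names and URLs are hypothetical.

```python
import hashlib
from collections import defaultdict


def cluster_duplicates(pages):
    """Group URLs whose bodies are identical.

    pages maps URL -> body text; the result is a list of URL lists.
    """
    clusters = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        clusters[digest].append(url)
    return list(clusters.values())


def pick_representative(urls):
    """Toy heuristic: prefer https, then the shortest URL, then alphabetical order."""
    return min(urls, key=lambda u: (not u.startswith("https://"), len(u), u))


# Hypothetical crawl results
pages = {
    "https://www.example.com/shoes": "Red shoes",
    "https://www.example.com/shoes?sessionid=abc": "Red shoes",
    "http://www.example.com/shoes": "Red shoes",
    "https://www.example.com/hats": "Blue hats",
}

for cluster in cluster_duplicates(pages):
    print(pick_representative(cluster), "represents", len(cluster), "URL(s)")
```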

When Google cannot find all of the URLs within a cluster, or does not choose the representative URL that you prefer, you can use Google's URL Parameters tool to tell it how to handle URLs that contain specific parameters.

NOTE: Use caution with the URL Parameters tool. If you make a mistake when specifying which content should not be crawled, Google could unintentionally exclude content that you do want crawled.

Using Canonical Links in HTTP Headers

If you are able to configure your server, you can use rel="canonical" HTTP headers to specify the canonical URL for HTML documents and for other files such as PDFs. For example, if a website makes the same PDF available through several different URLs, you can use a rel="canonical" header to tell Google which URL is canonical for the PDF file. These Link header elements are currently supported for web search only.
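For illustration, the header in question is a Link header whose value wraps the canonical URL in angle brackets and adds a rel parameter. The sketch below builds that header as a name and value pair; the PDF URL and the function name are hypothetical, and how the pair gets attached to a response depends on your server or framework.

```python
def canonical_link_header(canonical_url):
    """Return the (name, value) pair for a rel="canonical" Link header."""
    return ("Link", f'<{canonical_url}>; rel="canonical"')


# Hypothetical canonical location of a PDF that is reachable at several URLs
name, value = canonical_link_header("https://www.example.com/downloads/white-paper.pdf")
print(f"{name}: {value}")
```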

HTTP header fields are components of the header section of HTTP request and response messages. They define the operating parameters of an HTTP transaction. The fields are transmitted after the request line or response line, which is the first line of the message. Each field is a colon-separated name-value pair in clear-text string format, terminated by a carriage return and line feed sequence. An empty line indicates the end of the header section.
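The layout described above can be seen by splitting a raw response by hand. The response text below is invented for the example, and the parser is deliberately simplified: it keeps only the last value of a repeated header and ignores edge cases that a real HTTP library handles.

```python
# Hypothetical raw response: response line, header fields, empty line, then the body
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/pdf\r\n"
    'Link: <https://www.example.com/downloads/white-paper.pdf>; rel="canonical"\r\n'
    "\r\n"
    "%PDF-1.7 ..."
)


def parse_head(raw):
    """Split a raw message into its first line and a dict of header fields."""
    # The empty line (CRLF CRLF) marks the end of the header section
    head, _, _body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Only the first colon separates name from value; URLs contain colons too
        field_name, _, field_value = line.partition(":")
        headers[field_name.strip().lower()] = field_value.strip()
    return start_line, headers


start_line, headers = parse_head(RAW_RESPONSE)
print(start_line)
print(headers["link"])
```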

HTTP vs. HTTPS: What Google Prefers

Google prefers HTTPS pages over equivalent HTTP pages as canonical, except when there are issues or conflicting signals. A few examples of this are:

The HTTPS page has an invalid SSL certificate.
The HTTPS page contains a noindex robots meta tag.
The HTTPS page has a rel="canonical" link to the HTTP page.
The HTTPS page contains insecure dependencies.
The HTTPS page is blocked by robots.txt while the HTTP page is not.
The HTTPS page redirects users to an HTTP page.

Though Google's systems prefer HTTPS pages over HTTP pages by default, you can reinforce this behavior by doing one of the following:

Implement 301 or 302 redirects from your HTTP page to your HTTPS page.
Add a rel="canonical" link leading from the HTTP page to the HTTPS page.
Implement HSTS.
HSTS stands for HTTP Strict Transport Security, an internet security mechanism that helps protect websites from cookie hijacking and protocol downgrade attacks. It allows web servers to declare that web browsers should only access them using a secure HTTPS connection.
An HSTS Policy is communicated by the server to the user agent through an HTTP response header field named "Strict-Transport-Security". The policy specifies a period of time during which the user agent should only access the server in a secure way.
An HSTS Policy helps to protect web users from passive and active network attacks.
An initial request remains unprotected from active attacks if it uses an insecure protocol, such as plain HTTP, or if the URI for the initial request was obtained through a channel that was not secure.
The major web browsers address this limitation of HSTS by shipping a preloaded list of known sites that support HSTS.
Though these lists are large, they cannot cover the entire internet.
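As a sketch of what an HSTS Policy looks like on the wire, the function below builds the Strict-Transport-Security header with its max-age directive and the optional includeSubDomains directive. The one-year duration and the function name are choices made for this example, not recommendations from the article.

```python
def hsts_header(max_age_seconds, include_subdomains=False):
    """Return the (name, value) pair for a Strict-Transport-Security header."""
    value = f"max-age={max_age_seconds}"
    if include_subdomains:
        # Extends the policy to every subdomain of the host that sent the header
        value += "; includeSubDomains"
    return ("Strict-Transport-Security", value)


# One year in seconds, an arbitrary duration chosen for this example
ONE_YEAR = 365 * 24 * 60 * 60
print(hsts_header(ONE_YEAR, include_subdomains=True))
```

Browsers only honor this header when it arrives over HTTPS, which is why the first plain-HTTP request remains the weak point described above.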

To help keep Google from wrongly making the HTTP page the preferred URL, you should avoid the following practices. Bad SSL certificates and HTTPS-to-HTTP redirects may cause Google to strongly prefer the HTTP page over the HTTPS page, and even implementing HSTS will not override this preference. Avoid including the HTTP page, rather than the HTTPS version, in your sitemap or hreflang entries. Lastly, avoid implementing your SSL or TLS certificate for the incorrect host variant, for example sample.com serving a certificate issued for www.sample.com. The certificate has to match the complete website URL, or be a wildcard certificate that can be used for more than one subdomain within a domain.
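The host-variant requirement can be checked mechanically. The sketch below is a simplified matcher, not a full implementation of certificate name validation: it assumes a wildcard stands in for exactly one leftmost label, and the function name and certificate names are hypothetical.

```python
def cert_covers_host(cert_names, host):
    """Return True when one of the certificate's names matches the host.

    Simplified rule: a name such as *.sample.com covers www.sample.com,
    but not sample.com itself and not a.b.sample.com.
    """
    host = host.lower()
    for cert_name in cert_names:
        cert_name = cert_name.lower()
        if cert_name == host:
            return True
        if cert_name.startswith("*."):
            suffix = cert_name[1:]  # for example ".sample.com"
            if host.endswith(suffix):
                prefix = host[: -len(suffix)]
                # The wildcard must stand for exactly one non-empty label
                if prefix and "." not in prefix:
                    return True
    return False


print(cert_covers_host(["sample.com"], "www.sample.com"))
print(cert_covers_host(["*.sample.com"], "www.sample.com"))
```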

Author: Garenne Bigby
Website: http://garennebigby.com

Founder of DYNO Mapper and Former Advisory Committee Representative at the W3C.
