How to Block Access to Your Website Content
Last Edited September 11, 2023 by Garenne Bigby in Search Engine Optimization
Blocking a URL on your website stops Google from indexing that page and displaying it in Google's search results. When a URL is blocked, people looking through the search results will not be able to see or navigate to it, and they will not see any of its content. If there are pages of content that you would like to keep out of Google's search results, there are a few things you can do to accomplish this.
Control What is Being Shared With Google
Most people might not give this a second thought, but there are a few reasons someone would want to hide some of their content from Google.
- Keep your data secure. You may have private data on your website that you'd like to keep out of users' reach, such as contact information for members. This type of information should be blocked from Google so that members' data is not shown in Google's search results pages.
- Get rid of content from a third party. A website may display information rendered by a third-party source that is likely available in other places on the internet. When this is the case, Google sees less value in a website that contains large amounts of duplicate content. You can block the duplicate content to improve what Google sees, thus boosting your page in Google's search results.
- Hide less valuable content from your website visitors. If your website repeats the same content in multiple places, this can have a negative impact on the rankings you get in Google Search. You can perform a site-wide search to get a good idea of where your duplicate content might be, and to understand how it relates to users and how they navigate the website. Some search functions generate and display a custom search results page each time a user enters a query. Google will crawl each of these custom search results pages one by one if they are not blocked. Because of this, Google will see a website that contains many similar pages and may categorize the duplicate content as spam, which results in Google Search pushing the website further down the search results pages.
Blocking URLs Using Robots.txt
A robots.txt file is located at the root of the site and indicates the portion(s) of the site that you do not want search engine crawlers to access. It uses the Robots Exclusion Standard, a protocol with a small set of commands that indicate where web crawlers are and are not allowed to go.
Robots.txt works for web pages, but it should be used only to control crawling, for example so that the server isn't overwhelmed by crawlers working through duplicate content. Keeping this in mind, it should NOT be used to hide pages from Google's search results: other pages could point to your page, and the page could be indexed anyway, regardless of the robots.txt file. If you'd like to block pages from the search results, there are other methods, like password protection.
Robots.txt can also prevent image files from showing up in Google's search results, but it does not prevent other users from linking directly to a specific image.
- The limitations of robots.txt should be known before you build the file, as there are some risks involved. There are other mechanisms available to make sure that URLs are not findable on the web.
- The instructions given by robots.txt are only directives. They cannot enforce crawler behavior; they only point crawlers in the right direction. Well-known crawlers like Googlebot will respect the instructions given, but others might not.
- Each crawler may interpret syntax differently. Though the well-known crawlers obey the directives, each crawler could interpret the instructions in its own way, so it is vital to know the proper syntax for addressing each web crawler.
- Robots.txt directives cannot prevent other sites from referencing your URLs. Google is good about following robots.txt directives, but it may still find and then index a blocked URL from somewhere else on the web. Because of this, links and other publicly available information may still show up in the search results.
NOTE: combining more than one crawling or indexing directive may cause the directives to counteract each other.
Learn how to create a robots.txt file. First, you will need access to the root of the domain. If you don't know how to do this, contact your web host.
The syntax associated with robots.txt matters greatly. In its simplest form, the robots.txt file uses two keywords: User-agent and Disallow. Disallow is a command aimed at the user-agent telling it not to access a particular URL. User-agents are web crawler software, and most of them are listed online. Conversely, to give user-agents access to a specific URL in a child directory whose parent directory has been disallowed, use the Allow keyword.
- Google's user-agents include Googlebot (for Google Search) and Googlebot-Image (for image search). Most user-agents will follow the rules set up for your site, but those rules can be overridden by writing special rules for specific Google user-agents.
- Allow: the URL path, inside a subdirectory whose parent directory is blocked, that you'd like to unblock.
- Disallow: the URL path that you would like to block.
- User-agent: the name of the crawler that the rules apply to.
A user-agent line together with its allow or disallow lines is treated as a single entry in the file, and the rules apply only to the specified user-agent. To apply an entry to all user-agents, use an asterisk (*) as the user-agent name. A sample file is sketched below.
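As an illustration only, here is a minimal robots.txt sketch; the directory paths (/search/, /members/, /images/private/) are hypothetical placeholders, not paths taken from this article.

```
# Entry that applies to all crawlers
User-agent: *
# Hypothetical internal search results pages (duplicate content)
Disallow: /search/
# Hypothetical private members directory
Disallow: /members/
# Unblock one child path inside the blocked parent directory
Allow: /members/public/

# A separate entry that applies only to Google's image crawler
User-agent: Googlebot-Image
Disallow: /images/private/
```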
You will then need to save your robots.txt file correctly. Do the following so that web crawlers will be able to find and identify your file:
- Save the robots.txt file as a text file.
- Place the file within the highest-level directory of the website (or in the root of the domain).
- The file has to be named robots.txt.
- Example: a robots.txt file saved at the root of sample.com, at the URL http://www.sample.com/robots.txt, is discoverable by web crawlers, but a robots.txt file located at a URL like http://www.sample.com/not_root/robots.txt will not be found by web crawlers.
There is a Testing tool specifically for robots.txt, and it will show you if the file is successfully blocking Google's web crawlers from accessing specific links on your site. The tool is able to operate exactly like Googlebot does, and verifies that everything is working properly.
To test the file, follow these instructions:
- Open the testing tool for the site, and scroll through the code to find the logic errors and syntax warnings that will be highlighted.
- Enter the URL of a page on your website into the text box that is located at the bottom of the page.
- Select which user-agent you'd like to simulate from the drop-down menu.
- Select the TEST button.
- The button will read either Accepted or Blocked, indicating if the URL has been successfully blocked from web crawlers.
- As necessary, you will need to edit the file and then retest it. NOTE: the changes made on this page are not saved to your site! You will need to take additional action.
- You will need to copy the changes to the robots.txt file within your own site.
There are some limitations to the robots.txt testing tool. Changes made within the tool are not saved automatically to your own web server; you will have to copy the changes as described previously. The tester tool also only tests the file against Google's user-agents and crawlers, like Googlebot. Google is not responsible for how other web crawlers interpret the robots.txt file.
Finally, you will submit the file once it has been edited. Within the editor, click on Submit. Download your code from the tester page, and then upload it to the root of the domain. Verify, and then submit the live version.
Blocking URLs Through Directories That Are Password Protected
When there is private information or content that you do not want included in Google's search results, storing it within a password-protected directory on your website's server is the most effective way to block it. All web crawlers will be blocked from accessing the content contained within protected directories. One common way to set this up is sketched below.
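As a rough illustration only: on an Apache server, a directory is often password protected with HTTP basic authentication via an .htaccess file. The file paths and names below are hypothetical placeholders, and many hosts provide their own password-protection tools instead.

```
# .htaccess placed inside the directory to protect (Apache example)
AuthType Basic
AuthName "Private area"
# Hypothetical path to the password file, kept outside the web root
AuthUserFile /home/example/.htpasswd
Require valid-user
```

The matching .htpasswd file is typically generated with Apache's htpasswd utility.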
Blocking Search Indexing with Meta Tags
It is possible to block a page from appearing in Google Search by including a noindex meta tag in the page's HTML code. When Googlebot crawls that page and sees the meta tag, it will drop the page entirely from the search results, even if other websites link to it. NOTE: for this meta tag to work, the page must not be blocked by a robots.txt file. If it is blocked by that file, crawlers will never see the noindex meta tag, and the page might still come through in the search results if other pages link to it.
The noindex tag is very useful when you do not have access to the root of your server, as it lets you control the website one page at a time. If you'd like to prevent most search engines from indexing a specific page on your website, place the meta tag <meta name="robots" content="noindex"> in the <head> section of the page. If you'd like to prevent only Google from indexing the page, exchange "robots" for "googlebot". Various search engine crawlers interpret the noindex instruction differently, and it is possible that the page could still appear in the results of some search engines.
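For illustration, here is how that tag might sit in a page's <head>; everything outside the meta tags is placeholder markup.

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Example page</title>
    <!-- Keeps compliant search engines from indexing this page -->
    <meta name="robots" content="noindex">
    <!-- Or, to address only Google's crawler, use this tag instead -->
    <!-- <meta name="googlebot" content="noindex"> -->
  </head>
  <body>
    ...
  </body>
</html>
```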
You can help Google spot your meta tags when blocking access to certain content. Because Googlebot has to crawl the page in order to see the meta tag, it is possible for the noindex tag to be missed. If you know that a page you've tried to block is still showing up in search results, it may be that Google has not crawled the page since the tag was added. You can request that Google crawl the page again by using the Fetch as Google tool. If the content is still showing up, it is possible that the robots.txt file is stopping Google's web crawlers from seeing the link, so the tag can't be seen. If you'd like to unblock the page from Google, you will need to edit the robots.txt file, which can be done right from the robots.txt testing tool.
Opt Out of Displaying on Google Local and Similar Google Properties
It is possible to have your content blocked from being displayed on various Google properties after it has been crawled, including Google Local, Google Hotels, Google Flights, and Google Shopping. When you choose to opt out of these outlets, content that has been crawled will not be listed on them, and any content currently displayed on these platforms will be removed within 30 days of opting out. Opting out of Google Local applies globally; for the other Google properties, the opt-out applies to the services hosted on Google's domain.