Controlling Crawling and Indexing by Search Engines
- Last Edited April 19, 2026
- by Garenne Bigby
Controlling what search engines and AI crawlers access on your site used to be a narrow task — configure a robots.txt, maybe add a noindex meta tag, done. In 2026, the surface area is larger: you may also need to decide whether to allow GPTBot, ClaudeBot, Google-Extended, and a growing list of AI training crawlers, each with its own user-agent token. This guide covers the mechanisms still supported by major search engines and crawlers, the ones that were deprecated along the way, and the new AI-crawler considerations.
Crawling vs Indexing: The Key Difference
Crawling is a bot fetching a page — literally visiting the URL, reading the response, and following links. Indexing is a search engine deciding that page is worth keeping in its index so it can appear in results. They are independent operations, and the distinction matters:
- A page can be crawled but not indexed (Google read it but decided not to include it in search results).
- A page can be indexed without ever being crawled, if Google has enough external signals (backlinks, anchor text) pointing at it. This is the surprise that trips up people who Disallow a URL in robots.txt expecting it to disappear from search — robots.txt blocks crawling, not indexing.
If your goal is to keep a page out of search results, you need to allow crawling and return a noindex directive. If your goal is just to save crawl budget on junk URLs, robots.txt is the right tool.
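As a minimal illustration of the two goals (the path here is made up for the example), a page you want out of results stays crawlable but serves:
<meta name="robots" content="noindex">
while low-value URLs that only waste crawl budget get a robots.txt rule:
User-agent: *
Disallow: /internal-search/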
Never Use These Methods for Private Content
None of the methods below provide security. robots.txt is a publicly readable file; noindex directives still allow crawling (and the page is still reachable directly). If the content is actually private — user records, financial data, internal dashboards — use authentication, not crawler directives. Password protection, server-side access control, or IP allow-lists are the right tools.
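For instance, a minimal nginx sketch that restricts an internal dashboard to an office network, assuming an illustrative path and the documentation address range:
location /internal/ {
    allow 203.0.113.0/24;  # internal network only
    deny all;              # everyone else, including crawlers, gets 403
}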
Robots.txt: The Basics
A robots.txt file lives at the top level of your domain — example.com/robots.txt — and tells compliant crawlers which paths they should and should not fetch. The Robots Exclusion Protocol was formalized as RFC 9309 in September 2022, a decade after the format was already universal in practice. Google, Bing, major SEO crawlers, and most AI-training crawlers honor it.
A few things to know:
- The file must be UTF-8 or ASCII plain text. Word processors add invisible formatting that breaks parsing.
- URLs in directives are case-sensitive.
- Google dropped support for robots.txt over FTP in 2021. HTTP and HTTPS are the only supported protocols today.
- The most-specific matching user-agent block wins when multiple blocks apply to a crawler (see the example after this list).
- Google deprecated the unofficial noindex directive in robots.txt on September 1, 2019. If your file still contains noindex: lines, they are now ignored — move those rules to noindex meta tags or X-Robots-Tag headers instead.
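To see the precedence rule in action, consider a file with a wildcard group and a Googlebot-specific group (the paths are illustrative). Googlebot follows only the group that names it, so it may crawl /tmp/ but not /archive/, while every other compliant crawler is held to the wildcard rules:
User-agent: *
Disallow: /tmp/

User-agent: Googlebot
Disallow: /archive/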
Robots.txt Examples
Allow all crawling (equivalent to having no file at all):
User-agent: *
Disallow:
Disallow the entire site:
User-agent: *
Disallow: /
Disallow specific paths:
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Disallow: /*.pdf$
Allow one crawler, block everything else:
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
Block a specific crawler, allow everything else:
User-agent: BadBot
Disallow: /
User-agent: *
Allow: /
Include a Sitemap: directive at the end pointing to your XML sitemap so crawlers can find your full URL list quickly:
Sitemap: https://example.com/sitemap.xml
Robots Meta Tag
For per-page control of indexing (rather than per-path control of crawling), use a robots meta tag in the HTML <head>. The correct syntax uses content=, not value= — a common typo that does nothing:
<meta name="robots" content="noindex">
Common directives you can combine in content:
- noindex — do not include this page in the index.
- nofollow — do not follow any links on this page.
- noarchive — do not store a cached copy.
- nosnippet — do not show a snippet in search results.
- max-snippet:N, max-image-preview:[none|standard|large], max-video-preview:N — fine-grained control over snippet appearance.
- unavailable_after: DATE — stop indexing after a given date (useful for time-bound content).
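Several directives can be combined in a single tag; an illustrative combination:
<meta name="robots" content="noindex, nofollow, noarchive">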
To target a specific crawler, replace robots with the crawler’s user-agent token (for example <meta name="googlebot" content="noindex">).
X-Robots-Tag HTTP Header
For non-HTML content — PDFs, images, video, API responses — you cannot add a meta tag. Use the X-Robots-Tag HTTP response header instead. It accepts the same directives as the robots meta tag:
X-Robots-Tag: noindex, nofollow
Apache example for all PDFs:
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
nginx example:
location ~* \.pdf$ {
add_header X-Robots-Tag "noindex";
}
Blocking AI Crawlers
The biggest change to this topic since 2023 is the rise of AI training and answer-engine crawlers. Each major AI provider has published a user-agent string and honors robots.txt opt-outs. The common ones worth knowing:
- GPTBot — OpenAI’s training crawler. Controls what ChatGPT models see in future training runs. Does not control what ChatGPT’s browsing features fetch at query time.
- ClaudeBot — Anthropic’s training crawler.
- PerplexityBot — Perplexity’s answer-engine crawler.
- CCBot — Common Crawl. Used as input by many open-source and commercial AI training pipelines.
- Google-Extended — Google’s product-level opt-out. Does not affect Search indexing, but does control whether your content feeds Gemini and Vertex AI.
- Applebot-Extended — Apple’s equivalent for Apple Intelligence.
- Bytespider — ByteDance (TikTok) crawler. Widely blocked for aggressive behavior.
To opt out of AI training without affecting search indexing:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
New AI crawlers appear frequently. Sites like Dark Visitors and Cloudflare’s AI Audit maintain updated lists of user-agent tokens so you can extend your robots.txt as the landscape changes.
Google’s Crawlers in 2026
Google runs many crawlers beyond the main Googlebot. See Google’s official crawler documentation for the authoritative list; the common ones:
- Googlebot — main crawler. Since mobile-first indexing was completed in 2023, most crawling is done by Googlebot Smartphone; Googlebot Desktop still runs but sees a smaller share of traffic.
- Googlebot Image / Googlebot Video / Googlebot News — vertical-specific crawlers for Image Search, Video, and News.
- AdsBot-Google / AdsBot-Google-Mobile — checks landing pages for Google Ads quality scoring. Does not obey the generic User-agent: * rule; you have to target it by name (see the example after this list).
- Google-Extended — AI training opt-out (see above).
- APIs-Google — delivers push notifications via HTTP POST to apps registered for them. Narrower scope than the other crawlers.
- Google-InspectionTool — used by Search Console’s URL Inspection.
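Because AdsBot ignores the wildcard group, excluding it from a path means naming it directly. A minimal sketch, with an illustrative path:
User-agent: AdsBot-Google
Disallow: /experiments/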
All Googlebot variants should be verified via reverse DNS lookup if you are making decisions based on user-agent alone — the string is trivial to spoof. A genuine Googlebot request resolves to a googlebot.com or google.com hostname.
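A minimal Python sketch of that verification, using only the standard library (the function name is ours, and the IP in the usage comment is just an example value):
import socket

def is_verified_googlebot(ip: str) -> bool:
    # Reverse-resolve the IP, check the hostname, then confirm the hostname
    # forward-resolves back to the same IP (IPv4-only sketch).
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS (PTR) lookup
    except socket.herror:
        return False                                         # no PTR record at all
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False                                         # wrong domain, likely spoofed
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward confirmation
    except socket.gaierror:
        return False
    return ip in forward_ips

# Usage: is_verified_googlebot("66.249.66.1")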
Frequently Asked Questions
Does robots.txt keep pages out of Google?
No. Robots.txt prevents crawling, but a URL that Google cannot crawl can still be indexed if external links point to it — it appears in search results with no snippet and the label “No information is available for this page.” To keep a page out of the index entirely, allow crawling and return a noindex directive in the meta tag or X-Robots-Tag header.
Can I use noindex in robots.txt?
Not anymore. Google deprecated the unofficial noindex: directive in robots.txt on September 1, 2019. Use the robots meta tag or X-Robots-Tag header instead.
Do AI crawlers obey robots.txt?
The major ones do — GPTBot, ClaudeBot, PerplexityBot, CCBot, Google-Extended, Applebot-Extended all publish user-agent tokens and honor Disallow directives. Smaller or scraper-grade crawlers often do not. For reliable blocking of aggressive bots, pair robots.txt with rate-limiting, a Web Application Firewall (Cloudflare, Fastly), or Cloudflare’s AI Audit blocking features.
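For example, a minimal nginx sketch that refuses a non-compliant crawler by user-agent string (user-agent strings are easy to spoof, so treat this as one layer rather than a guarantee):
# inside a server { } block
if ($http_user_agent ~* "Bytespider") {
    return 403;
}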
Should I block AI crawlers?
It depends on your content strategy. Allowing them means your writing may inform future LLM outputs — with or without attribution. Blocking them removes you from training data but does not remove you from AI Overviews or ChatGPT’s real-time browsing (those use different crawlers or real-time fetches). Many publishers block training crawlers while allowing search crawlers so they stay in SERPs but not in model training sets.
Bottom Line
Controlling what gets crawled and indexed in 2026 is a three-layer job: robots.txt for path-level crawling decisions, meta tags and X-Robots-Tag headers for per-page and per-resource indexing decisions, and explicit AI-crawler user-agent rules for anything you want kept out of training data. Get the layers right, remember that none of them substitute for real authentication on private content, and skip the dead tactics — robots.txt noindex:, FTP-hosted robots.txt, value="noindex" meta tags — that outdated guides still propagate.