
All About the Robots.txt File

The robots.txt file is one of the oldest and most powerful tools in technical SEO — a simple text file at the root of your site that tells web crawlers which parts of your site they can and can’t access. It sounds straightforward, but robots.txt is also one of the easiest files to misconfigure in ways that quietly destroy your search visibility.

A lot has changed since this guide was first written. In 2019, Google deprecated the noindex directive in robots.txt. In 2022, the Internet Engineering Task Force (IETF) ratified the Robots Exclusion Protocol as RFC 9309, making robots.txt an official internet standard after 28 years of de facto use. And in 2024, a wave of AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and others) made robots.txt relevant again as the main tool for controlling whether your content is used to train AI models and surface in AI search products.

This guide covers everything you actually need to know about robots.txt in 2026 — what it does, how to create and test one, the modern directives (and deprecated ones to avoid), AI bot management, and the common mistakes that silently hurt rankings. For the broader picture of how crawling fits into indexing and ranking, see our crawlability vs. indexability guide.


Is a Robots.txt File Important?

Yes — and the answer matters more in 2026 than it did in 2019. Robots.txt is now the primary mechanism for:

  • Controlling which pages search engine bots crawl (to manage crawl budget on large sites)
  • Preventing low-value sections of your site from being crawled at all
  • Pointing crawlers to your sitemap
  • Managing AI crawler access — deciding whether OpenAI’s GPTBot, Anthropic’s ClaudeBot, Perplexity’s PerplexityBot, and Common Crawl’s CCBot can use your content for model training and real-time retrieval

But robots.txt is not a security mechanism. Anyone can read your robots.txt (it’s a public file at yoursite.com/robots.txt), and a well-behaved crawler respects it — but malicious bots can and do ignore it. For actually keeping content private, use authentication, not robots.txt.

How to Tell If You Have a Robots.txt File

The fastest check: visit https://yoursite.com/robots.txt in your browser. If a text file loads, you have one. If you get a 404, you don’t — and crawlers will assume all content is crawlable by default.

In Google Search Console, the Settings → robots.txt report (rolled out in 2023, replacing the legacy robots.txt Tester) shows the latest version Google has fetched, when it was last fetched, whether Google could parse it, and any warnings. This is the authoritative view of what Google actually sees.

Reasons to Have a Robots.txt File

  • Protect crawl budget on large sites by excluding URL parameters, filter combinations, internal search results, and other low-value URLs from being crawled repeatedly.
  • Keep staging and development environments out of search (alongside authentication).
  • Point crawlers to your sitemap with a Sitemap: directive.
  • Control AI training and retrieval access — one of the most important uses of robots.txt in 2026 (a combined example follows this list)
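
Here is a sketch that combines several of these uses in one file. The paths and parameter names are placeholders; swap in whatever actually generates low-value URLs on your site:

User-agent: *
Disallow: /search/
Disallow: /*?sort=
Disallow: /*?filter=

User-agent: GPTBot
Disallow: /

Sitemap: https://yoursite.com/sitemap.xml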

When Not to Have a Robots.txt File

Very small sites — a single-page site, a tiny blog, a new project — often don’t need one. If you have nothing specific to block and no sitemap yet, an empty or missing robots.txt is fine. Google defaults to “crawl everything” in the absence of a robots.txt, which is usually what you want for a small site.

How to Create a Robots.txt File

Robots.txt is a plain text file at the root of your domain (https://yoursite.com/robots.txt). Every CMS and most hosting environments make it easy to create:

  • WordPress: Yoast SEO, Rank Math, and The SEO Framework all include robots.txt editors. Without a plugin, WordPress auto-generates a basic virtual robots.txt.
  • Shopify: edit the robots.txt.liquid theme template via the theme code editor (support added in 2021 — before that Shopify sites used a locked default).
  • Static sites: add robots.txt to your root directory during build.
  • Custom servers: create the file manually at your web root.

Basic Robots.txt Syntax

The format is simple: groups of rules, one group per user-agent (bot). Each rule is Disallow: or Allow: followed by a URL path.

User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yoursite.com/sitemap.xml

Breakdown:

  • User-agent: * — applies to all bots.
  • Disallow: /path/ — blocks this path and everything under it.
  • Allow: /path/file — explicitly allows a specific URL inside a disallowed directory. Google applies the most specific (longest) matching rule, which is why this Allow overrides the broader Disallow.
  • Sitemap: — points crawlers to your XML sitemap (absolute URL, one per line, can repeat).
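
If you publish several sitemaps, list each on its own line (the filenames here are placeholders):

Sitemap: https://yoursite.com/sitemap-pages.xml
Sitemap: https://yoursite.com/sitemap-posts.xml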

You can also target specific bots with their own groups:

User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /

User-agent: *
Disallow: /admin/

Wildcards and Pattern Matching

Google, Bing, and most modern crawlers support two wildcards in paths:

  • * — matches any sequence of characters.
  • $ — matches the end of a URL.

Useful patterns:

Disallow: /*?   # blocks URLs with query strings
Disallow: /*.pdf$   # blocks PDF files
Allow: /blog/*.jpg$   # allows only .jpg files in /blog/

Deprecated: noindex in Robots.txt

Google officially dropped support for noindex in robots.txt in September 2019. If you have rules like Noindex: /path/ in your file, they’re being ignored by Google. For removing a page from the index, use the <meta name="robots" content="noindex"> meta tag or an X-Robots-Tag HTTP header instead.
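
For non-HTML files like PDFs, where a meta tag isn’t possible, the HTTP header is the mechanism that works. A response carrying it looks like this:

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex

Either mechanism requires the URL to remain crawlable: if robots.txt blocks the page, Google never sees the noindex signal at all.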

Crawl-Delay: Not Supported by Google

Another common misconception: Google has never supported the Crawl-delay: directive in robots.txt. Bing and Yandex honor it; Googlebot ignores it entirely. Google also retired the Search Console crawl-rate limiter in early 2024, so there is no manual crawl-rate setting for Google anymore. Googlebot adjusts its own pace, and temporarily returning 500, 503, or 429 responses is the documented way to slow it down.
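
If you do want to throttle Bing or Yandex, scope the directive to their own groups so the intended audience is explicit:

User-agent: Bingbot
Crawl-delay: 5

User-agent: Yandex
Crawl-delay: 5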

Managing AI Crawlers

In 2026, the most common reason to edit your robots.txt is to control AI crawler access. Each major AI product uses a distinct user-agent:

  • GPTBot — OpenAI’s crawler for training ChatGPT models
  • OAI-SearchBot — OpenAI’s real-time crawler for ChatGPT Search answers
  • ClaudeBot — Anthropic’s crawler for training Claude models
  • Claude-User / Claude-SearchBot — Anthropic’s retrieval crawlers for user-initiated fetches and search
  • PerplexityBot — Perplexity’s crawler
  • Google-Extended — controls whether Google uses your content for Gemini model training and grounding in Vertex AI (separate from regular Googlebot)
  • CCBot — Common Crawl, used as a training data source by many AI products
  • FacebookBot / Meta-ExternalAgent — Meta’s AI training crawler

If you want to block all AI training:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

Note that blocking Google-Extended does not affect regular Googlebot or your Google Search rankings — it only controls Google’s AI product training. Similarly, blocking GPTBot doesn’t affect your appearance in regular search. This gives you fine-grained control: you can stay indexed in Google Search while opting out of AI training.

How to Test Your Robots.txt File

Testing before deploying changes is essential — a bad line can accidentally block your entire site from Google.

Google Search Console robots.txt report (Settings → robots.txt) is the current authoritative tool. It shows the latest version Google fetched, any parsing warnings, and lets you test whether a specific URL is blocked. This replaced the standalone robots.txt Tester that used to live under “Legacy tools and reports.”

Third-party testers — TechnicalSEO.com’s robots.txt tester, Sitebulb, and Screaming Frog can all validate robots.txt syntax and test URL-level access. Run your changes through at least one tester before publishing.

Manual check — fetch yoursite.com/robots.txt in your browser after updates to confirm it deployed and has no syntax errors.

Common Robots.txt Mistakes to Avoid

A misconfigured robots.txt is one of the most damaging technical SEO mistakes you can make. Things to avoid:

Disallowing the entire site accidentally. A line like Disallow: / under User-agent: * blocks every bot from crawling any URL. This happens surprisingly often during development-to-production pushes (the staging robots.txt gets deployed by mistake). Always check your live robots.txt after a deploy.
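
For the record, this is the entire two-line file that takes a site out of circulation, so it is worth recognizing on sight:

User-agent: *
Disallow: /

If your staging environment needs that file, pair it with authentication and make sure your deploy process never copies it to production.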

Blocking CSS and JavaScript. Old SEO guides sometimes recommended blocking /wp-content/ or JS directories for “cleaner indexing.” That advice is wrong for modern Google. Googlebot needs to fetch CSS and JavaScript to render your pages properly; blocking them means Google can’t see what users see, which hurts rankings. Allow CSS and JS unless you have a specific reason to block them.
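
The fix is removing the offending rules rather than adding counter-rules, since under Google’s longest-rule-wins matching a short Allow may not override a longer Disallow. If your file contains legacy lines like these (directory names old WordPress guides commonly recommended blocking), delete them:

User-agent: *
Disallow: /wp-includes/
Disallow: /wp-content/themes/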

Using robots.txt to hide content from search. Blocking a URL in robots.txt prevents crawling but not indexing. Google can still index URLs it finds through backlinks, showing them in results with a “no information available” label. To actually keep a page out of the index, use a noindex meta tag (which requires the page to be crawlable so Google can see the tag).

Relying on robots.txt for security. Robots.txt is public. Listing sensitive directories in it is like drawing attention to them. For real privacy, use authentication.

Ignoring case sensitivity. Robots.txt paths are case-sensitive: Disallow: /Admin/ doesn’t block /admin/. If your site has mixed-case URLs, account for both.
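
For example, if both casings exist on your site, you need both rules:

Disallow: /admin/
Disallow: /Admin/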

Not updating sitemap declarations. If you move or restructure your sitemap, update the Sitemap: line in robots.txt. Stale references waste crawler discovery time.

Frequently Asked Questions

Do I need a robots.txt file?
Technically no — sites without a robots.txt are crawled by default. But most sites benefit from one: even a minimal robots.txt with a Sitemap: directive and a few sensible Disallow rules (admin paths, cart pages, internal search) is worth having. The bar for creating one is low.
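
A reasonable minimal file looks like this; treat the paths as placeholders for your own admin, cart, and internal-search URLs:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /cart/
Disallow: /search/

Sitemap: https://yoursite.com/sitemap.xml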

Should I block AI crawlers from my site?
It depends on your goals. Blocking AI training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) prevents your content from being used to train models — but may also reduce your visibility in AI search products. Allowing them can drive citations in ChatGPT, Claude, and Perplexity answers, which is increasingly valuable referral traffic. Many publishers allow retrieval bots (which cite sources) and block training bots (which don’t). Read each bot’s published policy before blocking.
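
Here is that split in practice, using OpenAI’s two agents as the example: the training bot is blocked, while the search bot (which cites sources) keeps full access:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /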

Can robots.txt remove a page from Google’s index?
No. Robots.txt prevents crawling, not indexing. A URL blocked in robots.txt can still appear in search results if Google discovers it through backlinks. To remove a page from the index, use a noindex meta tag or the URL Removal tool in Google Search Console. For more on the difference, see our crawlability vs. indexability guide.

What happens if my robots.txt is temporarily unreachable?
Google’s documented behavior depends on the status code. A 4xx response (including 404) is treated as “no robots.txt exists,” and Google crawls everything. A 5xx response is handled more cautiously: Google pauses crawling at first, then falls back to the last successfully fetched copy for up to about 30 days. If the errors persist beyond that window, Google generally behaves as if no robots.txt exists and crawls without restrictions.

Bottom Line

Robots.txt is a simple file that does a lot of heavy lifting. In 2026, it’s both the oldest technical SEO tool (first proposed in 1994) and the newest (the primary mechanism for controlling AI crawler access). Get it right and it manages crawl budget, points bots to your sitemap, and controls which AI products can use your content. Get it wrong and you can accidentally de-index your entire site overnight.

The core rules: keep it simple, test every change, never block CSS/JS, remember it’s not security, and stay current on AI bot names as new crawlers appear. For how robots.txt interacts with indexing decisions and broader on-page SEO, see our guides on crawlability vs. indexability, on-page SEO tips, and duplicate content issues.
