How to Get Google to Crawl the Right Content
- Last Edited April 19, 2026
- by Garenne Bigby
“Getting Google to crawl the right content” is really two problems in one: making sure Googlebot can find and render the pages you want ranked, and making sure it doesn’t waste time on (or worse, index) the pages you don’t. The tools for both changed substantially with Google’s 2018 Search Console redesign, and more has changed since: Fetch as Google is gone, the old Blocked Resources Report is gone, and noindex via robots.txt stopped working in September 2019. Here’s how crawl control actually works in 2026.
How Googlebot finds content
Google discovers new URLs three ways:
- Following internal links from pages it already knows about — the dominant path for most sites.
- Following external links from other domains that link to yours.
- XML sitemaps submitted through Search Console or declared in robots.txt — especially useful for pages that aren’t well-linked internally.
Once discovered, a URL enters the crawl queue. Googlebot fetches the HTML on first pass, then queues JavaScript rendering for a second pass (performed on an evergreen Chromium-based renderer since 2019). The fully-rendered content is what gets indexed. Our crawl budget guide covers how crawl demand and capacity interact; for most sites under a few thousand pages, Google will crawl everything it wants to without budget intervention.
Expose what you want Google to find
Visibility starts with making the right pages reachable, renderable, and unambiguous.
Keep critical pages well-linked
Orphan pages — those without internal inbound links — are the most common reason pages go undiscovered. Every page worth indexing should be linked from at least one other indexable page. Important pages should sit within 3 clicks of the homepage. Navigation menus, breadcrumbs, related-content modules, and pillar/cluster architectures all contribute to that reachability.
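As one concrete illustration (markup and paths are hypothetical), even a plain breadcrumb trail gives every deep page a crawlable link back up the hierarchy:
<nav aria-label="Breadcrumb">
  <a href="/">Home</a> › <a href="/guides/">Guides</a> › Crawl control
</nav>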
Submit an XML sitemap
A sitemap gives Google a direct list of canonical URLs to consider crawling. Requirements haven’t changed in years — 50,000 URLs and 50 MB uncompressed per file, with sitemap indexes to aggregate more. Submit through Search Console at Indexing → Sitemaps. Google deprecated the sitemap ping endpoint in June 2023; now Google picks up changes automatically via the Last-Modified HTTP header and <lastmod> field. Keep those accurate and let Google poll on its own cadence.
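A minimal sitemap file looks like this (URLs are illustrative; only <loc> is required, and <lastmod> should reflect real content changes):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/guides/crawl-control/</loc>
    <lastmod>2026-04-01</lastmod>
  </url>
</urlset>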
Don’t accidentally block CSS, JavaScript, or images
Googlebot renders pages the way a browser does. If your robots.txt disallows /wp-content/, /assets/, or similar, you can prevent Google from loading the CSS and JS it needs to understand the page. Modern best practice: allow all static-asset directories; block specific low-value URLs (internal search results, filter URLs, session-ID parameters) rather than entire directories. Legacy Disallow: /*.css$ or Disallow: /wp-includes/ rules should be removed from any current robots.txt.
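As a sketch of that pattern (directory and parameter names here are illustrative, not a recommendation for any specific CMS):
# Legacy rules like these starve the renderer; delete them:
#   Disallow: /wp-includes/
#   Disallow: /*.css$
User-agent: *
# Target specific low-value URL patterns instead
Disallow: /search
Disallow: /*?sessionid=
Disallow: /*?sort=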
Render content server-side when SEO matters
Client-side-rendered pages (classic React/Vue/Angular SPAs without SSR) rely entirely on Google’s second-pass JavaScript execution. That delay is unpredictable and many AI crawlers (GPTBot, ClaudeBot, PerplexityBot, CCBot) don’t render JavaScript at all. Use server-side rendering (Next.js, Nuxt, Astro, SvelteKit, Remix) or static generation for pages you want indexed and cited. Our JavaScript SEO guide covers rendering strategies in detail.
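You can see the difference in the raw HTML response (a simplified illustration; element names and content are arbitrary):
<!-- Client-side-only: the initial HTML a non-rendering crawler receives -->
<div id="root"></div>
<!-- Server-rendered: the same route with SSR, content present in the first response -->
<div id="root">
  <h1>Blue Widget 3000</h1>
  <p>In stock. Ships in 2 days.</p>
</div>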
Use canonical tags to consolidate
When similar content exists at multiple URLs (filter variants, tracking-tagged URLs, trailing slashes, HTTP and HTTPS versions), a self-referencing <link rel="canonical"> on each page tells Google which URL to treat as authoritative. Without canonicals, Google picks for you — sometimes wrongly.
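For example, a tracking-tagged variant can declare the clean URL as canonical (URLs are illustrative):
<!-- Served at https://example.com/shoes/?utm_source=newsletter -->
<link rel="canonical" href="https://example.com/shoes/">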
Block what you don’t want Google to index
Crawl control has two distinct mechanisms that are frequently confused. Use the right one for the right job:
robots.txt — controls crawling, not indexing
robots.txt tells Googlebot not to fetch specified URLs. It does not prevent those URLs from appearing in search results if Google learns about them through external links. If you Disallow a URL but external sites link to it, it may appear in results with a “No information is available for this page” snippet. The Robots Exclusion Protocol was formally standardized as RFC 9309 in September 2022.
What robots.txt is good for:
- Blocking low-value filter URLs, internal search results, session-scoped URLs.
- Preventing Googlebot from wasting crawl budget on URLs that will never rank.
- Blocking access to admin endpoints, staging subdomains, or cart/checkout URLs.
- Controlling AI crawlers (see below).
What robots.txt is not good for:
- Removing already-indexed URLs from search results.
- Hiding private content from the public — anyone who knows the URL can still load it.
noindex — removes from search results
To keep a URL out of search results entirely, use the noindex meta tag or X-Robots-Tag HTTP header on the page. Critical: Google stopped supporting noindex as a robots.txt directive in September 2019. If you still have Noindex: lines in robots.txt, they do nothing.
The correct way to prevent indexing:
<meta name="robots" content="noindex">
or via HTTP header:
X-Robots-Tag: noindex
For noindex to work, Googlebot must be able to crawl the page to see the directive. Don’t disallow a URL in robots.txt and add a noindex meta tag — Google won’t crawl the page to see the noindex, and the URL may linger in the index.
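The header route is the only option for non-HTML files such as PDFs, which can’t carry a meta tag. A minimal sketch for Apache, assuming mod_headers is enabled (the file pattern is illustrative):
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>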
Password protection — hides content entirely
For genuinely private content (staging environments, internal wikis, preview URLs), HTTP authentication or application-level login is the only real protection. robots.txt is a politeness request that well-behaved crawlers honor; any crawler can ignore it, and adversarial scrapers, malicious bots, and even some legitimate ones (e.g., certain archival services) do.
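A minimal sketch for nginx, assuming a hypothetical staging host and an existing .htpasswd file:
server {
    server_name staging.example.com;
    # Require a login for the entire staging host
    auth_basic           "Staging environment";
    auth_basic_user_file /etc/nginx/.htpasswd;
    # ...root, TLS, and proxy settings as usual
}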
URL removal for emergencies
Search Console’s Removals tool can temporarily hide URLs from Google search results (about 6 months). Useful when you need to suppress a page immediately while implementing a permanent fix (noindex, 410 Gone, or deletion). Not a substitute for permanent deindexing.
Controlling AI crawlers
Since mid-2023, a new class of crawlers has emerged — LLM training and retrieval bots from OpenAI, Anthropic, Google, and others. They honor robots.txt and respond to specific user-agent directives. Common entries:
# OpenAI training + search
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
# Anthropic
User-agent: ClaudeBot
Allow: /
# Perplexity
User-agent: PerplexityBot
Allow: /
# Common Crawl (used by many LLMs)
User-agent: CCBot
Disallow: /
# Google's AI/Gemini + AI Overviews
User-agent: Google-Extended
Allow: /
# Meta / LLaMA training
User-agent: Meta-ExternalAgent
Allow: /
Each crawler has a specific policy. GPTBot crawls for OpenAI model training (many sites block it). OAI-SearchBot serves ChatGPT’s live search (more sites allow it, since it drives referral traffic). Google-Extended is Google’s specific opt-out for Gemini and AI Overviews training. Allowing Googlebot but blocking Google-Extended keeps you in search results while excluding you from AI training datasets; blocking Googlebot itself removes you from search entirely.
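That search-yes, training-no configuration is just two robots.txt groups (a sketch):
# Stay in Google Search
User-agent: Googlebot
Allow: /
# Opt out of Gemini / AI Overviews training
User-agent: Google-Extended
Disallow: /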
Check what Google actually sees
Diagnostics moved from the old standalone tools (Fetch as Google, Blocked Resources Report) into URL Inspection and the modern Search Console reports.
URL Inspection tool
URL Inspection — accessed from the top search bar in Search Console — is the direct replacement for Fetch as Google (retired in 2018). Paste any URL on your property to see:
- Indexed URL vs submitted canonical — whether Google chose the same canonical you did.
- Last crawl date and crawl outcome — successful, blocked, failed.
- Rendered HTML and screenshot — what Googlebot actually saw after JavaScript execution. This is the closest 2026 equivalent to “fetch as Google.”
- Robots.txt status — whether the URL is blocked by robots.txt.
- Page resource loading — which CSS, JavaScript, and image files Googlebot could or couldn’t access while rendering.
- Request Indexing button — submits the URL to an accelerated crawl queue (replacing Fetch as Google’s “Submit to Index”).
Pages report
Under Indexing → Pages, Google reports which URLs on your site are indexed and which aren’t, with specific reasons for the non-indexed ones: Crawled — currently not indexed, Discovered — currently not indexed, Blocked by robots.txt, Excluded by noindex tag, Duplicate without user-selected canonical, Redirect, and more. This is the authoritative answer to “why isn’t page X showing up?”
Crawl Stats report
Under Settings → Crawl Stats, you get 90 days of data on total crawl requests, response times, host status, and breakdowns by file type, response code, purpose (discovery vs refresh), and Googlebot type (smartphone, desktop, image, video). Sudden drops or spikes in 5xx responses are signals worth investigating.
Robots.txt report
The old standalone robots.txt Tester was folded into a consolidated Settings → robots.txt report in 2023. It shows Googlebot’s current view of your robots.txt file (it may be cached for up to 24 hours), parsing status, and any syntax errors. For ad-hoc syntax testing, use a third-party validator or Google’s open-source robots.txt parser.
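Google’s open-source parser (github.com/google/robotstxt) builds a small command-line checker; the invocation below follows the project’s README, but verify the binary name and arguments against your build:
# Check whether a user-agent may fetch a URL under a local robots.txt
robots /path/to/robots.txt Googlebot https://example.com/private/page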
Common mistakes
- Disallowing CSS/JS directories. Googlebot needs them to render pages. Check robots.txt for any blanket blocks on /css/, /js/, /wp-includes/, or /assets/.
- Disallowing and noindexing the same URL. Pick one: disallow for crawl control, noindex for index control. Doing both usually means the noindex never takes effect.
- Still using Noindex: in robots.txt. Google stopped supporting this in September 2019. Move to meta robots or X-Robots-Tag.
- Blocking Googlebot from staging accidentally. Staging sites should use HTTP auth or IP restrictions, not robots.txt — and production robots.txt should not be copied from staging.
- Relying on robots.txt to hide sensitive content. It’s a politeness request, not a security mechanism.
- Submitting a sitemap with 404 URLs. Google’s Pages report flags these under “Submitted URL not found”. Clean dead URLs out of sitemaps regularly.
- Leaving test/dev URLs in sitemaps. Keeps Google re-crawling pages you don’t want indexed.
- Client-side-only rendering of critical content. Even with Googlebot’s evergreen renderer, second-pass delay makes client-side-only content unreliable — and AI crawlers largely can’t execute JavaScript at all.
A practical crawl-control audit
For any site, a quick audit flow that catches most problems:
- View your robots.txt at /robots.txt. Is it intentional? Are any CSS/JS/image directories blocked? Are there stale Noindex: directives (now ignored)?
- Search Console → Settings → robots.txt. Does Google’s view match what you expect?
- Search Console → Indexing → Sitemaps. Are your sitemaps submitted and all in “Success” status?
- Search Console → Indexing → Pages. How many URLs are “Not indexed” and why? Any surprises in the reasons breakdown?
- URL Inspection on a key page. Does Google’s rendered view include the content you see in your browser?
- Crawl Stats for anomalies. Sudden drops in requests, spikes in 5xx responses, or slow response times?
- AI crawler coverage in robots.txt. Have you made intentional decisions about GPTBot, ClaudeBot, Google-Extended, and CCBot?
Frequently asked questions
Is “Fetch as Google” still available?
No. It was retired in 2018 when Search Console was redesigned. URL Inspection is the direct replacement — same core functionality (fetch, render, screenshot, resource-loading, submit-to-index) in the modern Search Console interface.
Where’s the Blocked Resources Report?
Also retired. Resource-blocking diagnostics are now surfaced per-URL through URL Inspection’s “Page resources” panel, which shows exactly which CSS, JavaScript, and image files Googlebot could or couldn’t load while rendering the page.
Should I still use robots.txt in 2026?
Yes. It remains the standard way to control crawl access and has been formalized as RFC 9309 since September 2022. Use it to block low-value URLs, control AI crawlers, and point Google at your sitemap (Sitemap: https://example.com/sitemap.xml). Don’t use it to try to prevent indexing or hide sensitive content.
How do I fix “Crawled — currently not indexed”?
This status usually means Google crawled the page, decided it wasn’t worth indexing, and moved on. Causes: thin content, low-quality writing, near-duplicate of another page, low authority signals, or the page isn’t meaningfully different from competing results. The fix is rarely technical — it’s improving the page’s quality, adding meaningful content, and building topical relevance.
Will allowing AI crawlers hurt my SEO?
Allowing Googlebot is required for Google Search. Google-Extended is a separate directive for Gemini and AI Overviews — allowing it helps you appear as a cited source in AI Overviews (which earn ~35% more clicks than uncited pages). Blocking GPTBot or ClaudeBot doesn’t affect Google rankings but prevents those specific LLMs from using your content in training data. Each decision is independent.
What’s the rate limit on URL Inspection’s “Request Indexing”?
Roughly 10–12 submissions per day per property in 2026, down from higher limits in earlier years. For bulk re-indexing after major site changes, rely on the updated XML sitemap and Google’s natural recrawl cadence rather than manual submissions.
Bottom line
Crawl control in 2026 is about three clear jobs: expose your indexable content through clean internal linking, server-rendered HTML, and submitted sitemaps; block what shouldn’t be indexed using the right mechanism (robots.txt for crawl control, noindex for index removal, password protection for privacy); and verify what Google actually sees via URL Inspection, the Pages report, and Crawl Stats. The Fetch-as-Google / Blocked-Resources-Report era is over — everything those tools did is now surfaced in URL Inspection and the modern Search Console reports. Make intentional decisions about AI crawlers while you’re at it, and the site that Google crawls will match the site you want ranked.