How to Prevent Blacklisting When Scraping
- Last Edited April 19, 2026
- by Garenne Bigby
Web scraping — automatically collecting data from web pages — powers a huge slice of the modern internet. Search engines scrape to build their indexes. Price-comparison tools scrape retailers. Archivists scrape public records. And teams running their own content audits scrape their own sites to catalog what’s there.
But scraping only works if you don’t get blocked. Sites have every right to defend themselves against abusive traffic, and in 2026 their defenses are sophisticated: rate limits, bot fingerprinting, CAPTCHAs, and Cloudflare-style managed bot rules. The guidance below is about being a good citizen of the web — legally, ethically, and technically — so your legitimate scraper stays welcome.
Is Web Scraping Legal?
The short answer: scraping publicly accessible data is generally legal in the US, but the details matter a lot. A few recent cases shape the landscape:
- hiQ Labs v. LinkedIn — The Ninth Circuit ruled in 2019 (reaffirmed in April 2022) that scraping public profiles didn’t violate the Computer Fraud and Abuse Act, because the data wasn’t behind a login wall. The case ultimately settled in 2022, with hiQ paying $500,000 and destroying the scraped data — but on contract grounds, not CFAA.
- Meta v. Bright Data (January 2024) — A federal judge ruled that Meta’s terms of service bind only users who are logged in: Bright Data’s scraping of public, logged-out pages did not breach them and was not actionable.
- CNIL v. KASPR (2024) — France’s data protection authority fined KASPR €240,000 for scraping LinkedIn data without a GDPR lawful basis, even though the profile data was publicly visible. Public doesn’t mean unregulated when personal data is involved.
The pattern: scraping public data is generally lawful. Scraping behind a login wall, bypassing anti-bot controls, or processing EU residents’ personal data without a GDPR lawful basis moves you into risky territory fast. When in doubt, talk to a lawyer — and treat the site’s terms and robots.txt as the first line of compliance, not the last.
Use an API First
Before writing a single line of scraping code, check whether the site offers a public API. Most major sites — X, Reddit, YouTube, Google, LinkedIn (with restrictions), and thousands of retailers — expose structured APIs that are faster, more reliable, and explicitly authorized.
APIs usually require a key and have their own rate limits, but they eliminate the blocking, CAPTCHA, and layout-change headaches that plague scrapers. If an API exists for the data you need, use the API. Scrape only when no API exists or the API doesn’t expose what you need.
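As an illustration, here is a minimal sketch in Python using GitHub’s public REST API to pull structured repository metadata instead of scraping the repository page’s HTML. The endpoint is real; the User-Agent value is a placeholder you would replace with your own:

```python
import requests

# Fetch structured repo metadata from GitHub's public REST API
# rather than downloading and parsing the repository page's HTML.
resp = requests.get(
    "https://api.github.com/repos/python/cpython",
    headers={
        "Accept": "application/vnd.github+json",
        "User-Agent": "MyCompanyBot/1.0 (+https://example.com/bot)",  # placeholder
    },
    timeout=10,
)
resp.raise_for_status()
repo = resp.json()
print(repo["stargazers_count"], repo["updated_at"])
```

One authorized API call returns clean JSON that would otherwise take several page fetches and a fragile HTML parser to assemble.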
How Sites Detect Scrapers
Modern anti-bot systems use a stack of detection signals. A scraper that trips any one of them risks being throttled or blocked:
- Traffic patterns — too many requests from one IP in a short window, or requests with perfectly regular timing, stand out against human browsing patterns.
- Browser fingerprinting — sites compare TLS fingerprints, HTTP header order, JavaScript execution environment, Canvas rendering, WebGL, and installed fonts against known bot signatures.
- Honeypot links — invisible URLs only a naive crawler would follow. Following one is a tell.
- CAPTCHA challenges — hCaptcha, reCAPTCHA, and Cloudflare Turnstile interrupt the session to verify a human is at the wheel.
- Managed services — Cloudflare Bot Management, Akamai Bot Manager, and DataDome use ML models to score each request for bot likelihood. They’re now on most medium-to-large sites.
For a broader look at the crawler ecosystem and how different tools approach these challenges, see our overview of website crawlers for content monitoring.
How to Tell You’ve Been Blocked
Common signals that a site has decided your scraper is unwelcome:
- HTTP 429 (Too Many Requests) — You’re being rate-limited. Slow down.
- HTTP 403 (Forbidden) — The site is refusing to serve you.
- HTTP 503 (Service Unavailable) — Sometimes a real outage; sometimes a soft block.
- Suddenly receiving a CAPTCHA challenge on a page that previously loaded normally.
- Content replaced with a challenge page, or content that’s dramatically shortened or stripped of key data.
- Empty or redirected responses from pages that still have clear content in a normal browser.
Always log your HTTP response codes and set up alerts on a spike in non-200 responses. Catching a block early means you can slow down and fix the cause before the site escalates to a permanent ban.
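A minimal sketch of that monitoring in Python — the 100-request window and 20% threshold are illustrative assumptions, not recommendations from any particular tool:

```python
import logging
from collections import deque

log = logging.getLogger("scraper")
recent = deque(maxlen=100)  # rolling window of the last 100 status codes

def record_response(url: str, status: int) -> None:
    """Log every response code and warn when non-200s spike."""
    recent.append(status)
    log.info("%s -> %d", url, status)
    non_ok = sum(1 for code in recent if code != 200)
    if len(recent) == recent.maxlen and non_ok / len(recent) > 0.20:
        log.warning(
            "More than 20%% of the last %d responses were non-200 - "
            "possible block; slow down and investigate.", len(recent)
        )
```

Call record_response() after every fetch and route the warning to whatever alerting channel a human actually watches.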
Respect robots.txt
The robots.txt file at the root of any site (e.g., https://example.com/robots.txt) tells crawlers which paths are allowed and which aren’t. Ethical scrapers check it before every run. Here’s why it matters in 2026:
- It’s the site’s stated preference. Ignoring it is a bad-faith signal that courts and regulators take seriously — the French CNIL explicitly calls out robots.txt compliance as a factor in GDPR legitimate-interest assessments.
- It’s granular. Disallow: /private/ blocks one section; Crawl-Delay: 10 sets a minimum of 10 seconds between requests; Allow: re-enables specific subpaths.
- It names bots individually. User-agent: GPTBot applies only to OpenAI’s crawler; User-agent: * is the default for everyone else. If your scraper has its own User-Agent (it should), site owners can opt you in or out specifically.
Read robots.txt. Follow it. If the site has disallowed what you want, that’s the site’s decision — don’t work around it.
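Python’s standard library can do the check for you. A minimal sketch, where the target URL and User-Agent are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the file

agent = "MyCompanyBot/1.0"                  # your scraper's own name
url = "https://example.com/private/report"  # hypothetical target page

if rp.can_fetch(agent, url):
    delay = rp.crawl_delay(agent)  # Crawl-Delay for this agent, or None
    print(f"Allowed; minimum delay between requests: {delay or 'unspecified'}")
else:
    print("Disallowed by robots.txt - skip this URL.")
```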
Read the Terms of Service
Beyond robots.txt, the site’s Terms of Service spell out what you can and can’t do with its content. Key questions:
- Does the ToS explicitly prohibit scraping, automated access, or commercial use of the data?
- Is there an attribution requirement if you republish or derive products from the data?
- Are there specific rate limits or authorization requirements?
Terms are usually enforced as contracts, not criminal law — but that’s not reassurance. The Meta v. Bright Data ruling turned on whether Bright Data had agreed to Meta’s terms by logging in. If you create an account, accept cookies, or click through a paywall, you have almost certainly agreed to the site’s ToS. Honor it.
Identify Yourself With a Clear User-Agent
A polite scraper announces itself. Set a User-Agent header like:
MyCompanyBot/1.0 (+https://example.com/bot; bot@example.com)
Include your product name, a version, and a URL or email site owners can use to contact you. Transparent identification prevents blocks, misunderstandings, and legal escalation. Spoofing a browser User-Agent to hide that you’re a bot is the opposite of this advice — it’s a signal of bad intent, and anti-bot systems are better at catching spoofed agents than most scrapers think.
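With Python’s requests, a session applies the identifying headers to every call. A minimal sketch, reusing the placeholder names and addresses from above:

```python
import requests

session = requests.Session()
session.headers.update({
    # Product name, version, and a contact URL/email (placeholders).
    "User-Agent": "MyCompanyBot/1.0 (+https://example.com/bot; bot@example.com)",
    "From": "bot@example.com",  # standard HTTP contact header (RFC 9110)
})
resp = session.get("https://example.com/page", timeout=10)
```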
Rate-Limit and Back Off
Scraping at browser speed is a great way to get blocked. Start conservatively:
- Small sites: one request every 3 to 5 seconds.
- Large sites with robust infrastructure: 1-2 requests per second.
- Any site, any time: respect a Crawl-Delay directive in robots.txt if one is set.
When you hit a 429 or 503, back off. A common pattern is exponential backoff with jitter: wait 2 seconds, then 4, then 8, then 16, with a small random offset so parallel workers don’t retry in lockstep. Libraries like Python’s tenacity or Scrapy’s AutoThrottle middleware handle this automatically. Never run a scraper at full speed and “hope for the best.” That’s how small sites get taken down.
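If you would rather hand-roll it than pull in tenacity, a minimal sketch of exponential backoff with jitter might look like this; the retry cap and base delay are illustrative:

```python
import random
import time
import requests

def fetch_with_backoff(session: requests.Session, url: str,
                       max_retries: int = 5) -> requests.Response:
    """GET a URL, backing off exponentially (with jitter) on 429/503."""
    for attempt in range(max_retries):
        resp = session.get(url, timeout=10)
        if resp.status_code not in (429, 503):
            return resp
        # Honor Retry-After when present (this sketch assumes it's in
        # seconds); otherwise wait 2, 4, 8, 16... seconds plus jitter.
        retry_after = resp.headers.get("Retry-After")
        wait = float(retry_after) if retry_after else 2 ** (attempt + 1)
        time.sleep(wait + random.uniform(0, 1))
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")
```

The jitter matters more than it looks: without it, a fleet of parallel workers all retries at the same instant, which reads as a burst attack.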
Cache Aggressively
Every request you don’t need to make is a request that won’t get you blocked. Use HTTP caching headers — If-Modified-Since and If-None-Match (ETag) — to avoid re-downloading unchanged pages. Cache responses locally during development so you can iterate without hammering the source.
For scheduled scrapes, compare the site’s sitemap.xml lastmod timestamps against your last-known values and only fetch pages that have changed. That single change can cut request volume by 90% or more on content-heavy sites.
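A minimal sketch of a conditional request with requests (the URL is a placeholder); a 304 response means the page hasn’t changed and the body isn’t re-sent:

```python
import requests

session = requests.Session()

# First fetch: store the validators the server returns.
first = session.get("https://example.com/page", timeout=10)
validators = {}
if first.headers.get("ETag"):
    validators["If-None-Match"] = first.headers["ETag"]
if first.headers.get("Last-Modified"):
    validators["If-Modified-Since"] = first.headers["Last-Modified"]

# Later fetch: send the validators back.
later = session.get("https://example.com/page", headers=validators, timeout=10)
if later.status_code == 304:
    print("Unchanged - reuse the cached copy, nothing re-downloaded.")
```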
Handle Errors Gracefully
A scraper that crashes on a 500 error and retries immediately is a scraper about to be blocked. Build in:
- Exponential backoff on 429, 503, and connection errors
- Circuit breakers that pause the scraper for a longer period after repeated failures
- Alerting on sustained error rates so a human can investigate
- Graceful degradation — skip a bad URL, log it, and move on rather than retrying forever
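A minimal circuit-breaker sketch tying the list above together; the thresholds (five consecutive failures, a ten-minute pause) are illustrative assumptions:

```python
import time

class CircuitBreaker:
    """Pause all scraping after repeated consecutive failures."""

    def __init__(self, max_failures: int = 5, cooldown_s: float = 600.0):
        self.max_failures = max_failures  # failures before tripping
        self.cooldown_s = cooldown_s      # how long to stay paused
        self.failures = 0
        self.open_until = 0.0

    def allow_request(self) -> bool:
        return time.monotonic() >= self.open_until

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.open_until = time.monotonic() + self.cooldown_s
            self.failures = 0
```

In the main loop, call allow_request() before each fetch; while the breaker is open, skip the URL, log it, and move on.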
Proxies, VPNs, and IP Rotation
Rotating IP addresses through a proxy pool or VPN is a standard tool for large-scale scraping — but the ethics cut both ways.
Legitimate uses: distributing load so one site doesn’t see concentrated traffic from a single origin; testing geo-restricted content; avoiding shared hosting IPs that may already be blocklisted through no fault of yours. Residential-proxy providers like Bright Data and Oxylabs serve these markets commercially.
Abusive uses: rotating IPs specifically to evade a site’s rate limits or explicit block. If a site has said “no,” cycling through IPs to keep taking data is the scraping equivalent of showing up at the same front door in different disguises. It’s the behavior that pushed hiQ into settlement, and it’s what “anti-bot evasion” case law turns on.
Use proxies when you have a legitimate distribution reason. Don’t use them to break a site’s explicit wishes.
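Mechanically, routing traffic through a proxy is a one-line configuration. A sketch with requests, where the gateway address and credentials are placeholders for whatever your provider issues:

```python
import requests

# Placeholder gateway - substitute your provider's actual endpoint.
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}
resp = requests.get("https://example.com/page", proxies=proxies, timeout=10)
```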
Headless Browsers and Automation Tools
When a site’s content depends on JavaScript, a plain HTTP client can’t see it. A headless browser — a real browser running without a GUI — renders the page like Chrome or Firefox would, and then your script reads the DOM.
The modern tools:
- Playwright (Microsoft) — cross-browser, first-class Python/JavaScript/.NET support, the de facto standard in 2026
- Puppeteer (Google) — Chrome/Chromium only, JavaScript-native
- Chrome Headless and Firefox Headless — the browsers themselves, driven by the tools above
A common misconception: “Selenium” and “Python” are not headless browsers. Python is a language; Selenium is an automation toolkit that drives browsers. Tools like these control a headless browser; they don’t replace one. Selenium is still common for cross-browser testing but is no longer the best choice for scraping. Note that modern anti-bot systems detect stock headless-browser configurations easily — the point of a headless browser is to execute JavaScript, not to hide.
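A minimal Playwright sketch in Python (the target URL is a placeholder): render the page, let its JavaScript run, then read the resulting DOM.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(
        user_agent="MyCompanyBot/1.0 (+https://example.com/bot)"  # identify yourself
    )
    page.goto("https://example.com/", wait_until="networkidle")
    title = page.title()   # available only after rendering
    html = page.content()  # the full DOM, post-JavaScript
    browser.close()
```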
When You’ve Already Been Blocked
If you find your scraper blocked, stop. Don’t immediately rotate IPs and try again — that looks exactly like the behavior the site was trying to stop, and will often escalate a soft block to a permanent ban.
Instead:
- Diagnose. Check status codes; confirm it’s a block and not a layout change or site outage.
- Wait. A rate-limit block often clears in minutes to hours. Don’t retry during the window.
- Identify the trigger. Were you over rate? Following honeypots? Ignoring robots.txt?
- Fix the root cause. Slow down, identify your scraper properly, respect robots.txt.
- Contact the site owner if your use case is legitimate. Many sites will whitelist research or integration-building projects if you ask. The contact URL in your User-Agent is exactly what this is for.
Frequently Asked Questions
Is web scraping legal?
Scraping publicly accessible data is generally legal in the US per the Ninth Circuit’s hiQ v. LinkedIn ruling, but bypassing login walls, circumventing anti-bot controls, or scraping EU residents’ personal data without a GDPR lawful basis all move you into risky territory. The site’s Terms of Service and robots.txt are your first compliance checkpoints.
What’s the difference between ethical and abusive scraping?
Ethical scraping respects robots.txt, uses a clear and contactable User-Agent, rate-limits itself, backs off on errors, prefers APIs where available, and accepts “no” when a site says no. Abusive scraping ignores these signals and tries to evade detection instead of complying. Anti-bot vendors, regulators, and courts are getting better at telling the difference.
Do I have to follow robots.txt?
Legally, robots.txt is not always binding on its own. Practically, ignoring it is a strong negative signal that will be held against you in any dispute — the French CNIL explicitly cites robots.txt compliance in GDPR enforcement. Ethically, it’s the site’s stated preference. Follow it.
How do I stop AI crawlers like GPTBot and ClaudeBot from scraping my site?
Block them in your robots.txt by User-Agent name:
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
OpenAI, Anthropic, Google, and Perplexity have all committed to respecting robots.txt for their training crawlers. Anthropic differentiates ClaudeBot (training), Claude-SearchBot (search retrieval), and Claude-User (direct user queries) — each has a separate User-Agent you can allow or block individually.
Bottom Line
The fastest way to get blocked when scraping is to act like someone trying not to be noticed. The fastest way to stay welcome is the opposite: show up transparently, check robots.txt, identify yourself, rate-limit, respect the site’s terms, and use an API if one exists. These aren’t workarounds to “prevent blacklisting” — skipping them is exactly the behavior blacklisting exists to punish, and it creates the very problem you’re trying to avoid.
Scrape politely. Take the law seriously. And when a site says no, let it mean no. The internet works better when both sides play fair.