Blocking GPTBot Doesn't Block ChatGPT
You add GPTBot to your robots.txt. You feel good about it. You've blocked OpenAI from training on your content. Problem solved, right?
Not quite. Because GPTBot is only one of OpenAI's crawlers. And blocking it doesn't prevent ChatGPT from accessing your site when a user asks it to search the web. Those are two different bots, two different use cases, and two different user agent strings.
Training vs. Inference
GPTBot is OpenAI's training crawler. It scrapes the web to build datasets for future models. When you block GPTBot, you're preventing your content from being used in the next version of GPT.
ChatGPT-User is OpenAI's browsing agent. It fetches web pages in real time when a ChatGPT user asks for current information or points the model at a link. When you block ChatGPT-User, you're preventing ChatGPT from fetching and citing your site in its responses.
These are separate bots with separate purposes. Blocking one doesn't block the other. And most site owners don't realize this until they see ChatGPT citing their content despite having blocked GPTBot.
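In robots.txt terms, the two have to be addressed separately, because each directive group matches one user agent string. A minimal sketch that blocks OpenAI's training crawler while leaving real-time fetches alone:

```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# ChatGPT-User has no group here, so real-time fetches
# on behalf of ChatGPT users remain allowed by default
```

If you want to block both, ChatGPT-User needs its own `User-agent` group with its own `Disallow` line; nothing you write under `GPTBot` applies to it.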
Why OpenAI Uses Two Bots
The distinction makes sense from OpenAI's perspective. Training is a bulk, offline operation that happens ahead of a model release. Inference is a real-time operation that happens when users make requests.
Training requires bulk access to large datasets. Inference requires targeted access to specific pages. The crawl patterns are different, the bandwidth requirements are different, and the ethical considerations are different.
From a site owner's perspective, though, the distinction is frustrating. You have to know about both bots, understand what each one does, and decide whether to block one, both, or neither.
Blocking AI training doesn't block AI inference. And most site owners don't realize they need to block both.
The Google Situation
Google has a similar split. Googlebot is the traditional search crawler. Google-Extended is the AI training control. Blocking Google-Extended prevents your content from being used to train Gemini (formerly Bard) and other Google AI models. But it doesn't prevent Googlebot from indexing your site for traditional search.
This is actually the right design. Site owners should be able to opt out of AI training while still participating in traditional search. The problem is that most site owners don't know Google-Extended exists.
They block Googlebot thinking they're blocking all Google access. Or they don't block anything because they want to stay in Google search, not realizing that means they're also opted into AI training.
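The opt-out Google actually intends looks like this, a sketch that drops out of AI training without touching search indexing:

```
# Opt out of Google AI training
User-agent: Google-Extended
Disallow: /

# Googlebot is deliberately not listed, so normal
# search indexing continues unaffected
```

Blocking Googlebot instead of Google-Extended removes you from search results entirely, which is almost never what the site owner wanted.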
Anthropic's Approach
Anthropic uses ClaudeBot for both training and inference. It's a single bot with a single user agent string. Block ClaudeBot, and you block all Claude access — training, search, and real-time queries.
This is simpler but less flexible. You can't opt out of training while allowing inference. It's all or nothing.
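Anthropic's single-bot model reduces to one rule, which is all-or-nothing by design:

```
# Blocks all Claude access: training, search, real-time queries
User-agent: ClaudeBot
Disallow: /
```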
From a site owner's perspective, this is easier to understand. From a product perspective, it's limiting. If Claude adds real-time web search, sites that blocked ClaudeBot for training reasons will also be excluded from search results.
The User Agent Confusion
The real problem is that there's no standard. Every AI company uses different user agent strings for different purposes. Some separate training and inference. Some don't. Some use descriptive names. Some use cryptic codes.
A site owner who wants to block AI training but allow AI search needs to know:

- GPTBot (training) vs. ChatGPT-User (search)
- Google-Extended (training) vs. Googlebot (search)
- ClaudeBot (both)
- And a dozen other bots from smaller AI companies
This is not a reasonable expectation. Most site owners don't have the time or expertise to research every AI bot, understand its purpose, and make informed blocking decisions.
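One low-effort starting point is to check which of these bots are already hitting your server. A rough sketch, assuming a combined-format access log (the sample log below is fabricated for illustration; point the grep at your real log file instead):

```shell
# Create a two-line sample log for demonstration purposes
cat > access.log <<'EOF'
1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
5.6.7.8 - - [01/Jan/2025:00:00:01 +0000] "GET /post HTTP/1.1" 200 456 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)"
EOF

# Count hits per known AI user agent (case-sensitive, so the
# lowercase URLs in the UA strings don't double-count)
grep -oE 'GPTBot|ChatGPT-User|Google-Extended|ClaudeBot' access.log | sort | uniq -c
```

This only catches bots that identify themselves honestly, but it tells you which documented crawlers you're actually dealing with before you start writing rules.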
What You Should Block
If you want to prevent AI training but allow AI search, block the training bots: GPTBot, Google-Extended, ClaudeBot (if you're okay losing Claude search), and the various dark crawlers, the ones that don't identify themselves honestly.
If you want to prevent all AI access, block everything: GPTBot, ChatGPT-User, Google-Extended, Googlebot (if you're okay losing Google search), ClaudeBot, and every other AI bot you can identify.
If you want to allow everything, don't block anything. But understand that "allowing everything" means your content will be used for training, inference, and whatever other purposes AI companies invent next.
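Putting the first policy together, a block-training-but-allow-search robots.txt might look like the sketch below. The bot names are the documented ones discussed above; new crawlers appear regularly, so any list like this goes stale:

```
# Training crawlers: blocked
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Search/inference bots (ChatGPT-User, Googlebot) are not
# listed, so they remain allowed by default
```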
The Naming Problem
The fact that "GPTBot" and "ChatGPT-User" sound similar but do different things is a design failure. Users assume that blocking GPTBot blocks all OpenAI access. The naming suggests that. The reality doesn't match.
A clearer naming scheme would be GPTBot-Training and GPTBot-Search. Or OpenAI-Training and OpenAI-Search. Something that makes the distinction obvious.
But we're stuck with the current names. And site owners are stuck trying to figure out which bot does what.
The Bigger Issue
The training vs. inference distinction highlights a deeper problem: AI companies control the rules. They decide how many bots to use, what to name them, and what each one does.
Site owners are left playing catch-up, trying to understand an ever-changing landscape of user agents, crawl patterns, and usage policies. And by the time you figure out which bots to block, new ones have already launched.
Blocking GPTBot doesn't block ChatGPT. And that's not a bug. It's a feature — for OpenAI. For site owners, it's just another layer of complexity in an already confusing system.
See exactly which AI bots are accessing your site with State of AI's Bot Analyzer — including the training bots, search bots, and everything in between.