robots.txt Is No Longer Optional

For most of the web's history, robots.txt was a polite suggestion. A file that lived at the root of your domain, telling well-behaved search crawlers where they could and couldn't go. If you didn't have one, Google would index everything anyway. No harm, no foul.

That era is over. In the age of AI training, robots.txt has become the only line of defense between your content and the largest data harvesting operation in human history. And if you don't have one, you've already lost.

The Consent Problem

When Google crawls your site, it's building an index. When GPTBot crawls your site, it's building a model. The difference matters. One creates a reference to your content. The other absorbs your content into a neural network that will generate derivative works forever, with no attribution, no traffic back to you, and no way to opt out after the fact.

The AI companies know this is ethically murky. That's why every major AI lab now publishes a user agent string and claims to respect robots.txt. OpenAI has GPTBot. Anthropic has ClaudeBot. Google has Google-Extended. They're all blockable — in theory.
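Blocking any one of them takes two lines. A minimal sketch, using the user agent string OpenAI publishes for its training crawler:

    User-agent: GPTBot
    Disallow: /

Add an equivalent stanza for each bot you want to turn away; a fuller template appears later in this piece.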

The problem isn't that AI companies ignore robots.txt. The problem is that most websites don't have one that says anything about AI.

Silence Is Consent

Here's the default assumption in AI training: if your robots.txt doesn't explicitly block a bot, that bot assumes it has permission. No robots.txt file? Full access. A robots.txt that only blocks ancient spam crawlers from 2008? Full access.
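To see what "full access" looks like in practice, here is the kind of legacy file many sites still serve (EmailSiphon, a spam harvester from the early web, stands in for any obsolete entry):

    # Blocks a crawler nobody has seen in years:
    User-agent: EmailSiphon
    Disallow: /

    # Says nothing about GPTBot, ClaudeBot, or anyone newer.
    # To a training bot, this file reads as permission.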

This isn't a bug. It's the design. The AI industry adopted the same "opt-out" model that email marketing used in the 1990s — assume permission until explicitly told otherwise. And just like email marketing, the burden is entirely on you to know which bots exist and to block them by name.

The User Agent Arms Race

Blocking AI crawlers isn't as simple as adding a single line. Every AI lab uses a different user agent, and some use several. OpenAI runs separate bots for training (GPTBot), for search (OAI-SearchBot), and for user-initiated browsing (ChatGPT-User). Google crawls everything with Googlebot but honors a separate robots.txt token, Google-Extended, that controls whether your content is used to train its AI models.

If you want to block AI training but allow search indexing, you need to know the difference. And you need to keep that list updated as new models launch, new labs emerge, and existing bots rebrand.
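Concretely, the split looks like this, assuming the user agent strings above are still current:

    # Keep search indexing:
    User-agent: Googlebot
    Allow: /

    # Refuse AI training:
    User-agent: Google-Extended
    Disallow: /

    User-agent: GPTBot
    Disallow: /

The Googlebot stanza is technically redundant, since no rule already means full access, but spelling out the intent makes the file easier to audit as the list grows.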

The Dark Crawler Problem

Then there are the bots that don't respect robots.txt at all. ByteSpider, operated by ByteDance (TikTok's parent company), is one of the most aggressive. It crawls at scale, ignores rate limits, and has been caught rotating user agents to evade blocks. PetalBot, from Huawei, behaves similarly.

These aren't rogue operations. They're crawlers run by billion-dollar companies that have decided the rules don't apply to them. Against them, robots.txt is just a text file they skim past. Your only real defense is blocking at the server or CDN level, by user agent, IP range, or firewall rule, and that requires access most site owners don't have.
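If you do control your server, user agent filtering at that layer at least raises the cost. A sketch for nginx (the substring patterns are assumptions based on the names these crawlers currently report; a bot that rotates its user agent will slip past, which is why IP-range or WAF rules are the sturdier option):

    # Hypothetical drop-in: /etc/nginx/conf.d/block-ai-crawlers.conf
    # Flag requests whose User-Agent matches a known dark crawler.
    # ~* makes the regex match case-insensitive.
    map $http_user_agent $dark_crawler {
        default        0;
        ~*bytespider   1;   # ByteDance
        ~*petalbot     1;   # Huawei
    }

    server {
        listen 80;
        server_name example.com;
        root /var/www/html;

        # Refuse flagged crawlers before serving any content.
        if ($dark_crawler) {
            return 403;
        }
    }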

What a Modern robots.txt Looks Like

A minimal AI-aware robots.txt in 2025 blocks training bots while allowing search crawlers. That means explicitly naming GPTBot, ClaudeBot, Google-Extended, CCBot (Common Crawl, whose archives feed many training datasets), Omgilibot, and a dozen others. It means re-checking that list every quarter as new models launch.
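A starting template, assuming the bot names published by each operator at the time of writing (treat it as a snapshot to verify against current documentation, not a canonical list):

    # AI training crawlers: blocked.
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Omgilibot
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    # Search crawlers: no entry needed. Googlebot, Bingbot, and
    # the rest keep full access because nothing here names them.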

It also means understanding that blocking these bots only prevents future training. If your content was already scraped, it's already in the dataset. The model has already learned from it. You can't unring that bell.

The Bigger Question

The real issue isn't technical. It's philosophical. Should the default be opt-in or opt-out? Should AI companies need explicit permission to train on your content, or should you need to explicitly block them?

The AI industry has chosen opt-out. And until regulation forces a different model, robots.txt is the only tool you have. Which means it's no longer optional.

Check your website's AI crawler exposure with State of AI's Readiness Checker — see which bots have access to your content and get a robots.txt template that actually works.