OpenAI Respects robots.txt — For Now

In August 2023, OpenAI announced GPTBot, a web crawler that would respect robots.txt. Site owners could block it, and OpenAI would honor that block. It was a peace offering to publishers who were increasingly vocal about AI companies training on their content without permission.

Anthropic followed with ClaudeBot. Google introduced Google-Extended, a robots.txt token that lets sites opt out of having their content train Google's AI models without affecting Search indexing. The message was clear: we're the good guys. We respect your wishes. Just update your robots.txt, and we'll stay away.
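Mechanically, each of these opt-outs is just a pair of lines in robots.txt. A minimal example that blocks all three of the tokens named above (Google-Extended uses the same syntax even though it is a control token rather than a crawler):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

`Disallow: /` covers the whole site; narrower paths can be listed instead if you only want to shield part of it.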

But here's the uncomfortable truth: this is a temporary arrangement. And everyone involved knows it.

The Economic Pressure

AI companies need data. Massive amounts of it. The quality of a language model is directly tied to the diversity and volume of its training data. GPT-4 was reportedly trained on trillions of tokens. GPT-5 will need even more.

As more sites block AI crawlers, the available training data shrinks. And as the available data shrinks, the pressure to ignore robots.txt grows. At some point, the competitive disadvantage of respecting opt-outs becomes too large to ignore.

Right now, OpenAI can afford to respect robots.txt because most sites haven't blocked them yet. But if 50% of the web opts out? 70%? At what threshold does respecting robots.txt become a strategic liability?

The Legal Ambiguity

Robots.txt is not a legal document. It's a convention: the Robots Exclusion Protocol dates to 1994 and was only formalized as a standard in 2022, as RFC 9309, and even that standard makes compliance voluntary. There's no law that requires crawlers to respect it. The only enforcement mechanism is reputation — and reputation only matters if you're operating in a market where reputation has value.

For OpenAI, Anthropic, and Google, reputation matters. They're Western companies subject to Western media scrutiny and Western regulatory pressure. Ignoring robots.txt would be a PR disaster.

But what about the next generation of AI companies? The ones launching in jurisdictions where Western PR doesn't matter? The ones that see AI as a strategic national priority and web scraping as a means to that end?

The companies that respect robots.txt today are the ones that can afford to. The companies building models tomorrow may not have that luxury.

The Licensing Deals

OpenAI knows the opt-out model is unsustainable. That's why they're signing licensing deals with publishers. Reddit, Stack Overflow, the Associated Press — all have agreements that give OpenAI access to their content in exchange for payment.

These deals accomplish two things. First, they secure high-quality training data that can't be blocked via robots.txt. Second, they establish a precedent: content has value, and AI companies should pay for it.

But licensing deals only work for large publishers with negotiating power. The long tail of the web — millions of blogs, forums, and niche sites — doesn't have the leverage to demand payment. For them, robots.txt is the only option.

The Training Data Shortage

We're approaching a point where the easily accessible, high-quality web content has been exhausted. The major AI labs have already ingested Common Crawl, Wikipedia, GitHub, and every major publication. What's left is either behind paywalls, blocked by robots.txt, or low-quality.

This is why AI companies are turning to synthetic data — using AI to generate training data for other AI. But synthetic data has limits. Models trained recursively on model-generated output tend to degrade, a failure mode known as model collapse, producing increasingly generic outputs.

The solution is either licensing deals or ignoring robots.txt. And for most of the web, licensing isn't an option.

The Regulatory Wildcard

The EU AI Act requires providers of general-purpose models to publish summaries of the content used for training and to honor machine-readable opt-outs under the EU's text-and-data-mining rules. California is considering similar legislation. If these laws are enforced, AI companies may be legally required to respect opt-outs.

But regulation moves slowly. The AI Act won't be fully enforced for years. And even when it is, enforcement will be uneven. A company operating from a jurisdiction with weak IP laws can ignore EU regulations with minimal consequences.

The real question is whether the US will regulate AI training data. If it does, the opt-out model might become legally binding. If it doesn't, robots.txt remains a gentleman's agreement — and gentlemen's agreements break down when the stakes get high enough.

What Happens Next

In the short term, OpenAI and the major labs will continue respecting robots.txt. It's good PR, it avoids lawsuits, and most sites haven't blocked them yet.

In the medium term, we'll see more licensing deals, more synthetic data, and more aggressive scraping from companies that don't care about Western norms. Bytespider (ByteDance) and PetalBot (Huawei) are just the beginning.
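For crawlers that ignore robots.txt, the only practical recourse is blocking at the server. A sketch for nginx, matching on the advertised User-Agent string (best-effort only: a crawler willing to ignore robots.txt may also spoof its User-Agent, and the agent names here are the commonly reported ones):

```
# Flag known AI scrapers by User-Agent; goes in the http context.
map $http_user_agent $is_ai_scraper {
    default        0;
    ~*Bytespider   1;
    ~*PetalBot     1;
}

server {
    listen 80;
    server_name example.com;

    # Refuse flagged agents outright.
    if ($is_ai_scraper) {
        return 403;
    }

    # ... normal site configuration ...
}
```

Unlike robots.txt, this doesn't ask for cooperation; it simply refuses the request.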

In the long term, one of two things happens: either regulation forces an opt-in model, where AI companies must get explicit permission before training on content, or the opt-out model collapses entirely, and robots.txt becomes meaningless.

Why You Should Block Anyway

Even if the opt-out model is temporary, blocking AI crawlers now is still worth it. It keeps your content out of the next generation of models from any company that honors the block. It sends a signal that you care about how your work is used. And it gives you leverage if licensing deals become the norm.
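If you do block, it's worth verifying that your rules parse the way you intend. A minimal sketch using Python's standard-library robotparser; the policy is inlined here for illustration, but you could point the parser at your live robots.txt URL instead:

```python
from urllib import robotparser

# Hypothetical policy: block GPTBot site-wide, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant GPTBot would be refused; an ordinary crawler would not.
for agent in ("GPTBot", "SomeSearchBot"):
    allowed = parser.can_fetch(agent, "https://example.com/any-page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

This only tells you what a *compliant* crawler would do — which, as the rest of this piece argues, is exactly the limit of robots.txt.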

OpenAI respects robots.txt today. But "for now" is doing a lot of work in that sentence. And the companies that will dominate AI in five years may not be the ones respecting norms today.

Block AI training bots before they scrape your content. Check your AI Readiness Score and get a robots.txt template that works.