The Dark Bots You've Never Heard Of
You know about GPTBot. You've heard of ClaudeBot. Maybe you've even blocked Google-Extended in your robots.txt. But while you were focused on the bots that announce themselves, a different class of crawlers has been quietly harvesting your content at scale.
They don't respect robots.txt. They don't honor rate limits. They rotate user agents to evade detection. And they're operated by some of the largest tech companies in the world — just not the ones you're thinking of.
ByteSpider: The Aggressive Giant
ByteSpider is ByteDance's web crawler. The same company that runs TikTok operates one of the most aggressive scraping operations on the internet. It crawls at volumes that rival Googlebot, but unlike Google, ByteSpider has been repeatedly caught ignoring robots.txt directives.
The official purpose is search indexing for ByteDance's search products in China. The actual use? AI training for recommendation algorithms, content generation models, and whatever else ByteDance is building behind closed doors. The company doesn't publish training data sources. It doesn't offer opt-out mechanisms beyond robots.txt — which it frequently ignores.
Site owners have reported ByteSpider consuming hundreds of gigabytes of bandwidth, hitting servers thousands of times per minute, and rotating through IP ranges to evade blocks. When confronted, ByteDance's response is usually silence.
PetalBot: Huawei's Silent Scraper
PetalBot is Huawei's search crawler, ostensibly for Petal Search, the company's Google alternative. But Petal Search has minimal market share outside China, which raises the question: why is PetalBot crawling the entire English-language web?
The answer, according to researchers who've analyzed its behavior, is AI training. PetalBot's crawl patterns don't match typical search indexing. It targets content-heavy sites — blogs, forums, documentation, tutorials — and downloads entire archives in single sessions.
Like ByteSpider, PetalBot has been caught ignoring robots.txt. Unlike ByteSpider, it's also been observed spoofing user agents, pretending to be Googlebot or other legitimate crawlers to bypass blocks.
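Spoofed user agents can be unmasked, because a bot can claim any name it likes but it cannot change DNS records it doesn't own. Forward-confirmed reverse DNS is the standard check Google itself recommends for verifying Googlebot: reverse-resolve the client IP, confirm the hostname belongs to Google, then resolve that hostname forward and confirm it points back at the same IP. A minimal sketch in Python (the helper names are ours, and the domain list is an assumption you should check against Google's published crawler documentation):

```python
import socket

# Domains that genuine Google crawlers reverse-resolve to.
# Assumed here for illustration; verify against Google's own docs.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")

def is_google_domain(hostname: str) -> bool:
    """Pure check: does a reverse-DNS hostname sit under a Google domain?"""
    return hostname.endswith(GOOGLE_CRAWLER_DOMAINS)

def verify_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS. A spoofer can send Googlebot's
    user agent string, but it can't make Google's DNS zone point at
    its own IP address."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
        if not is_google_domain(hostname):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False
```

Any request advertising a Googlebot user agent that fails `verify_googlebot` is, by definition, something pretending to be Googlebot.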
The SEO Tool Crawlers
Then there are the bots that aren't technically "dark" but behave like they are. AhrefsBot and SemrushBot are legitimate SEO tools, but they crawl at such aggressive rates that many site owners treat them as hostile.
These bots aren't training AI models — they're building backlink databases and competitive intelligence tools. But the impact on your server is the same: high bandwidth consumption, increased load, and no benefit to you unless you're a paying customer of those services.
The difference is that Ahrefs and Semrush do respect robots.txt. If you block them, they stop. The dark bots don't.
Why Robots.txt Doesn't Work
The robots.txt protocol is a gentleman's agreement. It has no enforcement mechanism. A bot that chooses to ignore it faces no technical barrier — just the risk of bad PR if it gets caught.
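Expressing the block is trivial; compliance is the entire problem. A robots.txt like the following (using the user agent tokens these crawlers advertise, though you should confirm the exact tokens against each vendor's documentation) is a request, not a barrier:

```
User-agent: Bytespider
Disallow: /

User-agent: PetalBot
Disallow: /
```

A well-behaved crawler reads this and leaves. A dark bot reads it, notes which paths you consider valuable, and keeps going.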
For Western AI companies, that risk matters. OpenAI, Anthropic, and Google have reputations to protect. They respect robots.txt because the backlash from ignoring it would be worse than the data they'd gain.
For companies operating primarily in markets where Western PR doesn't matter, that calculus is different. ByteSpider and PetalBot don't care if TechCrunch writes a critical article. Their users are in China. Their regulators are in China. And their data needs are enormous.
The bots that respect robots.txt are the ones that don't need to scrape aggressively. The ones that ignore it are the ones you actually need to block.
How to Actually Block Them
Blocking dark bots requires more than robots.txt. You need server-level controls: IP range blocks, rate limiting, user agent filtering, and behavioral analysis. If a bot is hitting your site 1,000 times per minute, robots.txt won't stop it — but a firewall will.
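At the web-server layer, user agent filtering and rate limiting can be combined in a few lines. A hedged nginx sketch (the bot patterns and rate numbers are illustrative placeholders, not tuned recommendations):

```nginx
# Refuse crawlers that identify themselves by user agent.
# This only catches honest bots; the rate limit below is the
# backstop for the ones that lie.
map $http_user_agent $blocked_bot {
    default        0;
    ~*bytespider   1;
    ~*petalbot     1;
}

# Budget roughly one request per second per client IP.
limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

server {
    location / {
        if ($blocked_bot) {
            return 403;
        }
        limit_req zone=perip burst=20 nodelay;
        # ... rest of your configuration
    }
}
```

The `map` block belongs in the `http` context; a request matching a blocked pattern gets a 403 before it touches your application, and everything else is throttled regardless of what it claims to be.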
The challenge is that these bots rotate IPs. ByteSpider operates from hundreds of IP ranges across multiple countries. Blocking one range just shifts the traffic to another. The only reliable defense is a Web Application Firewall (WAF) with bot detection capabilities — which most small sites don't have.
The Bigger Problem
The existence of dark bots exposes the fundamental flaw in the current AI training model: it assumes good faith. It assumes that companies will respect opt-out signals, honor rate limits, and operate transparently.
But when the incentive is a trillion-dollar AI market, and the penalty for bad behavior is a few angry blog posts, good faith isn't enough. The bots that follow the rules are the ones that don't need to break them. The ones that break the rules are the ones building the next generation of AI models — and they're doing it with your content.
See which bots are accessing your site with State of AI's Bot Analyzer — including the dark crawlers that don't announce themselves.