// AI crawlers
GPTBot, ClaudeBot and the new robots.txt
The new generation of AI crawlers and how to handle them deliberately.
Every major AI provider now operates one or more named web crawlers. They all respect robots.txt, and a default 'User-agent: *' rule already covers them – but explicit per-bot rules are the cleanest way to control AI access, because they record exactly what you intend for each crawler.
The bots that matter right now
- GPTBot – OpenAI's training crawler.
- ChatGPT-User – fetches pages live when a ChatGPT user asks for them.
- OAI-SearchBot – powers ChatGPT's search experience.
- ClaudeBot – Anthropic's training crawler.
- Claude-Web / Claude-User – Anthropic's live retrieval agents.
- PerplexityBot – Perplexity's index crawler.
- Perplexity-User – Perplexity's live fetch on behalf of a user.
- Google-Extended – opt-out token for Google's generative training.
- Applebot-Extended – same idea for Apple Intelligence.
- CCBot – Common Crawl, used as a training source by many models.
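Notice the split in that list: some tokens exist to gather training data (GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, as the providers currently document them), while others fetch pages live on a user's behalf. That split means you can opt out of model training without vanishing from live AI answers. A sketch, naming only the training tokens:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

The live-retrieval agents (ChatGPT-User, Perplexity-User and friends) aren't named here, so they fall back to whatever your 'User-agent: *' group allows.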
A sensible default robots.txt
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Being explicit costs nothing and signals intent. If a bot ever changes default behaviour, your file still says what you meant.
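If you want to confirm the file says what you think it says, Python's standard-library robots.txt parser makes a quick sanity check – a minimal sketch, with yourdomain.com as a placeholder for your real host:

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt.
rp = RobotFileParser("https://yourdomain.com/robots.txt")
rp.read()

# Under the allow-all default above, every bot should print True.
for bot in ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]:
    print(bot, rp.can_fetch(bot, "https://yourdomain.com/"))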
When to block
Block when content is licensed, paywalled, or genuinely sensitive. Don't block out of vague unease – for almost every business, being invisible in AI answers is a worse outcome than being cited.
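When you do block, scope the rule to what actually needs protecting instead of the whole site. A sketch, assuming a paywalled area at /members/ (the path is a placeholder):

User-agent: GPTBot
Disallow: /members/

User-agent: ClaudeBot
Disallow: /members/

The rest of the site stays crawlable, so you keep the visibility while fencing off the content you can't give away.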
Common mistakes
- Blocking GPTBot because a checklist somewhere said to.
- Allowing crawling but serving JS-only content the bot can't render.
- Forgetting to add new bots as providers launch them.
- Returning 403 to AI bots from a misconfigured WAF (see the spot check below).
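That last one is easy to test from outside your own network. A minimal sketch using Python's standard library – the User-Agent string is a simplified version of GPTBot's published one, and a clean 200 only rules out UA-based blocking, not IP-range rules:

import urllib.error
import urllib.request

# Spot check: request your homepage the way GPTBot identifies itself
# and confirm the WAF doesn't turn it away. The domain is a placeholder.
req = urllib.request.Request(
    "https://yourdomain.com/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"},
)
try:
    with urllib.request.urlopen(req) as resp:
        print("Status:", resp.status)  # expect 200
except urllib.error.HTTPError as err:
    print("Blocked:", err.code)  # a 403 here points at the WAF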
Want the full playbook?
This article is the appetiser. The GEO course covers the same ground in depth – annotated examples, copy-paste templates, real audit walkthroughs, and a 90-day roadmap. Lifetime access, no upsells.
Or just get a heads-up at launch.
