AI Opt-Out Controls
Robots.txt and ai.txt templates to limit AI crawling, training, and referencing. Server-side controls for managing how autonomous agents interact with your content, including Nginx/WAF patterns, monitoring, and verification strategies.
01
Blocking Common AI User-Agents
The first line of defence is robots.txt. But the landscape of AI crawlers is fragmented and evolving - GPTBot, ClaudeBot, Bytespider, Google-Extended, and dozens more. A comprehensive blocking strategy requires ongoing maintenance.
Maintain a living robots.txt that blocks known AI training crawlers by user-agent string - GPTBot (OpenAI), ClaudeBot/anthropic-ai (Anthropic), Bytespider (ByteDance), CCBot (Common Crawl), and Google-Extended.
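As a starting point, the user-agent strings named above can be blocked like this — a sketch only, so verify current strings against each vendor's documentation before deploying:

```
# robots.txt — blocks known AI training crawlers (list requires upkeep)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that robots.txt is advisory: it signals intent but enforces nothing, which is why the server-level controls below matter.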
Distinguish between AI training crawlers and AI retrieval agents - blocking GPTBot prevents training on your content but may also block ChatGPT's browsing feature; decide whether you want to block training, retrieval, or both.
Implement user-agent detection at the server level (Nginx, Apache, or CDN edge rules) as a backup to robots.txt - not all crawlers respect robots.txt, and server-level blocking provides enforcement.
Monitor your access logs for unidentified high-frequency crawlers that may be AI training bots operating without declared user-agent strings - the absence of identification is itself a signal.
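One way to surface undeclared crawlers is to scan access logs for clients that combine high request volume with a generic or absent user-agent. The sketch below parses Combined Log Format lines; the threshold and the list of "generic" user-agent substrings are illustrative, not exhaustive:

```python
import re
from collections import Counter

# Combined Log Format:
# IP ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def is_generic(ua):
    """User-agents that identify a tool rather than a declared crawler.
    The substrings below are illustrative, not exhaustive."""
    if ua in ("", "-"):
        return True
    return any(tok in ua for tok in ("python-requests", "Go-http-client", "curl", "aiohttp"))

def flag_suspects(lines, threshold=100):
    """Return (ip, user_agent, count) for clients that exceed `threshold`
    requests in the log window while sending a generic/absent user-agent."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            counts[m.groups()] += 1
    return [
        (ip, ua, n)
        for (ip, ua), n in counts.items()
        if n > threshold and is_generic(ua)
    ]
```

Run this over a rolling window (e.g. the last hour of logs) and feed the flagged pairs into your blocklist review rather than blocking automatically — legitimate tooling can trip the same heuristic.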
Review and update your blocked user-agent list monthly - new AI crawlers appear regularly, and the user-agent strings used by existing crawlers change without notice.
02
ai.txt Policy Implementation
ai.txt is an emerging standard (proposed by Spawning.ai) that provides granular control over how AI systems may use your content - beyond the binary allow/disallow of robots.txt. It specifies training, scraping, and referencing permissions separately.
Create an ai.txt file at your domain root that declares separate policies for training (using content to train models), scraping (extracting content for retrieval), and referencing (citing content in responses).
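The directive names in the sketch below are hypothetical, mirroring the training/scraping/referencing split described above — the ai.txt format is still a draft, so confirm the current syntax against the Spawning.ai proposal before deploying:

```
# ai.txt — hypothetical directive names, for illustration only
User-Agent: *
Disallow-Training: /frameworks/
Disallow-Training: /premium/
Allow-Scraping: /research/
Allow-Referencing: /research/

# Human-readable statement:
# AI systems may reference our public research with attribution.
# Training on proprietary frameworks and premium content is not permitted.
```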
Specify per-directory or per-content-type policies - you may want to allow AI referencing of your public research while blocking training on your proprietary frameworks or premium content.
Include a human-readable policy statement alongside the machine-readable directives - ai.txt serves both as a technical control and as a public declaration of your AI content governance position.
Link your ai.txt to your Terms of Service and Copyright notices - the legal enforceability of ai.txt is still evolving, and explicit cross-referencing strengthens your position.
Implement TDMRep (Text and Data Mining Reservation Protocol) headers alongside ai.txt for EU compliance - the EU DSM Directive recognises machine-readable opt-out signals for text and data mining.
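TDMRep's HTTP-header mechanism can sit alongside ai.txt with two response headers — a reservation flag and a pointer to a machine-readable policy. A minimal Nginx sketch (the policy URL is a placeholder):

```nginx
# server{} or http{} context: assert a text-and-data-mining rights
# reservation on every response, per the TDMRep protocol.
add_header tdm-reservation 1 always;
add_header tdm-policy "https://example.com/policies/tdm-policy.json" always;
```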
03
Server-Side Control Patterns
Nginx, Apache, WAF rules, and CDN edge logic provide enforcement-level control over AI agent access. These patterns go beyond advisory signals to actively manage how agents interact with your infrastructure.
Implement rate limiting per user-agent class at the Nginx or CDN level - AI crawlers can generate 10-100x the request volume of human traffic, and unthrottled crawling degrades performance for human users.
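A common Nginx pattern is to map user-agents into classes and key the rate limit on that class, so only crawler traffic is throttled. The crawler names are illustrative and need the same upkeep as robots.txt; an empty key exempts a request from the limit, which is how human traffic passes through:

```nginx
# http{} context. AI crawlers get a per-IP key; everyone else gets an
# empty key, which nginx exempts from the limit.
map $http_user_agent $ai_crawler_key {
    default                                    "";
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider)"    $binary_remote_addr;
}

limit_req_zone $ai_crawler_key zone=ai_crawlers:10m rate=1r/s;

server {
    location / {
        limit_req zone=ai_crawlers burst=5 nodelay;
        # ... normal content handling ...
    }
}
```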
Use Cloudflare, AWS WAF, or equivalent rules to challenge or block requests from known AI crawler IP ranges - this provides enforcement even when crawlers ignore robots.txt directives.
Implement conditional content serving that returns different responses to AI agents - you can serve a summary with a link to the full content, rather than the full content itself, to control what agents ingest.
Add X-Robots-Tag HTTP headers to API responses and non-HTML content (PDFs, images, JSON) - robots.txt only governs crawlable URLs, while X-Robots-Tag controls indexing of content served through any endpoint.
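In Nginx, X-Robots-Tag can be attached to non-HTML responses by matching on extension. Note that "noai" is a proposed directive with limited adoption, while "noindex" is widely supported — this sketch sends both:

```nginx
# Attach indexing directives to non-HTML assets that robots.txt
# cannot reach once the URL is discovered.
location ~* \.(pdf|jpe?g|png|json)$ {
    add_header X-Robots-Tag "noindex, noai" always;
}
```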
Build an agent access dashboard that visualises AI crawler traffic patterns, blocked requests, and policy compliance - visibility into agent behaviour is the foundation of effective governance.
04
Monitoring and Verification
Opt-out controls are only effective if you can verify compliance. Monitoring strategies for detecting unauthorised AI crawling, measuring policy effectiveness, and maintaining ongoing governance.
Implement canary tokens - unique text strings embedded in your content that, if they surface in AI model outputs, provide strong evidence that your content was ingested despite opt-out signals.
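A canary token only works if it is globally unique, so generation should use a cryptographic source rather than a counter. A minimal sketch (the naming format is illustrative):

```python
import secrets

def make_canary(domain, page_id):
    """Generate a unique canary string to embed in a page's body text.
    If this exact string later surfaces in a model's output, the page
    was very likely ingested. Format is illustrative."""
    token = secrets.token_hex(8)  # 16 hex chars; collisions are negligible
    return f"{domain}-canary-{page_id}-{token}"

canary = make_canary("example.com", "pricing")
# Embed `canary` in rarely-quoted body text, record it with the page ID
# and date, then periodically search AI outputs and public training
# datasets for it.
```

Store the token-to-page mapping when you generate it — a canary you cannot trace back to a specific page and date proves little.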
Monitor AI search engines (Perplexity, ChatGPT Browse, Google AI Overviews) for citations of your content - if your content appears in AI-generated answers despite blocking, your controls have gaps.
Set up automated alerts for unusual traffic patterns that match AI crawling signatures - high-frequency sequential page requests, systematic URL traversal, and API endpoint enumeration.
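Systematic URL traversal is one of the easier signatures to detect: a client walking numbered URLs in order. The heuristic below flags a long consecutive run of incrementing path numbers; the run-length threshold is illustrative and should be tuned against your own traffic:

```python
import re

def looks_like_enumeration(paths, min_run=10):
    """Heuristic: flag a client whose requests walk numbered URLs in
    sequence (/item/1, /item/2, ...), a common crawler signature."""
    nums = []
    for p in paths:
        m = re.search(r"(\d+)/?$", p)  # trailing number, if any
        nums.append(int(m.group(1)) if m else None)
    run = best = 0
    prev = None
    for n in nums:
        if n is not None and prev is not None and n == prev + 1:
            run += 1          # extend the consecutive run
        else:
            run = 1 if n is not None else 0
        best = max(best, run)
        prev = n
    return best >= min_run
```

Combine this with the per-client request counts from your log monitoring so a single alert carries both volume and pattern evidence.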
Conduct quarterly audits of your opt-out controls by testing against the latest AI crawler user-agents and IP ranges - the landscape shifts faster than most organisations update their controls.
Document your AI content governance decisions and their rationale - as regulation evolves (EU AI Act, UK AI Safety Institute guidance), you will need to demonstrate that your controls are intentional and maintained.
Related Reading
Go Deeper
Explore the essays and frameworks that underpin this guide.
Observatory Essays
The Consent Horizon
The boundary between permitted and prohibited agent access - consent in the agentic web.
Know Your Agent
Identifying which agents are accessing your content and governing their behaviour.
Autonomous Integrity
Maintaining content integrity when autonomous systems interact with your resources.