Your robots.txt file might be costing you AI visibility, and you would never know it.
A Hostinger analysis of 66.7 billion web crawler requests revealed a striking pattern: websites are increasingly welcoming AI search bots while blocking AI training crawlers, and the gap is widening dramatically. OAI-SearchBot (OpenAI’s search crawler) now has 55.67% website coverage. Meanwhile, GPTBot (OpenAI’s training crawler) dropped from 84% coverage to just 12%.
This is the new robots.txt reality. AI companies have split their crawlers into separate user agents for search and training, and your robots.txt needs to reflect that distinction. Block the wrong bot and you disappear from ChatGPT’s search results. Allow the wrong bot and your content feeds someone’s training data for free.
Here is the complete playbook for getting it right in 2026.
The Bot Landscape: Search vs. Training
The single most important concept in AI-era robots.txt management is the separation between search bots and training bots. Every major AI company now operates multiple crawlers, each with a distinct purpose.
OpenAI’s Bot Family
| Bot Name | Purpose | Recommendation |
|---|---|---|
| OAI-SearchBot | Powers ChatGPT search results | Allow |
| ChatGPT-User | Fetches pages when users share URLs in chat | Allow |
| GPTBot | Crawls content for model training | Block (unless you want to contribute training data) |
This is the split that matters most. OAI-SearchBot is how your content appears when someone asks ChatGPT a question. GPTBot is how OpenAI collects training data for future models. Blocking GPTBot has zero impact on your ChatGPT search visibility. Blocking OAI-SearchBot removes you from ChatGPT search entirely.
Anthropic’s Bot Family
Anthropic clarified its crawler documentation in February 2026, making the distinction explicit. According to Search Engine Land:
| Bot Name | Purpose | Recommendation |
|---|---|---|
| ClaudeBot | Powers Claude’s web search and retrieval | Allow |
| ClaudeBot-Training | Crawls content for model training | Block (optional) |
The naming convention here is cleaner than OpenAI’s. ClaudeBot-Training is clearly labeled. If you block ClaudeBot, Claude cannot cite your content in responses.
Google’s Bot Family
| Bot Name | Purpose | Recommendation |
|---|---|---|
| Googlebot | Standard search indexing | Allow |
| Google-Extended | Gemini training and AI features | Block (if you do not want training contribution) |
| GoogleOther | Supplementary crawling for various Google products | Allow |
Google-Extended was introduced specifically to give publishers control over AI training contribution without affecting their Google Search visibility. Blocking it has no impact on your organic Google rankings or AI Overview appearances.
Other AI Bots
| Bot Name | Platform | Purpose | Recommendation |
|---|---|---|---|
| PerplexityBot | Perplexity | Search and retrieval | Allow |
| Bingbot | Microsoft/Copilot | Search indexing + Copilot | Allow |
| Applebot | Apple/Siri | Search and AI features | Allow |
| Applebot-Extended | Apple | AI training | Block (optional) |
| CCBot | Common Crawl | Training data | Block |
| Bytespider | TikTok/ByteDance | Training data | Block |

The Recommended Robots.txt Template
Based on the current bot landscape and the principle of maximizing AI search visibility while minimizing unwanted training data contribution, here is the recommended robots.txt configuration for 2026:
```
# ================================
# AI Search Bots - ALLOW
# These power search results in AI platforms
# ================================

# OpenAI ChatGPT Search
User-agent: OAI-SearchBot
Allow: /

# ChatGPT User URL fetching
User-agent: ChatGPT-User
Allow: /

# Perplexity Search
User-agent: PerplexityBot
Allow: /

# Anthropic Claude Search
User-agent: ClaudeBot
Allow: /

# Microsoft Bing / Copilot
User-agent: Bingbot
Allow: /

# Apple / Siri
User-agent: Applebot
Allow: /

# Google supplementary
User-agent: GoogleOther
Allow: /

# ================================
# AI Training Bots - BLOCK
# These collect data for model training
# ================================

# OpenAI Training
User-agent: GPTBot
Disallow: /

# Google Gemini Training
User-agent: Google-Extended
Disallow: /

# Apple AI Training
User-agent: Applebot-Extended
Disallow: /

# Anthropic Training
User-agent: ClaudeBot-Training
Disallow: /

# Common Crawl (training data)
User-agent: CCBot
Disallow: /

# ByteDance / TikTok
User-agent: Bytespider
Disallow: /

# ================================
# Standard Search Engines - ALLOW
# ================================
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Why grouping matters more than order
Compliant robots.txt parsers do not read the file top to bottom and stop at the first rule. Under RFC 9309, each crawler obeys only the group whose User-agent line most specifically matches its own token, falling back to the `User-agent: *` group when no named group matches; within a group, the longest matching path rule wins. The practical consequence: a blanket `User-agent: *` rule never applies to a bot that has its own named group, so every crawler you want to treat differently, allowed or blocked, needs its own explicit block, as in the template above. The labeled sections are for human maintainability; parsers do not care about the order.
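You can sanity-check a draft configuration before deploying it with Python's built-in `urllib.robotparser`. One caveat: the stdlib parser does not support wildcards in paths and matches rules in file order rather than by longest path, but for simple allow/block groups like the template above it gives the right answer. A minimal sketch against a trimmed version of the template:

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the recommended template: search bots allowed,
# training bots blocked, everyone else allowed by default.
ROBOTS = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

url = "https://yourdomain.com/blog/some-post"
for bot in ["OAI-SearchBot", "GPTBot", "Google-Extended", "SomeOtherBot"]:
    verdict = "allowed" if parser.can_fetch(bot, url) else "blocked"
    print(f"{bot}: {verdict}")
```

Running this confirms that OAI-SearchBot stays allowed while GPTBot and Google-Extended are blocked, and that unknown bots fall through to the `User-agent: *` group.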
The Cloudflare Complication
If you use Cloudflare (as over 30% of websites do), there is an additional layer to manage: Cloudflare’s AI Crawl Control feature, managed from the dashboard, can inject additional robots.txt directives at the edge.
According to Cloudflare’s documentation, when managed robots.txt is enabled, Cloudflare may add or modify directives before they reach the crawler. This means the robots.txt you see on your server might not match what bots actually receive.
Action required: If you use Cloudflare’s managed robots.txt feature, always verify the live version by fetching https://yourdomain.com/robots.txt from an external network (not from within Cloudflare’s dashboard). The NytroSEO guide recommends checking this monthly, especially after Cloudflare updates.
Also check Cloudflare’s AI Bot Management settings in your dashboard. Cloudflare offers one-click toggles for AI training bots that can override your robots.txt configuration at the WAF level.
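A quick way to catch edge-injected directives is to diff the robots.txt on your origin server against what the world actually sees. A minimal sketch; in practice you would fetch the live copy over HTTP (e.g. with `urllib.request.urlopen`), but the example diffs two in-memory copies so the logic is testable:

```python
def injected_directives(origin_txt: str, live_txt: str) -> list[str]:
    """Return non-comment directive lines served at the edge that are absent from origin."""
    origin_lines = {line.strip() for line in origin_txt.splitlines()}
    return [
        line.strip()
        for line in live_txt.splitlines()
        if line.strip() and not line.startswith("#") and line.strip() not in origin_lines
    ]

# In practice: live = urlopen("https://yourdomain.com/robots.txt").read().decode()
origin = "User-agent: GPTBot\nDisallow: /\n"
live = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n"
print(injected_directives(origin, live))  # -> ['User-agent: CCBot']
```

Any output here means something between your origin and the visitor is rewriting the file; with Cloudflare, that is usually the managed robots.txt feature.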
The Nuanced Decision: Partial Access
The binary allow/block approach works for most sites. But some publishers benefit from a more granular strategy.
Allow search bots to specific sections only
If you want ChatGPT to cite your blog content but not crawl your product pages (where pricing or inventory might be sensitive):
```
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /resources/
Allow: /guides/
Disallow: /products/
Disallow: /pricing/
Disallow: /account/
```
This gives you AI search visibility for content marketing assets while protecting commercial pages from being directly quoted in AI responses.
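The same `urllib.robotparser` check works for path-level rules (paths here are the hypothetical ones from the snippet above). Note that the stdlib parser returns the first matching rule rather than the longest match, so listing Allow lines before Disallow lines, as above, is the safe ordering when testing with it:

```python
from urllib.robotparser import RobotFileParser

# Mirror of the partial-access rules above.
RULES = """\
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /resources/
Disallow: /pricing/
Disallow: /account/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("OAI-SearchBot", "https://yourdomain.com/blog/post"))    # allowed
print(rp.can_fetch("OAI-SearchBot", "https://yourdomain.com/pricing/pro"))  # blocked
```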
The blog-only strategy for e-commerce
E-commerce sites face a specific tension: they want AI engines to recommend their products but do not want product descriptions scraped verbatim. A middle ground:
```
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /guides/
Allow: /reviews/
Allow: /collections/
Disallow: /products/*/variants
Disallow: /cart/
Disallow: /checkout/
```
This allows AI engines to find and cite your content marketing and category pages while blocking granular product variant pages that change frequently.
Time-delayed access
Some publishers use CDN-level rules to delay AI bot access to new content by 24-48 hours, ensuring that paying subscribers or email list members see content first before it becomes available to AI engines. This is not configurable via robots.txt alone; it requires server-side or CDN-level logic.
The Data on What Happens When You Get It Wrong
Blocking AI search bots has measurable consequences. Here is what the data shows:
Sites that block OAI-SearchBot lose their ChatGPT search citations entirely, which is expected. The downstream effects are less obvious: they also lose the visitors who would have discovered them through ChatGPT and then visited the site, linked to it, or shared the content.
Sites that block all AI bots indiscriminately (often via a blanket User-agent: * Disallow with a named exception only for Googlebot) lose visibility across the entire AI engine ecosystem. Given the reported 527% growth in AI referral traffic, this is an increasingly costly mistake.
Sites that allow training bots contribute to the training data that makes AI engines better at answering questions about their domain. Some publishers view this as a net positive (more accurate AI responses about their brand) while others view it as giving away intellectual property.
The Hostinger data tells us the market consensus: allow search, block training. That 55.67% OAI-SearchBot coverage versus 12% GPTBot coverage represents the equilibrium most webmasters have reached.
llms.txt: The Complementary File
While robots.txt controls crawler access, llms.txt is a newer protocol that helps AI engines understand your site’s structure and priority content. Think of robots.txt as the bouncer (who gets in) and llms.txt as the concierge (where to go once inside).
Thousands of sites now implement llms.txt, although no major AI platform has officially committed to parsing it. The protocol lets you specify:
- Priority pages that should be cited first
- Site structure and content hierarchy
- Author expertise signals and credentials
- Content freshness indicators
A complete AI visibility strategy uses both files: robots.txt for access control, llms.txt for content guidance. They are complementary, not competing.
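For reference, a minimal llms.txt follows the proposed format: an H1 title, a blockquote summary, then H2 sections listing priority links with short descriptions. The paths below are placeholders:

```
# Your Company
> One-sentence summary of what the site covers and who it is for.

## Guides
- [Getting started](https://yourdomain.com/guides/getting-started): Intro for new users

## Blog
- [AI search explained](https://yourdomain.com/blog/ai-search): Our most-cited article
```

The file lives at the site root (https://yourdomain.com/llms.txt), alongside robots.txt.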
Testing and Verification
After updating your robots.txt, verify that it works correctly:
1. Google’s robots.txt report
Google Search Console includes a robots.txt report showing how Google fetched and parsed your file. It only validates Google’s own crawlers, but syntax errors it surfaces will typically affect every bot that reads the file.
2. Manual verification for AI bots
Fetch your live robots.txt and confirm each AI bot’s rules. Use a tool like curl:
```
curl -s https://yourdomain.com/robots.txt | grep -A 2 "OAI-SearchBot"
```
3. Monitor AI citations after changes
After updating robots.txt, monitor your AI visibility for 2-4 weeks. If you previously blocked AI search bots and now allow them, expect a gradual increase in AI citations as bots recrawl your content. The timeline varies by platform: ChatGPT’s crawl cadence differs from Perplexity’s, which differs from Claude’s.
Tools like iScore.ai track citation changes across multiple AI engines, making it straightforward to measure the impact of robots.txt changes on your AI visibility score.
4. Check Cloudflare or CDN overrides
If you use a CDN with bot management features, verify that CDN-level rules are not overriding your robots.txt directives. This is the most common source of “I updated robots.txt but nothing changed” issues.
The Crawl Budget Consideration
AI search bots create additional crawl load on your server. For large sites with millions of pages, this is a legitimate consideration.
Each AI platform crawls at different rates:
- OAI-SearchBot respects crawl-delay directives and typically crawls at a moderate pace
- PerplexityBot has been criticized for aggressive crawling in the past but has improved
- ClaudeBot generally maintains conservative crawl rates
If crawl budget is a concern, consider:
```
User-agent: OAI-SearchBot
Allow: /
Crawl-delay: 5

User-agent: PerplexityBot
Allow: /
Crawl-delay: 10
```
Note: Not all bots honor crawl-delay. It is a suggestion, not an enforcement mechanism. For strict rate limiting, use server-side rules.
The Strategic Framework
Here is the decision matrix for your robots.txt AI bot policy:
| Your Priority | Search Bots | Training Bots | llms.txt |
|---|---|---|---|
| Maximum AI visibility | Allow all | Allow all | Yes, comprehensive |
| Balanced (recommended) | Allow all | Block all | Yes, comprehensive |
| Protective | Allow selectively | Block all | Yes, limited |
| Maximum control | Block all | Block all | No |
Most businesses should use the “balanced” approach: maximum AI search visibility combined with training data protection. This is what the market data says most webmasters have already concluded.
The “maximum control” approach (blocking everything) is appropriate only for sites with genuinely proprietary content where any AI exposure creates business risk, such as premium research databases or subscription-only publications.
The Coming Standards Battle
The current robots.txt approach to AI bot management is a pragmatic hack, not a permanent solution. The protocol was designed in 1994 for a world with a handful of search engines. In 2026, there are dozens of AI bots with different purposes, and the list grows monthly.
Industry groups are working on more sophisticated standards: machine-readable licensing terms, granular content permissions, and compensation frameworks. The TDM (Text and Data Mining) Reservation Protocol from the W3C and the AI Data Alliance’s proposed framework are both in development.
Until those standards mature, robots.txt remains the primary mechanism for controlling AI bot access. Keep your configuration current, verify it regularly, and update it as new bots emerge.
Check your brand’s AI visibility score at iscore.ai
Frequently Asked Questions
Does blocking GPTBot affect my ChatGPT search visibility?
No. OpenAI separates its crawlers into distinct user agents. GPTBot collects data for model training. OAI-SearchBot powers ChatGPT’s search functionality. Blocking GPTBot has zero impact on whether ChatGPT cites your content in search responses. You can safely block GPTBot while allowing OAI-SearchBot to maintain full ChatGPT search visibility.
What percentage of websites currently block AI training bots?
According to Hostinger’s 2026 analysis of 66.7 billion web crawler requests, GPTBot (OpenAI’s training crawler) now has only 12% website coverage, down from 84% previously. This means approximately 88% of analyzed websites block GPTBot. Conversely, OAI-SearchBot (OpenAI’s search crawler) has 55.67% coverage, indicating that more than half of websites explicitly allow AI search crawling while blocking training crawling.
Should I implement llms.txt alongside robots.txt?
Yes. The two files serve complementary purposes. robots.txt controls which bots can access your site and which pages they can crawl (access control). llms.txt helps AI engines understand your site structure, priority content, and expertise signals (content guidance). While no major AI platform has officially committed to parsing llms.txt, thousands of sites have already adopted it, and implementing it now positions you for future adoption. There is no downside to having both files.
How do I verify that Cloudflare is not overriding my robots.txt?
Fetch your live robots.txt from outside your Cloudflare network by using curl or a web browser from a different network: curl https://yourdomain.com/robots.txt. Compare the output to your origin server’s robots.txt file. If they differ, Cloudflare’s managed robots.txt feature may be injecting directives at the edge. Check your Cloudflare dashboard under Security > Bots > AI Bot Management to review any CDN-level overrides.
Will new AI bots require robots.txt updates?
Yes. New AI crawlers emerge regularly as new platforms launch and existing platforms add capabilities. Review your robots.txt quarterly and whenever a major AI platform announces new crawler user agents. Subscribe to announcements from Search Engine Land, Search Engine Journal, and The Searchless Journal for coverage of new AI bot deployments and their robots.txt user agents.