Your robots.txt file might be costing you AI visibility, and you would never know it.
A Hostinger analysis of 66.7 billion web crawler requests revealed a striking pattern: websites are increasingly welcoming AI search bots while blocking AI training crawlers, and the gap is widening dramatically. OAI-SearchBot (OpenAI’s search crawler) now has 55.67% website coverage. Meanwhile, GPTBot (OpenAI’s training crawler) dropped from 84% coverage to just 12%.
This is the new robots.txt reality. AI companies have split their crawlers into separate user agents for search and training, and your robots.txt needs to reflect that distinction. Block the wrong bot and you disappear from ChatGPT’s search results. Allow the wrong bot and your content feeds someone’s training data for free.
Here is the complete playbook for getting it right in 2026.
The Bot Landscape: Search vs. Training
The single most important concept in AI-era robots.txt management is the separation between search bots and training bots. Every major AI company now operates multiple crawlers, each with a distinct purpose.
OpenAI’s Bot Family
| Bot Name | Purpose | Recommendation |
|---|---|---|
| OAI-SearchBot | Powers ChatGPT search results | Allow |
| ChatGPT-User | Fetches pages when users share URLs in chat | Allow |
| GPTBot | Crawls content for model training | Block (unless you want to contribute training data) |
This is the split that matters most. OAI-SearchBot is how your content appears when someone asks ChatGPT a question. GPTBot is how OpenAI collects training data for future models. Blocking GPTBot has zero impact on your ChatGPT search visibility. Blocking OAI-SearchBot removes you from ChatGPT search entirely.
Anthropic’s Bot Family
Anthropic clarified its crawler documentation in February 2026, making the distinction explicit. According to Search Engine Land:
| Bot Name | Purpose | Recommendation |
|---|---|---|
| ClaudeBot | Powers Claude’s web search and retrieval | Allow |
| ClaudeBot-Training | Crawls content for model training | Block (optional) |
The naming convention here is cleaner than OpenAI’s. ClaudeBot-Training is clearly labeled. If you block ClaudeBot, Claude cannot cite your content in responses.
Google’s Bot Family
| Bot Name | Purpose | Recommendation |
|---|---|---|
| Googlebot | Standard search indexing | Allow |
| Google-Extended | Gemini training and AI features | Block (if you do not want training contribution) |
| GoogleOther | Supplementary crawling for various Google products | Allow |
Google-Extended was introduced specifically to give publishers control over AI training contribution without affecting their Google Search visibility. Blocking it has no impact on your organic Google rankings or AI Overview appearances.
Other AI Bots
| Bot Name | Platform | Purpose | Recommendation |
|---|---|---|---|
| PerplexityBot | Perplexity | Search and retrieval | Allow |
| Bingbot | Microsoft/Copilot | Search indexing + Copilot | Allow |
| Applebot | Apple/Siri | Search and AI features | Allow |
| Applebot-Extended | Apple | AI training | Block (optional) |
| CCBot | Common Crawl | Training data | Block |
| Bytespider | TikTok/ByteDance | Training data | Block |

The Recommended Robots.txt Template
Based on the current bot landscape and the principle of maximizing AI search visibility while minimizing unwanted training data contribution, here is the recommended robots.txt configuration for 2026:
```
# ================================
# AI Search Bots - ALLOW
# These power search results in AI platforms
# ================================

# OpenAI ChatGPT Search
User-agent: OAI-SearchBot
Allow: /

# ChatGPT User URL fetching
User-agent: ChatGPT-User
Allow: /

# Perplexity Search
User-agent: PerplexityBot
Allow: /

# Anthropic Claude Search
User-agent: ClaudeBot
Allow: /

# Microsoft Bing / Copilot
User-agent: Bingbot
Allow: /

# Apple / Siri
User-agent: Applebot
Allow: /

# Google supplementary
User-agent: GoogleOther
Allow: /

# ================================
# AI Training Bots - BLOCK
# These collect data for model training
# ================================

# OpenAI Training
User-agent: GPTBot
Disallow: /

# Google Gemini Training
User-agent: Google-Extended
Disallow: /

# Apple AI Training
User-agent: Applebot-Extended
Disallow: /

# Anthropic Training
User-agent: ClaudeBot-Training
Disallow: /

# Common Crawl (training data)
User-agent: CCBot
Disallow: /

# ByteDance / TikTok
User-agent: Bytespider
Disallow: /

# ================================
# Standard Search Engines - ALLOW
# ================================
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Why grouping matters more than order
Compliant robots.txt parsers do not read the file top to bottom and stop at the first rule. Under RFC 9309, each crawler obeys only the group whose User-agent line most specifically matches its own token, falling back to the `User-agent: *` group when no named group matches; within a group, the longest matching path rule wins. The practical consequence: a blanket `User-agent: *` rule never applies to a bot that has its own named group, so every crawler you want to treat differently, allowed or blocked, needs its own explicit block, as in the template above. The labeled sections are for human maintainability; parsers do not care about the order.
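You can sanity-check a draft configuration before deploying it with Python's built-in `urllib.robotparser`. One caveat: the stdlib parser does not support wildcards in paths and matches rules in file order rather than by longest path, but for simple allow/block groups like the template above it gives the right answer. A minimal sketch against a trimmed version of the template:

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the recommended template: search bots allowed,
# training bots blocked, everyone else allowed by default.
ROBOTS = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

url = "https://yourdomain.com/blog/some-post"
for bot in ["OAI-SearchBot", "GPTBot", "Google-Extended", "SomeOtherBot"]:
    verdict = "allowed" if parser.can_fetch(bot, url) else "blocked"
    print(f"{bot}: {verdict}")
```

Running this confirms that OAI-SearchBot stays allowed while GPTBot and Google-Extended are blocked, and that unknown bots fall through to the `User-agent: *` group.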
The Cloudflare Complication
If you use Cloudflare (as over 30% of websites do), there is an additional layer to manage: Cloudflare’s AI Crawl Control feature, managed from the dashboard, can inject additional robots.txt directives at the edge.
According to Cloudflare’s documentation, when managed robots.txt is enabled, Cloudflare may add or modify directives before they reach the crawler. This means the robots.txt you see on your server might not match what bots actually receive.
Action required: If you use Cloudflare’s managed robots.txt feature, always verify the live version by fetching https://yourdomain.com/robots.txt from an external network (not from within Cloudflare’s dashboard). The NytroSEO guide recommends checking this monthly, especially after Cloudflare updates.
Also check Cloudflare’s AI Bot Management settings in your dashboard. Cloudflare offers one-click toggles for AI training bots that can override your robots.txt configuration at the WAF level.
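A quick way to catch edge-injected directives is to diff the robots.txt on your origin server against what the world actually sees. A minimal sketch; in practice you would fetch the live copy over HTTP (e.g. with `urllib.request.urlopen`), but the example diffs two in-memory copies so the logic is testable:

```python
def injected_directives(origin_txt: str, live_txt: str) -> list[str]:
    """Return non-comment directive lines served at the edge that are absent from origin."""
    origin_lines = {line.strip() for line in origin_txt.splitlines()}
    return [
        line.strip()
        for line in live_txt.splitlines()
        if line.strip() and not line.startswith("#") and line.strip() not in origin_lines
    ]

# In practice: live = urlopen("https://yourdomain.com/robots.txt").read().decode()
origin = "User-agent: GPTBot\nDisallow: /\n"
live = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n"
print(injected_directives(origin, live))  # -> ['User-agent: CCBot']
```

Any output here means something between your origin and the visitor is rewriting the file; with Cloudflare, that is usually the managed robots.txt feature.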
The Nuanced Decision: Partial Access
The binary allow/block approach works for most sites. But some publishers benefit from a more granular strategy.
Allow search bots to specific sections only
If you want ChatGPT to cite your blog content but not crawl your product pages (where pricing or inventory might be sensitive):
```
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /resources/
Allow: /guides/
Disallow: /products/
Disallow: /pricing/
Disallow: /account/
```
This gives you AI search visibility for content marketing assets while protecting commercial pages from being directly quoted in AI responses.
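The same `urllib.robotparser` check works for path-level rules (paths here are the hypothetical ones from the snippet above). Note that the stdlib parser returns the first matching rule rather than the longest match, so listing Allow lines before Disallow lines, as above, is the safe ordering when testing with it:

```python
from urllib.robotparser import RobotFileParser

# Mirror of the partial-access rules above.
RULES = """\
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /resources/
Disallow: /pricing/
Disallow: /account/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("OAI-SearchBot", "https://yourdomain.com/blog/post"))    # allowed
print(rp.can_fetch("OAI-SearchBot", "https://yourdomain.com/pricing/pro"))  # blocked
```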
The blog-only strategy for e-commerce
E-commerce sites face a specific tension: they want AI engines to recommend their products but do not want product descriptions scraped verbatim. A middle ground:
```
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /guides/
Allow: /reviews/
Allow: /collections/
Disallow: /products/*/variants
Disallow: /cart/
Disallow: /checkout/
```
This allows AI engines to find and cite your content marketing and category pages while blocking granular product variant pages that change frequently.
Time-delayed access
Some publishers use CDN-level rules to delay AI bot access to new content by 24-48 hours, ensuring that paying subscribers or email list members see content first before it becomes available to AI engines. This is not configurable via robots.txt alone; it requires server-side or CDN-level logic.
The Data on What Happens When You Get It Wrong
Blocking AI search bots has measurable consequences. Here is what the data shows:
Sites that block OAI-SearchBot lose their ChatGPT search citations entirely, which is expected. The downstream effects are less obvious: they also lose the visitors who would have discovered them through ChatGPT and then visited the site, linked to it, or shared the content.
Sites that block all AI bots indiscriminately (often via a blanket User-agent: * Disallow with a named exception only for Googlebot) lose visibility across the entire AI engine ecosystem. Given the reported 527% growth in AI referral traffic, this is an increasingly costly mistake.
Sites that allow training bots contribute to the training data that makes AI engines better at answering questions about their domain. Some publishers view this as a net positive (more accurate AI responses about their brand) while others view it as giving away intellectual property.
The Hostinger data tells us the market consensus: allow search, block training. That 55.67% OAI-SearchBot coverage versus 12% GPTBot coverage represents the equilibrium most webmasters have reached.
llms.txt: The Complementary File
While robots.txt controls crawler access, llms.txt is a newer protocol that helps AI engines understand your site’s structure and priority content. Think of robots.txt as the bouncer (who gets in) and llms.txt as the concierge (where to go once inside).
Thousands of sites now implement llms.txt, although no major AI platform has officially committed to parsing it. The protocol lets you specify:
- Priority pages that should be cited first
- Site structure and content hierarchy
- Author expertise signals and credentials
- Content freshness indicators
A complete AI visibility strategy uses both files: robots.txt for access control, llms.txt for content guidance. They are complementary, not competing.
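For reference, a minimal llms.txt follows the proposed format: an H1 title, a blockquote summary, then H2 sections listing priority links with short descriptions. The paths below are placeholders:

```
# Your Company
> One-sentence summary of what the site covers and who it is for.

## Guides
- [Getting started](https://yourdomain.com/guides/getting-started): Intro for new users

## Blog
- [AI search explained](https://yourdomain.com/blog/ai-search): Our most-cited article
```

The file lives at the site root (https://yourdomain.com/llms.txt), alongside robots.txt.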
Testing and Verification
After updating your robots.txt, verify that it works correctly:
1. Google’s robots.txt report
Google Search Console includes a robots.txt report showing how Google fetched and parsed your file. It only validates Google’s own crawlers, but syntax errors it surfaces will typically affect every bot that reads the file.
2. Manual verification for AI bots
Fetch your live robots.txt and confirm each AI bot’s rules. Use a tool like curl:
```
curl -s https://yourdomain.com/robots.txt | grep -A 2 "OAI-SearchBot"
```
3. Monitor AI citations after changes
After updating robots.txt, monitor your AI visibility for 2-4 weeks. If you previously blocked AI search bots and now allow them, expect a gradual increase in AI citations as bots recrawl your content. The timeline varies by platform: ChatGPT’s crawl cadence differs from Perplexity’s, which differs from Claude’s.
Tools like iScore.ai track citation changes across multiple AI engines, making it straightforward to measure the impact of robots.txt changes on your AI visibility score.
4. Check Cloudflare or CDN overrides
If you use a CDN with bot management features, verify that CDN-level rules are not overriding your robots.txt directives. This is the most common source of “I updated robots.txt but nothing changed” issues.
The Crawl Budget Consideration
AI search bots create additional crawl load on your server. For large sites with millions of pages, this is a legitimate consideration.
Each AI platform crawls at different rates:
- OAI-SearchBot respects crawl-delay directives and typically crawls at a moderate pace
- PerplexityBot has been criticized for aggressive crawling in the past but has improved
- ClaudeBot generally maintains conservative crawl rates
If crawl budget is a concern, consider:
```
User-agent: OAI-SearchBot
Allow: /
Crawl-delay: 5

User-agent: PerplexityBot
Allow: /
Crawl-delay: 10
```
Note: Not all bots honor crawl-delay. It is a suggestion, not an enforcement mechanism. For strict rate limiting, use server-side rules.
The Strategic Framework
Here is the decision matrix for your robots.txt AI bot policy:
| Your Priority | Search Bots | Training Bots | llms.txt |
|---|---|---|---|
| Maximum AI visibility | Allow all | Allow all | Yes, comprehensive |
| Balanced (recommended) | Allow all | Block all | Yes, comprehensive |
| Protective | Allow selectively | Block all | Yes, limited |
| Maximum control | Block all | Block all | No |
Most businesses should use the “balanced” approach: maximum AI search visibility combined with training data protection. This is what the market data says most webmasters have already concluded.
The “maximum control” approach (blocking everything) is appropriate only for sites with genuinely proprietary content where any AI exposure creates business risk, such as premium research databases or subscription-only publications.
The Coming Standards Battle
The current robots.txt approach to AI bot management is a pragmatic hack, not a permanent solution. The protocol was designed in 1994 for a world with a handful of search engines. In 2026, there are dozens of AI bots with different purposes, and the list grows monthly.
Industry groups are working on more sophisticated standards: machine-readable licensing terms, granular content permissions, and compensation frameworks. The TDM (Text and Data Mining) Reservation Protocol from the W3C and the AI Data Alliance’s proposed framework are both in development.
Until those standards mature, robots.txt remains the primary mechanism for controlling AI bot access. Keep your configuration current, verify it regularly, and update it as new bots emerge.
Check your brand’s AI visibility score at iscore.ai
Frequently Asked Questions
Does blocking GPTBot affect my ChatGPT search visibility?
No. OpenAI separates its crawlers into distinct user agents. GPTBot collects data for model training. OAI-SearchBot powers ChatGPT’s search functionality. Blocking GPTBot has zero impact on whether ChatGPT cites your content in search responses. You can safely block GPTBot while allowing OAI-SearchBot to maintain full ChatGPT search visibility.
What percentage of websites currently block AI training bots?
According to Hostinger’s 2026 analysis of 66.7 billion web crawler requests, GPTBot (OpenAI’s training crawler) now has only 12% website coverage, down from 84% previously. This means approximately 88% of analyzed websites block GPTBot. Conversely, OAI-SearchBot (OpenAI’s search crawler) has 55.67% coverage, indicating that more than half of websites explicitly allow AI search crawling while blocking training crawling.
Should I implement llms.txt alongside robots.txt?
Yes. The two files serve complementary purposes. robots.txt controls which bots can access your site and which pages they can crawl (access control). llms.txt helps AI engines understand your site structure, priority content, and expertise signals (content guidance). While no major AI platform has officially committed to parsing llms.txt, thousands of sites have already adopted it, and implementing it now positions you for future adoption. There is no downside to having both files.
How do I verify that Cloudflare is not overriding my robots.txt?
Fetch your live robots.txt from outside your Cloudflare network by using curl or a web browser from a different network: curl https://yourdomain.com/robots.txt. Compare the output to your origin server’s robots.txt file. If they differ, Cloudflare’s managed robots.txt feature may be injecting directives at the edge. Check your Cloudflare dashboard under Security > Bots > AI Bot Management to review any CDN-level overrides.
Will new AI bots require robots.txt updates?
Yes. New AI crawlers emerge regularly as new platforms launch and existing platforms add capabilities. Review your robots.txt quarterly and whenever a major AI platform announces new crawler user agents. Subscribe to announcements from Search Engine Land, Search Engine Journal, and The Searchless Journal for coverage of new AI bot deployments and their robots.txt user agents.