SPOKE · DISCOVERY — HOW AGENTS FIND YOU

robots.txt and agents.txt for Commerce — Allowlisting the AI Agent Bots That Buy.

AI bots that visit your commerce site split into three categories with three different policy defaults. AGENT BUYERS fetch your product pages on a live user's behalf right now — block them and you disappear from every AI-assisted shopping session. AGENT INDEXERS build the context indexes that AI answer surfaces draw from — block them and you vanish from ChatGPT, Perplexity, and Google AI Mode the same way you would vanish from Google if you blocked Googlebot. AI TRAINING CRAWLERS — GPTBot, ClaudeBot, meta-externalagent, and their peers — are operator's call, but never a blanket block. Knowing how to allow AI agent bots in robots.txt by category is the first step in any discovery strategy. robots.txt is the outermost gate of the discovery quadrant (robots.txt → sitemap → llms.txt → /agents). Get this layer right before optimizing the inner ones.

18 bots covered · 3 bot categories · RFC 9309 governing standard · 4 discovery layers

§1 · Discovery as Problem #1

Why agents can't find products — and why robots.txt is the first gate.

The AgentMall Roadmap frames the agent-commerce problem as three sequential failures. Failure #1: agents can't find products. Failure #2: products aren't machine-readable. Failure #3: there's no agent-native checkout. The 4-Layer Agent-Ready Model (Structured Data → API → MCP → UCP) addresses Failures #2 and #3 — but only after an agent has found the page. Discovery is the precondition for everything else.

Traditional SEO assumes a specific journey: human → search engine → ranked link → browser click. robots.txt in that model is a contract between you and Googlebot. AI agents break that assumption at two points. First, agents act — they don't merely retrieve. An AI agent completing a shopping task may parse prices, compare attributes, add to cart, or pass structured data to another agent downstream. Whether your product data is findable by that agent is prior to all of that. Second, discovery now happens through multiple pipes simultaneously. A product page may be surfaced to a human by Google Search, to a ChatGPT user by OAI-SearchBot's index, to a Perplexity user by PerplexityBot's index, and retrieved live by Claude-User during an active shopping session — all in the same hour. These pipes use different user-agents, obey robots.txt independently, and can be controlled separately.

Failure #1 · Discovery

Agents can't find products

robots.txt is the gate. Block the wrong bots and your entire catalog disappears from AI commerce before a single structured-data field is read. This spoke owns the outermost layer.

Failure #2 · Readability

Products aren't machine-readable

Schema.org Product blocks, GTIN, AggregateRating — Layer 1 of the 4-Layer Model. Covered in depth on the Product Data spoke.

Failure #3 · Checkout

No agent-native checkout

Layer 4 / UCP — the agent-native checkout state machine. Covered on the UCP spoke. None of it fires until Discovery is open.

The Discovery Quadrant

This spoke (robots.txt) is one corner of a four-corner discovery stack: robots.txt controls which bots can access which paths; the Agent-Readable Sitemap gives AI crawlers a structured URL inventory; Agent SEO covers the ranking signals AI agents use once they've found you; and the /agents page declares what actions agents can take. Build them in order. robots.txt first — a perfect sitemap is useless if OAI-SearchBot is blocked from reading it.

§2 · Bot Taxonomy

Three categories. Three policies. Never a blanket block.

The single most expensive mistake in AI commerce robots.txt configuration is treating all AI bots as a single category. They are not. The three-bucket taxonomy below maps to fundamentally different policy defaults. Allowlisting all AI bots (naive) and blocking all AI bots (equally naive) are both wrong for a commerce site.

Category 1: AGENT BUYERS — Allow unconditionally

These bots fetch content on a real user's behalf during an active task. Not used for automated crawls or model training. The user is waiting. Blocking these bots means your page never enters that user's decision loop — the equivalent of a physical store locking its door the moment a customer arrives.

Bot | User-Agent Token | Vendor | Commerce Policy
ChatGPT browsing | ChatGPT-User | OpenAI | Allow: / (unconditionally)
Claude browsing | Claude-User | Anthropic | Allow: / (unconditionally)
Perplexity browsing | Perplexity-User | Perplexity | Allow: / (ignores robots.txt anyway per vendor docs)
Amazon / Alexa queries | Amzn-User | Amazon | Allow: / (unconditionally)

Critical · Perplexity-User Note

Perplexity's documentation states that Perplexity-User "generally ignores robots.txt rules when a user requested the fetch." Your Allow rule therefore matches what the bot does anyway: it will visit your page. The flip side is that you may not be able to block it through robots.txt alone even if you wanted to. Keep the Allow directive in place regardless.

Category 2: AGENT INDEXERS — Allow broadly

These bots crawl the web to build indexes that AI answer surfaces draw from. Not used for model training. Functionally analogous to Googlebot but for AI answer layers — blocking them has the same effect as blocking Googlebot: you disappear from the results.

Bot | User-Agent Token | Vendor | Commerce Policy
ChatGPT search index | OAI-SearchBot | OpenAI | Allow: / (broadly)
Perplexity index | PerplexityBot | Perplexity | Allow: / (broadly)
Claude search index | Claude-SearchBot | Anthropic | Allow: / (broadly)
Gemini training/grounding token | Google-Extended | Google | Allow: / (control token only; see note)
Spotlight / Siri / Safari | Applebot | Apple | Allow: / (broadly)
Apple Intelligence training token | Applebot-Extended | Apple | Allow: / to stay in Apple Intelligence; Disallow: / to opt out
Alexa / Rufus search | Amzn-SearchBot | Amazon | Allow: / (broadly)

Google-Extended Is a Control Token, Not a Crawler

Google-Extended does not send independent HTTP requests. It is a robots.txt control token that instructs Google's existing crawlers (which have already crawled your content) whether that content may be used for Gemini model training and grounding. Disallowing Google-Extended does not affect Google Search rankings — Google's own documentation is explicit on this point. It only removes your content from Gemini Apps and Vertex AI grounding responses. Keep it Allowed unless you have a specific reason to opt out of Gemini grounding. (Re-verify before launch.)

Category 3: AI TRAINING CRAWLERS — Operator's call

These bots crawl content to train generative models. Not used for search indexing or user-task retrieval. Blocking them does NOT remove you from AI search — they use different user-agent tokens than the indexers and buyers and serve a completely separate function. See §7 for the full SEO-vs-training tension analysis before deciding.

Bot | User-Agent Token | Vendor | Default Commerce Posture
OpenAI training | GPTBot | OpenAI | Disallow: / (optional; see §7 for the nuance)
Anthropic training | ClaudeBot | Anthropic | Disallow: / (optional)
Meta AI training | meta-externalagent | Meta | Disallow: / (optional)
Amazon training | Amazonbot | Amazon | Disallow: / (optional)
ByteDance training | Bytespider | ByteDance | Disallow: / + firewall rule (known compliance issues)
Common Crawl training | CCBot | Common Crawl | Disallow: / (optional; feeds open-source models)

§3 · Full Bot Directory

18 bots, verified against vendor docs. User-agent strings included.

The table below covers all 18 AI bots relevant to commerce operators as of this writing. User-agent string version numbers change without notice — always re-verify against vendor documentation before deploying. The user-agent token (the short keyword) is the robots.txt matching key; the full string appears in HTTP headers and server logs. (Re-verify all user-agent strings before launch.)

User-Agent Token | Vendor | Type | Commerce Policy | Vendor Docs
ChatGPT-User | OpenAI | Agent Buyer | Allow: / | platform.openai.com/docs/bots
OAI-SearchBot | OpenAI | Agent Indexer | Allow: / | platform.openai.com/docs/bots
GPTBot | OpenAI | Training Crawler | Disallow: / (optional; see §7) | platform.openai.com/docs/bots
Claude-User | Anthropic | Agent Buyer | Allow: / | support.anthropic.com
Claude-SearchBot | Anthropic | Agent Indexer | Allow: / | support.anthropic.com
ClaudeBot | Anthropic | Training Crawler | Disallow: / (optional) | support.anthropic.com
Perplexity-User | Perplexity | Agent Buyer | Allow: / (ignores robots.txt on user-triggered fetches) | docs.perplexity.ai
PerplexityBot | Perplexity | Agent Indexer | Allow: / | docs.perplexity.ai
Google-Extended | Google | Control token (Gemini training/grounding) | Allow: / (does not affect Search ranking) | developers.google.com
Applebot | Apple | Agent Indexer (Spotlight/Siri/Safari) | Allow: / | support.apple.com
Applebot-Extended | Apple | Training opt-out token | Allow: / to keep Apple Intelligence; Disallow: / to opt out | support.apple.com
Amzn-User | Amazon | Agent Buyer (Alexa queries) | Allow: / | developer.amazon.com/amazonbot
Amzn-SearchBot | Amazon | Agent Indexer (Alexa/Rufus search) | Allow: / | developer.amazon.com/amazonbot
Amazonbot | Amazon | Training Crawler | Disallow: / (optional) | developer.amazon.com/amazonbot
meta-externalagent | Meta | Training Crawler / AI Indexer | Disallow: / (optional) | developers.facebook.com
Bytespider | ByteDance | Training Crawler | Disallow: / + firewall rule (documented compliance issues) | knownagents.com (re-verify)
CCBot | Common Crawl | Training Crawler (open dataset) | Disallow: / (optional; see §7) | commoncrawl.org/ccbot
Diffbot | Diffbot | Knowledge Graph Indexer | Allow: / | docs.diffbot.com

Tip · Legacy Anthropic Tokens

The user-agent tokens anthropic-ai and claude-web appear in community blocklists and older log files. Anthropic's current documentation lists ClaudeBot, Claude-User, and Claude-SearchBot as the three primary bots — neither legacy token is listed as active. If you see them in your own logs, add Disallow rules for those tokens and treat them as training crawlers. (Re-verify before launch.)

Full user-agent strings (for log matching)

User-agent tokens are the keys for robots.txt rules. Full strings appear in server logs. Match by substring — version numbers change. (Re-verify all strings before launch.)

Token | Full User-Agent String (re-verify before launch)
ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
OAI-SearchBot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot
GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot
Claude-User | Claude-User (full string not published by Anthropic; token is the key)
Claude-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-SearchBot/1.0; +Claude-SearchBot@anthropic.com)
ClaudeBot | ClaudeBot (token; full string varies, so match by substring)
Perplexity-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)
PerplexityBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
Applebot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
Amzn-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amzn-User/0.1) Chrome/119.0.6045.214 Safari/537.36
Amzn-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amzn-SearchBot/0.1) Chrome/119.0.6045.214 Safari/537.36
Amazonbot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36
meta-externalagent | meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
Bytespider | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
CCBot | CCBot/2.0 (https://commoncrawl.org/faq/)

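Substring matching against logged user-agent strings can be sketched in Python as follows. This is a minimal illustration, not a vendor-endorsed classifier: the names BOT_CATEGORIES and classify_user_agent are hypothetical, and the token lists simply restate the three categories from §2 (control tokens are omitted because they never appear in request headers).

```python
# Hypothetical mapping of the §2 taxonomy; re-verify tokens before relying on it.
BOT_CATEGORIES = {
    "agent_buyer": ["ChatGPT-User", "Claude-User", "Perplexity-User", "Amzn-User"],
    "agent_indexer": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot",
                      "Applebot", "Amzn-SearchBot", "Diffbot"],
    "training_crawler": ["GPTBot", "ClaudeBot", "meta-externalagent",
                         "Amazonbot", "Bytespider", "CCBot"],
}


def classify_user_agent(user_agent):
    """Return (category, token) for the first known token found in the
    user-agent string, or None if no known token is present.

    Matching is a case-insensitive substring test, so version numbers
    in the full string do not matter.
    """
    ua_lower = user_agent.lower()
    for category, tokens in BOT_CATEGORIES.items():
        for token in tokens:
            if token.lower() in ua_lower:
                return category, token
    return None


gptbot_ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
             "compatible; GPTBot/1.3; +https://openai.com/gptbot")
print(classify_user_agent(gptbot_ua))
```

Because user-agent strings can be spoofed, a match here only says what the request claims to be; pair it with the IP checks in §8 before trusting it.
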
§4 · robots.txt Syntax for Commerce

RFC 9309 rules that matter for commerce configuration.

robots.txt was formalized as IETF RFC 9309 in September 2022, codifying rules that most major bots already follow. The rules below have direct commerce implications — getting any of them wrong can silently open paths you meant to close or close paths you meant to open.

Rule | RFC 9309 Reference | Commerce Implication
Longest-match wins | §2.2.2: "The most specific match found MUST be used." | You can open /products/ to a training crawler while closing everything else with Disallow: /. The 10-character /products/ beats the 1-character /.
Allow beats Disallow on ties | RFC 9309: equal-length conflicting rules → Allow takes precedence. Google implements this explicitly. | If you have both Allow: /products and Disallow: /products (same length), the Allow wins. Google confirms this behavior in its own crawler documentation.
User-agent matching is case-insensitive | RFC 9309 §2.2.1 | User-agent: GPTBot and User-agent: gptbot are equivalent. Use the vendor-documented casing for clarity, not correctness.
Paths are case-sensitive | RFC 9309 §2.2 | Disallow: /Private/ does NOT match /private/. Verify the actual casing your server uses before writing Disallow rules for sensitive paths.
User-agent groups are independent | RFC 9309 §2.1: a group is terminated by the next User-agent line or the end of the file (blank lines are only visual separators) | A rule for GPTBot does not apply to ChatGPT-User even though both are from OpenAI. Different user-agent tokens, different groups, different rules. Configure them independently.
Wildcards in paths | RFC 9309 §2.2.3: * matches any character sequence; $ anchors the end | Disallow: /admin* blocks /admin, /admin/, and /adminpanel. Disallow: /*.pdf$ blocks only URLs ending in .pdf.
Sitemap directive | Not part of the RFC 9309 user-agent/rule grammar; widely supported extension | Place Sitemap: lines outside all group blocks (at the end of the file or after all group rules). Include the absolute URL. Multiple Sitemap lines are valid. AI indexers use this to accelerate product discovery.
File constraints | RFC 9309 §2.5: crawlers must parse at least 500 KiB | Serve at exactly /robots.txt (case-sensitive path). UTF-8 encoding required. Google states a 500 KiB maximum parseable size. Keep yours under 100 KiB in practice.
Crawl-delay | Not defined by RFC 9309 | Anthropic's ClaudeBot supports it; Google and most major crawlers ignore it. Use it only as a courtesy signal, not as a primary rate-limit mechanism.

Critical · Blocking by IP Breaks robots.txt Reading

Blocking a vendor's published IP ranges at the firewall prevents that bot from reading your robots.txt — Anthropic explicitly warns that IP blocking "may impede Anthropic's ability to read robots.txt." A bot that cannot read robots.txt may treat your site as unconstrained. Use user-agent token rules in robots.txt as your primary control. Use IP blocking only as a supplementary layer for confirmed bad-actor crawlers (e.g., Bytespider after verifying disrespect in your logs).
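
The longest-match, Allow-on-tie, wildcard, and case-sensitivity rules from the table can be checked with a small evaluator. The sketch below is an illustration under stated assumptions, not a full RFC 9309 implementation: is_allowed and its helper are hypothetical names, rule length is counted in characters (equal to octets for ASCII paths), and percent-encoding normalization is skipped.

```python
import re


def _pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any character sequence; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Evaluate one user-agent group against a URL path.

    rules is a list of ("allow" | "disallow", pattern) tuples.
    The longest matching pattern wins; on equal length, allow wins.
    A path that matches no rule is allowed.
    """
    best_len, verdict = -1, True
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if _pattern_to_regex(pattern).match(path):  # case-sensitive match
            allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, verdict = len(pattern), allow
    return verdict


training_rules = [("allow", "/products/"), ("disallow", "/")]
print(is_allowed(training_rules, "/products/widget-pro"))
print(is_allowed(training_rules, "/blog/launch-post"))
```

Running candidate paths through a checker like this before deployment is a cheap way to catch a rule that silently opens or closes the wrong section of the store.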

§5 · Complete Reference robots.txt

A full, deployable file — copy, adapt paths, re-verify versions.

The file below is a complete, runnable robots.txt for an agent-friendly commerce site. Every user-agent token in the file is documented in §3. Adapt the Sitemap URLs and any path-specific rules to match your actual store structure. Re-verify version numbers in user-agent strings before launch — vendors update them without announcement.

robots.txt — Full Allow for Buyers + Indexers, Block for Training Crawlers
# ================================================================
# robots.txt — Agent-Friendly Commerce Configuration
# Last structural review: see Source metadata in your CMS
# Cross-references:
#   /llms.txt       — AI-friendly content layout (see Agent Sitemap spoke)
#   /agents.json    — Agent capability manifest (see /agents page spoke)
# ================================================================

# ----------------------------------------------------------------
# CATEGORY 1: AGENT BUYERS
# These bots fetch pages on a live user's behalf.
# Block them = you disappear from AI-assisted shopping sessions.
# ----------------------------------------------------------------

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Amzn-User
Allow: /

# ----------------------------------------------------------------
# CATEGORY 2: AGENT INDEXERS
# These bots build the indexes that AI surfaces draw from.
# Block them = you disappear from AI answer layers.
# Google-Extended and Applebot-Extended are control tokens,
# not independent crawlers; allow them to stay in Gemini and Apple Intelligence.
# ----------------------------------------------------------------

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot
Allow: /

# Applebot-Extended: control token for Apple AI training.
# Allow here = your content can improve Apple Intelligence.
# Change to Disallow: / to opt out of Apple model training
# while keeping Spotlight/Siri search inclusion.
User-agent: Applebot-Extended
Allow: /

User-agent: Amzn-SearchBot
Allow: /

User-agent: Diffbot
Allow: /

# ----------------------------------------------------------------
# CATEGORY 3: AI TRAINING CRAWLERS
# These bots train generative models.
# Commerce note: blocking does NOT remove you from AI search.
# See the SEO-vs-Training section below — consider allowing
# /products/ paths if you want your product data in future
# training sets. Current config: full block.
# ----------------------------------------------------------------

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Amazonbot
Disallow: /

# Bytespider has a documented history of ignoring robots.txt.
# Also add a firewall rule at your CDN as a backup layer.
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Legacy/alternate Anthropic tokens — add if seen in logs
User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

# ----------------------------------------------------------------
# CATCH-ALL: Standard crawlers (Googlebot, Bingbot, etc.)
# Allow everything not covered above.
# ----------------------------------------------------------------

User-agent: *
Allow: /

# ----------------------------------------------------------------
# SITEMAP (see Agent Sitemap spoke for full sitemap strategy)
# ----------------------------------------------------------------

Sitemap: https://www.yourstore.com/sitemap.xml
Sitemap: https://www.yourstore.com/sitemap-products.xml
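
How a crawler picks its group from a file like the one above can be sketched in Python. This is a simplified illustration, assuming a well-formed file: parse_groups and rules_for are hypothetical helper names, SAMPLE is a shortened stand-in for the full file, and only User-agent, Allow, and Disallow lines are interpreted.

```python
SAMPLE = """
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://www.yourstore.com/sitemap.xml
"""


def parse_groups(text):
    """Return {lowercased user-agent token: [(kind, pattern), ...]}."""
    groups = {}
    current_agents = []
    seen_rule = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif key in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((key, value))
        # other records (for example Sitemap) are not group rules
    return groups


def rules_for(groups, token):
    """Case-insensitive token lookup, falling back to the '*' group."""
    return groups.get(token.lower(), groups.get("*", []))


groups = parse_groups(SAMPLE)
print(rules_for(groups, "gptbot"))
print(rules_for(groups, "ChatGPT-User"))
```

The lookup makes the independence of groups concrete: the GPTBot group's Disallow never reaches ChatGPT-User, and any token without its own group falls through to the catch-all.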

Partial-allow variant — open product pages to training crawlers

If you want training crawlers to learn your product catalog (normalizing your brand, products, and prices into future model knowledge) while protecting blog content, the longest-match rule makes this precise. The /products/ Allow (10 characters) beats the / Disallow (1 character), so product pages are crawled and your blog is not.

Partial-Allow Variant — Products Open, Everything Else Blocked
User-agent: GPTBot
Allow: /products/
Allow: /collections/
Disallow: /

User-agent: ClaudeBot
Allow: /products/
Allow: /collections/
Disallow: /

User-agent: meta-externalagent
Allow: /products/
Allow: /collections/
Disallow: /

Tip · Match Your Platform's URL Structure

The paths /products/ and /collections/ match Shopify's default URL structure. WooCommerce typically uses /product/ (singular) and /product-category/. BigCommerce uses /products/ and /categories/. Verify your store's actual paths before deploying — case-sensitive matching applies. See platform-specific retrofit guides: Shopify, WooCommerce, BigCommerce, Headless, Etsy.
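
To avoid hand-editing paths per platform, the partial-allow stanza can be generated. The sketch below is only an illustration: partial_allow_block and PLATFORM_PATHS are hypothetical names, and the path lists are the platform defaults quoted in the tip above, which may not match a customized store.

```python
# Default catalog paths as described in the tip; verify against your store.
PLATFORM_PATHS = {
    "shopify": ["/products/", "/collections/"],
    "woocommerce": ["/product/", "/product-category/"],
    "bigcommerce": ["/products/", "/categories/"],
}


def partial_allow_block(tokens, platform):
    """Build one robots.txt group per training-crawler token that opens
    the platform's catalog paths and closes everything else."""
    paths = PLATFORM_PATHS[platform]
    groups = []
    for token in tokens:
        lines = [f"User-agent: {token}"]
        lines += [f"Allow: {path}" for path in paths]
        lines.append("Disallow: /")
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


print(partial_allow_block(["GPTBot", "ClaudeBot"], "woocommerce"))
```

The Allow lines are emitted before Disallow: / for readability only; under longest-match evaluation the order within a group does not change the outcome.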

§6 · The Discovery Triad

robots.txt + llms.txt + /agents.json — three files, one discovery layer.

robots.txt is the outermost gate, but it is only one of three files that form the complete discovery layer for an agent-ready commerce site. The three specifications are additive, not competing — each answers a different question for an AI agent arriving at your domain.

File | Question It Answers | Current Status | Deploy Priority
/robots.txt | Can you access this path at all? Bot access policy: which crawlers can fetch which URL patterns. | Published standard: RFC 9309 (IETF, September 2022). Universally implemented. | Deploy now (required)
/llms.txt | What is worth reading? AI-friendly content layout: curated Markdown index of key pages for inference-time context. | Community proposal by Jeremy Howard at llmstxt.org, September 2024. No formal standards body adoption. Confirmed live 200 at: Anthropic developer docs, Stripe, Cloudflare developer docs, Cursor. (Re-verify before launch.) | Deploy now (high value)
/agents.json (or /agent-manifest.txt) | What can you do here? Agent capability manifest: what actions agents can take, what APIs/MCP servers exist, how to authenticate. | Multiple competing proposals; no adopted standard. Covered in full on the /agents page spoke. | Deploy as experimental (optional)

agents.txt — honest status

Two parallel proposals exist under the "agents.txt" label, and they are not the same specification.

IETF Internet-Draft draft-srijal-agents-policy-00 (filed October 2025 by Srijal Dutta): A strict plaintext policy file at /agents.txt with mandatory SHA-256 hash verification of the file's canonical content. Files with missing or mismatched hashes are treated as fully restrictive (all access denied). The draft is an individual submission — not IETF-endorsed, no assigned working group, and set to expire April 2026. Zero crawlers currently implement it. (Status volatile; re-verify before launch.)

agent-manifest.txt (originally agents.txt, renamed March 2026 by Jasper van Veen): A richer capability manifest covering action permissions, API/MCP server discovery, agent identity tiers, and authentication methods. Currently at Draft v0.3.0, licensed CC BY 4.0, hosted at github.com/jaspervanveen/agents-txt. No vendor implementation. (Status volatile; re-verify before launch.)

Proposal | File Path | Scope | Vendor Implementation
IETF draft-srijal-agents-policy-00 | /agents.txt | Access policy + SHA-256 hash verification | Zero (individual IETF draft, no working group)
agent-manifest.txt (van Veen, Draft v0.3.0) | /agent-manifest.txt | Action permissions + API/MCP discovery + auth methods | Zero (community proposal)
robots.txt (RFC 9309) | /robots.txt | Path-level crawl access | Universal (all major AI vendors)

Practical Advice for Commerce Operators

Do not block the path /agents.txt in robots.txt — you may interfere with future spec adoption. If you want to experiment with capability declaration before a standard is adopted, place a static JSON manifest at /agents.json and reference it from your /llms.txt file. This is compatible with any future standard that emerges. The /agents page spoke covers the full capability manifest pattern including MCP endpoint advertising and OAuth flow declaration.

llms.txt — brief frame for commerce

Jeremy Howard published the llms.txt proposal in September 2024 at llmstxt.org. The proposal asks sites to place a Markdown file at /llms.txt that provides LLM-friendly content: an H1 title, a blockquote summary, and organized links to key pages, optionally with .md versions of those pages for clean parsing. Deep coverage of llms.txt structure, hosting, and verification belongs in the Agent-Readable Sitemap spoke. The one robots.txt implication: make sure your robots.txt does not block AI indexers from reading /llms.txt itself — it would be self-defeating to publish an AI-friendly content guide and then block the bots that would use it.

§7 · SEO-vs-AI-Training Tension

Why commerce should not blanket-block training crawlers.

News publishers block training crawlers because their product is the article — letting GPTBot train on it means OpenAI can summarize the article without the reader visiting the publisher's site, and revenue evaporates. Commerce stores sell physical or digital goods. The product is the SKU. Training data has a fundamentally different effect on a commerce site.

Benefit · Allow Training

Brand normalization in model weights

Your brand, product names, prices, and attributes get encoded into model weights. Future AI users who ask "what is [your brand]?" get accurate answers. Product schema in training sets helps models understand your catalog structure.

Benefit · Allow Training

Open-source model ecosystem

Common Crawl (CCBot) data feeds hundreds of open-source models and research projects. Blocking CCBot removes your store from that entire ecosystem — including models that power future agent runtimes you cannot predict today.

Benefit · Block Training

Content control

Your product descriptions, pricing strategy, and editorial content are not feeding competitors' training sets. Reduces server load from large-scale crawls. Defensible if your product copy is a genuine differentiator.

The commerce-specific rule of thumb

Allow AGENT BUYERS unconditionally (they complete purchases on users' behalf). Allow AGENT INDEXERS broadly (they surface you in AI answers). Make a deliberate choice on training crawlers, not a reflexive block. If your product descriptions are generic (standard SKUs, commodity items), blocking training crawlers costs more than it saves. If your product copy is a genuine differentiator, a selective block using the partial-allow variant in §5 is defensible.

Operator Situation | Recommended Training Crawler Posture | Rationale
Commodity SKUs, standard product descriptions | Allow training crawlers (or use partial-allow) | Brand normalization value exceeds content protection value; descriptions are not a competitive differentiator
Unique editorial product copy, proprietary descriptions | Partial-allow: Allow: /products/, Disallow: / for training crawlers | Allow product data for brand normalization; block blog/editorial content that is a differentiator
Aggressive content strategy (buyer guides, reviews) | Full block for training crawlers; allow all indexers | Content is core IP; blocking training does not affect AI search since indexers are allowed
Early-stage store building brand awareness | Allow all training crawlers | Brand normalization in early training sets pays compounding dividends; content protection is lower priority at this stage

Critical · Blocking GPTBot ≠ Blocking ChatGPT Shopping

This is the single most common misconception in AI bot policy. GPTBot trains the underlying model. OAI-SearchBot builds the search index ChatGPT answers draw from. ChatGPT-User fetches your page live during a shopping session. These are three separate bots with three separate user-agent tokens. Blocking only GPTBot leaves OAI-SearchBot and ChatGPT-User fully active. Configure them independently — a robots.txt rule for one has zero effect on the others.

§8 · Verification

Confirming your rules work — curl, IP allowlists, log analysis.

Writing a robots.txt file is half the job. Verification confirms that the file is served correctly, that your server is not blocking bots at the CDN or firewall layer before they reach robots.txt, and that the right bot user-agent tokens are appearing in your logs on the right paths.

curl Tests

Simulate a specific bot's visit to verify robots.txt behavior. Note: curl -A spoofs the user-agent string in the HTTP request. The response reflects your server's behavior to that user-agent, but robots.txt compliance is evaluated by the bot — use this to verify server-side blocking, not bot obedience.

curl Test Patterns
# Fetch your robots.txt with response headers (-i) and confirm Content-Type is text/plain
curl -si https://www.yourstore.com/robots.txt

# Test whether GPTBot is blocked from your homepage
curl -A "GPTBot" -I https://www.yourstore.com/

# Test whether ChatGPT-User is allowed to a product page
curl -A "ChatGPT-User" https://www.yourstore.com/products/widget-pro

# Test whether OAI-SearchBot is allowed to your sitemap
curl -A "OAI-SearchBot" https://www.yourstore.com/sitemap.xml

# Verify Bytespider gets a 200 or 403 depending on your config
curl -A "Bytespider" -I https://www.yourstore.com/products/

Vendor IP Allowlists — Verifying Bot Identity

User-agent strings can be spoofed. Major vendors publish the IP ranges their bots crawl from, and some (Google, for example) also support reverse DNS verification. Use these to confirm that traffic claiming to be a legitimate bot is actually from that vendor.

Vendor | Published IP List (re-verify before launch) | Verification Method
OpenAI (GPTBot) | openai.com/gptbot.json | IP allowlist comparison
OpenAI (OAI-SearchBot) | openai.com/searchbot.json | IP allowlist comparison
OpenAI (ChatGPT-User) | openai.com/chatgpt-user.json | IP allowlist comparison
PerplexityBot | perplexity.com/perplexitybot.json | IP allowlist comparison
Perplexity-User | perplexity.com/perplexity-user.json | IP allowlist comparison
CCBot | index.commoncrawl.org/ccbot.json | IP allowlist + reverse DNS
Googlebot | Google IP ranges | Reverse DNS to *.googlebot.com or *.google.com + forward-verify
Amazonbot | developer.amazon.com/amazonbot/ip-addresses/ | IP allowlist comparison
Meta | Routes announced by AS32934 (query shown after the table) | ASN lookup (IP ranges change frequently)

Meta ASN query: whois -h whois.radb.net -- '-i origin AS32934' | grep ^route

Reverse DNS Verification (Googlebot)
# Step 1: Reverse DNS lookup of the suspicious IP
host <IP-FROM-LOG>
# Should resolve to *.googlebot.com or *.google.com

# Step 2: Forward-verify — resolved hostname should map back to original IP
host <resolved-hostname>
# Result should = original IP
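
The "IP allowlist comparison" method from the table can be sketched with Python's standard ipaddress module. This is a minimal illustration: ip_in_ranges is a hypothetical helper, and the CIDR ranges below are reserved documentation addresses, not any vendor's real ranges. In practice the list would come from the vendor's published file, whose format should be checked before parsing.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks; NOT real vendor ranges.
EXAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip, cidrs):
    """Return True if ip falls inside any of the given CIDR ranges.

    IPv4 and IPv6 are both accepted; an address never matches a
    network of the other IP version.
    """
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs)


print(ip_in_ranges("192.0.2.45", EXAMPLE_RANGES))
print(ip_in_ranges("198.51.100.7", EXAMPLE_RANGES))
```

A request whose user-agent claims a vendor token but whose source IP falls outside that vendor's published ranges is a candidate for firewall-level blocking.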

Google Search Console

The robots.txt report in Google Search Console shows parse errors and which rules Google applied. The URL Inspection tool shows whether a specific URL is blocked by robots.txt. Submit your robots.txt for immediate re-parse after any change.

Cloudflare AI Bot Controls

Cloudflare's managed robots.txt (Security → Bots → Configure Bot Fight Mode → Instruct bot traffic with robots.txt) automatically prepends directives blocking known AI training crawlers. Useful as a base layer, but generates a blanket block — review the generated output carefully to ensure it does not block your AGENT BUYER and AGENT INDEXER categories. The Cloudflare-generated directives may include tokens that overlap with indexers depending on the version of the managed list. (Re-verify Cloudflare directive set before launch.)

Log Analysis Pattern

Parse your access logs for AI bot activity. Most declared AI bots include a recognizable substring in the user-agent header. Control tokens such as Google-Extended never send requests of their own, so they do not appear in logs and are left out of the pattern. Run this grep against your Nginx or Apache access log:

grep Pattern — All Known AI Bot Tokens
grep -iE "GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Amazonbot|Amzn-SearchBot|Amzn-User|meta-externalagent|Bytespider|CCBot|Applebot|Diffbot" /var/log/nginx/access.log

Pipe to a frequency count by user-agent token to see which bots dominate your traffic:

# -o prints only the matched token, so uniq -c counts requests per bot
grep -oiE "GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Amazonbot|Amzn-SearchBot|Amzn-User|meta-externalagent|Bytespider|CCBot|Applebot|Diffbot" /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn \
  | head -20

§10 · Common Mistakes

Eight robots.txt errors that cut you off from AI commerce.

1. Wrong user-agent token for ChatGPT browsing

Using ChatGPT as the user-agent token — it is not valid. The bot that fetches on a user's behalf uses the token ChatGPT-User. The training crawler uses GPTBot. These are two entirely separate bots with separate tokens, separate functions, and separate robots.txt rules. A rule written for ChatGPT matches nothing. A rule written for GPTBot has zero effect on ChatGPT-User.

2. Blanket-blocking all AI bots and losing buyer-agent traffic

Adding a User-agent: * / Disallow: / block or applying Cloudflare's managed robots.txt without review catches all AI bots including Agent Buyers and Agent Indexers. The correct structure: explicit Allow: / rules for all AGENT BUYERS and AGENT INDEXERS by their specific user-agent tokens, then explicit Disallow: / for training crawlers you want to exclude, then User-agent: * / Allow: / for everything else. The specific rules override the wildcard for those bots.
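As a rough illustration of that structure, the sketch below compares a blanket block with a layered policy using Python's standard-library urllib.robotparser. The token-to-group assignment and the SomeOtherBot token are assumptions made for the example, and urllib.robotparser is not a complete RFC 9309 implementation (it matches agents by substring and applies rules in file order), so treat the result as a sanity check on group selection rather than a statement of how any vendor's crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# A blanket block: every crawler falls into the wildcard group.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

# Layered policy: specific groups first, wildcard last.
# Which tokens go in which group is illustrative only.
LAYERED_POLICY = """\
# Agent buyers and indexers: explicit allow
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /

# Training crawlers you chose to exclude
User-agent: GPTBot
User-agent: CCBot
Disallow: /

# Everything else
User-agent: *
Allow: /
"""


def allowed(robots_txt: str, token: str, url: str) -> bool:
    """Return True if `token` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


URL = "https://www.yourstore.com/products/widget-pro"

# "SomeOtherBot" is a hypothetical token that has no group of its own.
for token in ("ChatGPT-User", "OAI-SearchBot", "GPTBot", "SomeOtherBot"):
    print(token, allowed(BLANKET_BLOCK, token, URL), allowed(LAYERED_POLICY, token, URL))
```

Under the blanket block every token is refused; under the layered file the buyer and indexer tokens are allowed, the excluded training crawler is refused, and an unlisted token falls through to the wildcard group.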

3. Assuming blocking GPTBot stops ChatGPT from finding you

GPTBot trains the model weights. OAI-SearchBot builds the search index ChatGPT answers draw from. ChatGPT-User fetches your page live during a shopping session. Blocking only GPTBot leaves OAI-SearchBot and ChatGPT-User fully active — your products still surface in ChatGPT shopping. If you want to block all OpenAI access, you need separate Disallow: / rules for all three tokens independently.

4. Using meta robots noai/noimageai tags as the primary control

The HTML meta directives <meta name="robots" content="noai, noimageai"> are not part of RFC 9309 and are not universally respected. Meta (the company) explicitly uses robots.txt, not NoAI tags. Anthropic, Google, and OpenAI do not document support for these tags. Use robots.txt user-agent blocks as your primary control. HTML meta tags may evolve, but robots.txt is the standard today.

5. Blocking by IP instead of user-agent token

Blocking a vendor's published IP ranges at the firewall prevents the bot from reading your robots.txt at all. Anthropic explicitly warns that IP blocking "may impede Anthropic's ability to read robots.txt." A bot that cannot read robots.txt may treat your site as unconstrained. Use IP blocking only as a supplementary layer for confirmed bad-actor crawlers (for example, Bytespider, which has documented compliance issues), and only after setting the correct user-agent token rule in robots.txt first.

6. Forgetting the Sitemap directive

Without a Sitemap: directive, AI indexers may still crawl your site by following links, but they start discovery from whatever links they already know. The Sitemap directive is not part of the formal RFC 9309 user-agent/rule grammar, but it is widely supported as an extension. Add it at the end of your robots.txt file to give every crawler, AI and traditional, a systematic URL inventory. For the full sitemap strategy, see the Agent-Readable Sitemap spoke.

7. Case-sensitive path errors in Disallow rules

RFC 9309 specifies that paths in robots.txt are case-sensitive. Disallow: /Private/ does not match /private/. Disallow: /Admin/ does not match /admin/. Verify the actual URL casing your server uses before writing Disallow rules for sensitive paths — especially on case-insensitive file systems (most Windows servers) where the URL paths may look equivalent but robots.txt matching is case-sensitive regardless.
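A quick way to see the casing problem is to evaluate both spellings against the same rule. This is a minimal sketch using Python's standard-library urllib.robotparser; the /Private/report.html path and the yourstore.com host are placeholders, and the parser stands in for a crawler's matcher only as an approximation.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /Private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(path: str) -> bool:
    """True if the rule set above refuses `path` for a crawler."""
    return not parser.can_fetch("GPTBot", "https://www.yourstore.com" + path)


print(is_blocked("/Private/report.html"))  # True: casing matches the rule
print(is_blocked("/private/report.html"))  # False: lower-case path is not covered
```

If your server answers on both spellings, the second URL is served and crawlable even though the rule appears to cover it.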

8. Misconfiguring Google-Extended and losing Gemini grounding

Google-Extended does not send HTTP requests. It is a robots.txt control token that instructs Google's existing crawlers whether already-crawled content may be used for Gemini model training and grounding. Disallowing it does not protect you from Googlebot, does not affect Search rankings, and does not stop Google from knowing your content exists. It only removes your content from Gemini Apps responses and Vertex AI grounding. If you want to stay in Gemini while keeping Google Search rankings, keep Google-Extended Allowed.

§11 · FAQ

Frequently asked questions.

Will blocking GPTBot stop ChatGPT shopping from finding me?

No. GPTBot trains OpenAI's foundation models. The bots that surface your products to ChatGPT users are OAI-SearchBot (which builds ChatGPT's search index) and ChatGPT-User (which fetches your page live when a user asks a question). All three have different user-agent tokens. A Disallow: / for GPTBot only affects training data collection. To stop ChatGPT shopping traffic entirely, you would need separate Disallow: / rules for OAI-SearchBot and ChatGPT-User as well.

Do AI bots actually respect robots.txt?

Major commercial vendors (OpenAI, Anthropic, Perplexity, Google, Amazon, Apple, Meta) state in their documentation that their bots respect robots.txt. Empirically, the compliance record is mixed. Bytespider has a documented history of ignoring robots.txt directives. A 2024 investigation found that Perplexity-User was sending generic Chrome user-agent strings instead of its declared Perplexity-User token, effectively bypassing robots.txt rules. Common Crawl (CCBot) reports compliance. For crawlers with a compliance record of concern, layer server-side user-agent blocking or firewall rules on top of robots.txt.

Should I block Bytespider?

Yes, with robots.txt and a firewall rule. Bytespider is operated by ByteDance (TikTok's parent company). Stack Overflow users and independent researchers have documented that Bytespider ignores robots.txt and does not read the file before crawling. Adding a Disallow: / rule for Bytespider in robots.txt is the correct signal; however, given the compliance history, also add a user-agent string firewall block at your CDN or server to catch requests matching Bytespider in the user-agent header.
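If the block has to live in the application rather than at the CDN, one possible shape is a small WSGI middleware. The sketch below is an assumption-heavy illustration, not a vendor-recommended configuration: store_app is a hypothetical stand-in for the storefront, the substring list is a placeholder, and the choice to keep /robots.txt readable even for blocked agents is a design decision made here so the Disallow signal stays visible.

```python
# Lower-case substrings of user-agent headers to reject (illustrative list).
BLOCKED_UA_SUBSTRINGS = ("bytespider",)


class BlockUserAgents:
    """WSGI middleware that answers 403 when the user-agent matches a blocked substring."""

    def __init__(self, app, blocked=BLOCKED_UA_SUBSTRINGS):
        self.app = app
        self.blocked = tuple(s.lower() for s in blocked)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        path = environ.get("PATH_INFO", "")
        # Leave robots.txt readable so the Disallow rule can still be seen.
        if path != "/robots.txt" and any(s in user_agent for s in self.blocked):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def store_app(environ, start_response):
    """Hypothetical stand-in for the real storefront application."""
    body = b"ok\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = BlockUserAgents(store_app)
```

User-agent matching of this kind only catches requests that declare the token; it does nothing against a crawler that sends a generic browser string.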

What's the difference between GPTBot and ChatGPT-User?

GPTBot is an automated crawler that harvests web content to train OpenAI's foundation models. It runs continuously and at scale. ChatGPT-User is a user-triggered fetcher — when a ChatGPT user (or a Custom GPT) asks a question that requires visiting a specific URL, that single request is sent with the ChatGPT-User token. GPTBot is the training pipeline; ChatGPT-User is the live shopping/browsing pipeline. Block one, allow the other — or configure them completely independently.

Does Google-Extended affect my Google Search ranking?

No. Google's documentation is explicit: "Google-Extended does not impact a site's inclusion in Google Search nor is it used as a ranking signal in Google Search." Setting Disallow: / for Google-Extended only removes your content from Gemini model training datasets and from grounding responses in Gemini Apps and Vertex AI. Your organic rankings in Google Search are determined by Googlebot's standard crawl, which Google-Extended does not control.

Can I allow AI crawlers only to product pages and block them from my blog?

Yes. The longest-match rule in RFC 9309 makes this precise. For example:

User-agent: GPTBot
Allow: /products/
Allow: /collections/
Disallow: /

The /products/ Allow (10 characters) is more specific than the / Disallow (1 character), so product pages are crawled and blog posts are not. Test with: curl -A "GPTBot" -I https://yourstore.com/products/widget-pro to verify server-side behavior, then confirm in Google Search Console that your robots.txt is parsed as intended.
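The longest-match comparison can be sketched in a few lines of Python. This is a simplified model for one user-agent group: it handles plain path prefixes only (no * or $ wildcards, no percent-encoding normalization), and the /blog/launch-post path is a made-up example.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Evaluate one group's rules against `path` using longest-match.

    The matching rule with the longest path wins; on a tie, Allow wins;
    if nothing matches, the path is allowed.
    """
    best_len = -1
    best_allow = True
    for directive, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue  # empty or non-matching patterns are ignored
        allow = directive.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and allow):
            best_len, best_allow = len(pattern), allow
    return best_allow


GPTBOT_RULES = [
    ("Allow", "/products/"),
    ("Allow", "/collections/"),
    ("Disallow", "/"),
]

print(len("/products/"))                                  # 10
print(is_allowed(GPTBOT_RULES, "/products/widget-pro"))   # True
print(is_allowed(GPTBOT_RULES, "/blog/launch-post"))      # False
```

Because the decision depends on pattern length rather than position, reordering the three rules in the file does not change the outcome under this model.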

How do I verify a bot is really from OpenAI (or Anthropic, Perplexity, etc.)?

User-agent strings can be spoofed. Verification requires checking the source IP against published IP ranges:

1. Find the IP address of the request in your access logs.
2. Check it against the vendor's published JSON list (e.g., https://openai.com/gptbot.json for GPTBot).
3. For Google: perform a reverse DNS lookup (host IP_ADDRESS). The result should resolve to *.googlebot.com or *.google.com. Then forward-verify that the resolved hostname points back to the original IP.

If the IP is not in the published range and does not pass reverse DNS verification, the user-agent string is spoofed. Treat it as an unauthorized crawler and block at the firewall level.
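The two checks above can be sketched with the Python standard library. Several things here are assumptions: the JSON shape (a top-level "prefixes" list of objects with "ipv4Prefix" or "ipv6Prefix" keys) should be confirmed against the vendor's actual file, the sample ranges are documentation-only addresses, and the forward lookup via socket.gethostbyname_ex covers IPv4 only.

```python
import ipaddress
import json
import socket


def prefixes_from_json(doc: str) -> list[str]:
    """Extract CIDR strings from a vendor range file (schema is an assumption)."""
    cidrs = []
    for entry in json.loads(doc).get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            cidrs.append(cidr)
    return cidrs


def ip_in_published_ranges(ip: str, cidrs: list[str]) -> bool:
    """True if `ip` falls inside any published CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(c, strict=False) for c in cidrs)


def passes_reverse_dns(ip, allowed_suffixes=(".googlebot.com", ".google.com"),
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex) -> bool:
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""
    try:
        hostname = reverse(ip)[0].lower().rstrip(".")
        if not hostname.endswith(allowed_suffixes):
            return False
        return ip in forward(hostname)[2]
    except OSError:  # lookup failures count as "not verified"
        return False


# Made-up sample using documentation address ranges, not real vendor data.
SAMPLE = '{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}'
ranges = prefixes_from_json(SAMPLE)
print(ip_in_published_ranges("192.0.2.77", ranges))
print(ip_in_published_ranges("198.51.100.1", ranges))
```

The reverse and forward resolvers are parameters so the DNS check can be exercised offline; in production they default to the real socket lookups.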

What about agents.txt — is it real yet?

Not yet, in any implemented form. As of this writing, two competing proposals use the "agents.txt" label. An IETF individual Internet-Draft (draft-srijal-agents-policy-00, October 2025) proposes a strict policy file at /agents.txt with SHA-256 hash verification and a fail-closed approach (malformed file = fully restricted). A separate community proposal by Jasper van Veen (originally agents.txt, renamed to agent-manifest.txt in March 2026) covers richer capability declaration including API endpoints, action permissions, and MCP server discovery. Neither proposal has been adopted by any major AI vendor. Zero crawlers currently implement either spec. Deploy robots.txt as your primary bot policy control. Place an experimental manifest at /agents.json if you want to signal capability early. Monitor https://datatracker.ietf.org/doc/draft-srijal-agents-policy/ for standards progression.

§12 · Step-by-Step

Configuring robots.txt for agent commerce, in five steps.

Each step mirrors the HowTo JSON-LD at the top of this page word for word. Execute in order — the output of each step is the input to the next.

Step 1 — Inventory bots in your logs

Pull the last 30 days of access logs and extract all user-agent strings that include known AI bot tokens. Use the grep pattern from the Verification section. Create a spreadsheet with columns: token, request count, paths accessed, source IPs. This baseline tells you which bots are actually hitting your site today and whether any paths you expected to be accessible are getting 403s or redirects.
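One way to build that baseline is a short Python script over the access log. The sketch assumes the common "combined" log format used by default in Nginx and Apache; the sample lines, IPs, and simplified user-agent strings are invented for illustration, and the token list mirrors the grep pattern used earlier on this page.

```python
import re
from collections import Counter, defaultdict

AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User", "Amazonbot",
    "Amzn-SearchBot", "Amzn-User", "meta-externalagent", "Bytespider",
    "CCBot", "Applebot", "Diffbot",
]
CANONICAL = {t.lower(): t for t in AI_TOKENS}
TOKEN_RE = re.compile("|".join(re.escape(t) for t in AI_TOKENS), re.IGNORECASE)

# Combined log format: ip - - [time] "METHOD path proto" status bytes "referer" "ua"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def inventory(lines):
    """Group requests by AI bot token: count, paths, source IPs, status codes."""
    rows = defaultdict(lambda: {"requests": 0, "paths": Counter(),
                                "ips": set(), "statuses": Counter()})
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not in combined format
        hit = TOKEN_RE.search(m["ua"])
        if not hit:
            continue  # no known AI token in the user-agent
        row = rows[CANONICAL[hit.group(0).lower()]]
        row["requests"] += 1
        row["paths"][m["path"]] += 1
        row["ips"].add(m["ip"])
        row["statuses"][m["status"]] += 1
    return dict(rows)


# Invented sample lines; in practice iterate over open("/var/log/nginx/access.log").
SAMPLE_LOG = [
    '203.0.113.10 - - [01/Mar/2026:10:00:00 +0000] "GET /products/widget-pro HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.10 - - [01/Mar/2026:10:00:05 +0000] "GET /admin/ HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Mar/2026:10:01:00 +0000] "GET /products/widget-pro HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '192.0.2.44 - - [01/Mar/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

for token, row in sorted(inventory(SAMPLE_LOG).items()):
    print(token, row["requests"], dict(row["paths"]), sorted(row["ips"]), dict(row["statuses"]))
```

The status counter is there to surface the 403s and redirects mentioned above; each row maps directly onto one spreadsheet line.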

Step 2 — Classify each bot into buyer / indexer / training

Using the Bot Directory table and the taxonomy summary, assign each active bot to one of the three categories. Flag any unrecognized tokens for manual research — look them up against the vendor's official documentation, not community blocklists. Pay attention to bots where the classification is ambiguous (e.g., meta-externalagent is described by Meta as covering both AI indexing and training).

Step 3 — Draft your allow/block policy

For each category, apply the commerce-default rules from this document: Allow all buyers, Allow all indexers, make a deliberate choice on training crawlers (all-block, selective path allow, or full allow). Document the business rationale for each training crawler decision — you will revisit this as AI commerce matures. Decide whether you want Google-Extended and Applebot-Extended Allowed (your content may improve future Gemini/Apple Intelligence responses) or Disallowed (your content stays out of those training pipelines but you remain in Google Search and Spotlight).
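Keeping the policy as data, with the rationale attached, makes later revisions easier to audit. The following Python sketch renders a robots.txt from such a policy; the group names, the token-to-category assignment, the rationale text, and the sitemap URL are all placeholders to be replaced with your own decisions.

```python
# Each group: which tokens it covers, the rules to emit, and why.
# Assignments below are illustrative, not a recommendation for any specific bot.
POLICY = [
    {"name": "buyers", "rationale": "live user sessions, always allow",
     "tokens": ["ChatGPT-User", "Claude-User"],
     "rules": [("Allow", "/")]},
    {"name": "indexers", "rationale": "feeds AI answer surfaces, always allow",
     "tokens": ["OAI-SearchBot", "PerplexityBot"],
     "rules": [("Allow", "/")]},
    {"name": "training-selective", "rationale": "catalog only, revisit quarterly",
     "tokens": ["GPTBot"],
     "rules": [("Allow", "/products/"), ("Allow", "/collections/"), ("Disallow", "/")]},
    {"name": "training-blocked", "rationale": "compliance history of concern",
     "tokens": ["Bytespider"],
     "rules": [("Disallow", "/")]},
]


def render_robots_txt(policy, sitemap_url: str) -> str:
    """Render policy groups, then a wildcard allow, then the Sitemap line."""
    lines = []
    for group in policy:
        lines.append(f"# {group['name']}: {group['rationale']}")
        lines.extend(f"User-agent: {token}" for token in group["tokens"])
        lines.extend(f"{directive}: {path}" for directive, path in group["rules"])
        lines.append("")  # blank line closes the group
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Hypothetical sitemap location.
ROBOTS_TXT = render_robots_txt(POLICY, "https://www.yourstore.com/sitemap.xml")
print(ROBOTS_TXT)
```

The rationale ends up as a comment above each group, so the reason for every training-crawler decision travels with the deployed file.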

Step 4 — Deploy robots.txt + cross-references

Replace or update your robots.txt with the complete runnable file from the Reference robots.txt section, adjusted for your actual paths. Add the Sitemap: directive with your product sitemap URL. Publish /llms.txt linking to your key product pages. Publish /agents.json with your store's API surface. Verify robots.txt is accessible at exactly https://www.yourstore.com/robots.txt with Content-Type: text/plain and UTF-8 encoding. Submit the updated robots.txt in Google Search Console for immediate re-parse.
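The accessibility checks in this step can be scripted. Below is a minimal Python sketch: check_robots_response is a hypothetical helper that inspects a status code, Content-Type header, and body, and fetch_and_check wraps it around a real HTTP request. The 500 KiB figure reflects the minimum parsing limit RFC 9309 requires of crawlers; the Sitemap check is a convenience added here, not a requirement of the standard.

```python
import urllib.request

MAX_BYTES = 500 * 1024  # RFC 9309: crawlers must parse at least 500 KiB


def check_robots_response(status: int, content_type: str, body: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"expected text/plain, got {media_type or 'no Content-Type'}")
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
        text = ""
    if len(body) > MAX_BYTES:
        problems.append("file exceeds 500 KiB; crawlers may ignore the tail")
    if text and "sitemap:" not in text.lower():
        problems.append("no Sitemap: directive found")
    return problems


def fetch_and_check(url: str) -> list[str]:
    """Fetch robots.txt over the network and run the checks.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(url, timeout=10) as resp:
        return check_robots_response(
            resp.status, resp.headers.get("Content-Type", ""), resp.read()
        )
```

A call such as fetch_and_check("https://www.yourstore.com/robots.txt") after deployment returns an empty list when all checks pass.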

Step 5 — Verify with curl, logs, and vendor tools

Run curl tests for each major bot on representative URLs: your homepage, a product page, a collection page, and any admin or private paths that should be blocked. Check Google Search Console's robots.txt report for parse errors. Pull logs 48–72 hours after deployment to confirm that AGENT BUYER tokens are appearing on product pages and that training crawler tokens are not. Set a calendar reminder to re-check vendor bot documentation quarterly — user-agent strings and version numbers change, and new bots are introduced without notice.

The Window

The window for getting agent discovery right is now.

Every quarter, the floor moves up. User-agent tokens once ignored by bots are now enforced. Google-Extended is now a real signal for Gemini grounding eligibility. OAI-SearchBot is indexing at scale. The merchants who configure their bot policy correctly now — allowing buyers and indexers unconditionally, making deliberate choices on training crawlers, and deploying the full three-file discovery triad — get compounding benefit from every new AI commerce channel that launches. The merchants who ship a blanket block or leave their default robots.txt will spend the back half of 2026 debugging why their products don't surface in AI shopping results.
