§1 · Discovery as Problem #1
Why agents can't find products — and why robots.txt is the first gate.
The AgentMall Roadmap frames the agent-commerce problem as three sequential failures. Failure #1: agents can't find products. Failure #2: products aren't machine-readable. Failure #3: there's no agent-native checkout. The 4-Layer Agent-Ready Model (Structured Data → API → MCP → UCP) addresses Failures #2 and #3 — but only after an agent has found the page. Discovery is the precondition for everything else.
Traditional SEO assumes a specific journey: human → search engine → ranked link → browser click. In that model, robots.txt is a contract between you and Googlebot. AI agents break that assumption at two points. First, agents act; they don't merely retrieve. An AI agent completing a shopping task may parse prices, compare attributes, add to cart, or pass structured data to another agent downstream. None of that can happen unless the agent can find your product data first. Second, discovery now happens through multiple pipes simultaneously. A product page may be surfaced to a human by Google Search, to a ChatGPT user by OAI-SearchBot's index, to a Perplexity user by PerplexityBot's index, and retrieved live by Claude-User during an active shopping session, all in the same hour. These pipes use different user-agents, obey robots.txt independently, and can be controlled separately.
Failure #1 · Discovery
Agents can't find products
robots.txt is the gate. Block the wrong bots and your entire catalog disappears from AI commerce before a single structured-data field is read. This spoke owns the outermost layer.
Failure #2 · Readability
Products aren't machine-readable
Schema.org Product blocks, GTIN, AggregateRating — Layer 1 of the 4-Layer Model. Covered in depth on the Product Data spoke.
Failure #3 · Checkout
No agent-native checkout
Layer 4 / UCP — the agent-native checkout state machine. Covered on the UCP spoke. None of it fires until Discovery is open.
The Discovery Quadrant
This spoke (robots.txt) is one corner of a four-corner discovery stack: robots.txt controls which bots can access which paths; the Agent-Readable Sitemap gives AI crawlers a structured URL inventory; Agent SEO covers the ranking signals AI agents use once they've found you; and the /agents page declares what actions agents can take. Build them in order. robots.txt first — a perfect sitemap is useless if OAI-SearchBot is blocked from reading it.
§2 · Bot Taxonomy
Three categories. Three policies. Never a blanket block.
The single most expensive mistake in AI commerce robots.txt configuration is treating all AI bots as a single category. They are not. The three-bucket taxonomy below maps to fundamentally different policy defaults. Allowlisting all AI bots (naive) and blocking all AI bots (equally naive) are both wrong for a commerce site.
Category 1: AGENT BUYERS — Allow unconditionally
These bots fetch content on a real user's behalf during an active task. Not used for automated crawls or model training. The user is waiting. Blocking these bots means your page never enters that user's decision loop — the equivalent of a physical store locking its door the moment a customer arrives.
| Bot | User-Agent Token | Vendor | Commerce Policy |
| ChatGPT browsing | ChatGPT-User | OpenAI | Allow: / — unconditionally |
| Claude browsing | Claude-User | Anthropic | Allow: / — unconditionally |
| Perplexity browsing | Perplexity-User | Perplexity | Allow: / (ignores robots.txt anyway per vendor docs) |
| Amazon / Alexa queries | Amzn-User | Amazon | Allow: / — unconditionally |
Critical · Perplexity-User Note
Perplexity's documentation states that Perplexity-User "generally ignores robots.txt rules when a user requested the fetch." In practice the bot will visit your page whether or not you list it, and you may not be able to block it through robots.txt alone even if you wanted to. Keep the Allow rule in place regardless.
Category 2: AGENT INDEXERS — Allow broadly
These bots crawl the web to build indexes that AI answer surfaces draw from. Not used for model training. Functionally analogous to Googlebot but for AI answer layers — blocking them has the same effect as blocking Googlebot: you disappear from the results.
| Bot | User-Agent Token | Vendor | Commerce Policy |
| ChatGPT search index | OAI-SearchBot | OpenAI | Allow: / — broadly |
| Perplexity index | PerplexityBot | Perplexity | Allow: / — broadly |
| Claude search index | Claude-SearchBot | Anthropic | Allow: / — broadly |
| Gemini training/grounding token | Google-Extended | Google | Allow: / (control token only — see note) |
| Spotlight / Siri / Safari | Applebot | Apple | Allow: / — broadly |
| Apple Intelligence training token | Applebot-Extended | Apple | Allow: / to stay in Apple Intelligence; Disallow: / to opt out |
| Alexa / Rufus search | Amzn-SearchBot | Amazon | Allow: / — broadly |
Google-Extended Is a Control Token, Not a Crawler
Google-Extended does not send independent HTTP requests. It is a robots.txt control token that instructs Google's existing crawlers (which have already crawled your content) whether that content may be used for Gemini model training and grounding. Disallowing Google-Extended does not affect Google Search rankings — Google's own documentation is explicit on this point. It only removes your content from Gemini Apps and Vertex AI grounding responses. Keep it Allowed unless you have a specific reason to opt out of Gemini grounding. (Re-verify before launch.)
Category 3: AI TRAINING CRAWLERS — Operator's call
These bots crawl content to train generative models. Not used for search indexing or user-task retrieval. Blocking them does NOT remove you from AI search — they use different user-agent tokens than the indexers and buyers and serve a completely separate function. See §7 for the full SEO-vs-training tension analysis before deciding.
| Bot | User-Agent Token | Vendor | Default Commerce Posture |
| OpenAI training | GPTBot | OpenAI | Disallow: / (optional — see §7 for the nuance) |
| Anthropic training | ClaudeBot | Anthropic | Disallow: / (optional) |
| Meta AI training | meta-externalagent | Meta | Disallow: / (optional) |
| Amazon training | Amazonbot | Amazon | Disallow: / (optional) |
| ByteDance training | Bytespider | ByteDance | Disallow: / + firewall rule (known compliance issues) |
| Common Crawl training | CCBot | Common Crawl | Disallow: / (optional — feeds open-source models) |
§3 · Full Bot Directory
18 bots, verified against vendor docs. User-agent strings included.
The table below covers all 18 AI bots relevant to commerce operators as of this writing. User-agent string version numbers change without notice — always re-verify against vendor documentation before deploying. The user-agent token (the short keyword) is the robots.txt matching key; the full string appears in HTTP headers and server logs. (Re-verify all user-agent strings before launch.)
Tip · Legacy Anthropic Tokens
The user-agent tokens anthropic-ai and claude-web appear in community blocklists and older log files. Anthropic's current documentation lists ClaudeBot, Claude-User, and Claude-SearchBot as the three primary bots — neither legacy token is listed as active. If you see them in your own logs, add Disallow rules for those tokens and treat them as training crawlers. (Re-verify before launch.)
Full user-agent strings (for log matching)
User-agent tokens are the keys for robots.txt rules. Full strings appear in server logs. Match by substring — version numbers change. (Re-verify all strings before launch.)
| Token | Full User-Agent String (re-verify before launch) |
| ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot |
| OAI-SearchBot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot |
| GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot |
| Claude-User | Claude-User (full string not published by Anthropic; token is the key) |
| Claude-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-SearchBot/1.0; +Claude-SearchBot@anthropic.com) |
| ClaudeBot | ClaudeBot (token; full string varies, so match by substring) |
| Perplexity-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user) |
| PerplexityBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) |
| Applebot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) |
| Amzn-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amzn-User/0.1) Chrome/119.0.6045.214 Safari/537.36 |
| Amzn-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amzn-SearchBot/0.1) Chrome/119.0.6045.214 Safari/537.36 |
| Amazonbot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36 |
| meta-externalagent | meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) |
| Bytespider | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com) |
| CCBot | CCBot/2.0 (https://commoncrawl.org/faq/) |
§4 · robots.txt Syntax for Commerce
RFC 9309 rules that matter for commerce configuration.
robots.txt was formalized as IETF RFC 9309 in September 2022, codifying rules that most major bots already follow. The rules below have direct commerce implications — getting any of them wrong can silently open paths you meant to close or close paths you meant to open.
| Rule | RFC 9309 Reference | Commerce Implication |
| Longest-match wins | RFC 9309 §2.2.2: "The most specific match found MUST be used." | You can open /products/ to a training crawler while closing everything else with Disallow: /. The 10-character /products/ beats the 1-character /. |
| Allow beats Disallow on ties | RFC 9309 §2.2.2: when an Allow rule and a Disallow rule are equivalent, the Allow rule should be used. Google implements this explicitly. | If you have both Allow: /products and Disallow: /products (same length), the Allow wins. Google confirms this behavior in its own crawler documentation. |
| User-agent matching is case-insensitive | RFC 9309 §2.2.1 | User-agent: GPTBot and User-agent: gptbot are equivalent. Use the vendor-documented casing for clarity, not correctness. |
| Paths are case-sensitive | RFC 9309 §2.2.2 | Disallow: /Private/ does NOT match /private/. Verify the actual casing your server uses before writing Disallow rules for sensitive paths. |
| User-agent groups are independent | RFC 9309 §2.2.1: a group starts at its User-agent line(s) and runs until the next User-agent line that follows a rule. Blank lines are permitted but do not end a group under RFC 9309 (they did in the original 1994 convention). | A rule for GPTBot does not apply to ChatGPT-User even though both are from OpenAI. Different user-agent tokens, different groups, different rules. Configure them independently. |
| Wildcards in paths | RFC 9309 §2.2.3: * matches any character sequence; $ anchors the end of the path | Disallow: /admin* blocks /admin, /admin/, and /adminpanel. Disallow: /*.pdf$ blocks only URLs ending in .pdf. |
| Sitemap directive | Not part of the RFC 9309 user-agent/rule grammar; universally supported extension | Place Sitemap: lines outside all group blocks (at the end of the file or after all group rules). Include the absolute URL. Multiple Sitemap lines are valid. AI indexers use this to accelerate product discovery. |
| File constraints | RFC 9309 §2.5: crawlers must parse at least 500 KiB | Serve at exactly /robots.txt (case-sensitive path). UTF-8 encoding required. Google states a 500 KiB maximum parseable size. Keep yours under 100 KiB in practice. |
| Crawl-delay | Not defined by RFC 9309 | Anthropic's ClaudeBot supports it; Google and most major crawlers ignore it. Use it only as a courtesy signal, not as a primary rate-limit mechanism. |
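The matching rules in the table can be sketched as a small evaluator. This is a minimal illustration, not any vendor's parser: the function names `parse_groups` and `is_allowed` are hypothetical, rule length is measured in characters rather than octets, and percent-encoding normalization is skipped.

```python
import re

def parse_groups(text):
    """Return {lowercased user-agent token: [(directive, pattern), ...]}."""
    groups = {}
    current_agents = []   # tokens of the group currently being read
    seen_rule = False     # True once the current group has at least one rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()        # drop comments
        if ":" not in line:
            continue                               # blank or malformed line
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if seen_rule:                          # a rule came before: new group
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())   # tokens match case-insensitively
            groups.setdefault(value.lower(), [])
        elif key in ("allow", "disallow") and current_agents:
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((key, value)) # paths keep their casing
    return groups

def _matches(pattern, path):
    """Prefix match with * (any sequence) and a trailing $ (end anchor)."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None

def is_allowed(groups, user_agent, path):
    """Longest matching rule wins; Allow wins a tie; no match means allowed."""
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if pattern and _matches(pattern, path):
            is_allow = directive == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

SAMPLE = """
User-agent: GPTBot
Allow: /products/
Disallow: /

User-agent: *
Disallow: /Private/
Disallow: /*.pdf$
"""

groups = parse_groups(SAMPLE)
print(is_allowed(groups, "gptbot", "/products/widget-pro"))    # True: longer Allow wins
print(is_allowed(groups, "GPTBot", "/blog/launch"))            # False: only Disallow: / matches
print(is_allowed(groups, "ChatGPT-User", "/private/notes"))    # True: path casing differs
print(is_allowed(groups, "ChatGPT-User", "/files/spec.pdf"))   # False: matches /*.pdf$
```

Running your own rules through a sketch like this before deploying makes tie and casing surprises visible early, but the bot's own parser remains the final authority.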
Critical · Blocking by IP Breaks robots.txt Reading
Blocking a vendor's published IP ranges at the firewall prevents that bot from reading your robots.txt — Anthropic explicitly warns that IP blocking "may impede Anthropic's ability to read robots.txt." A bot that cannot read robots.txt may treat your site as unconstrained. Use user-agent token rules in robots.txt as your primary control. Use IP blocking only as a supplementary layer for confirmed bad-actor crawlers (e.g., Bytespider after verifying disrespect in your logs).
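The "unconstrained" risk follows from how RFC 9309 §2.3.1 treats a failed robots.txt fetch: a 4xx answer counts as "unavailable" and the crawler may access any resource, while a 5xx answer or a network failure counts as "unreachable" and the crawler must assume a complete disallow. The sketch below encodes that mapping; the function name and return labels are hypothetical, and individual vendors may deviate from the RFC.

```python
def crawler_policy_for_robots_fetch(status_code):
    """Map the outcome of fetching /robots.txt to the policy RFC 9309 describes.

    status_code is the final HTTP status, or None for a network failure
    (timeout or dropped connection, which is how a silent firewall drop looks).
    """
    if status_code is None or 500 <= status_code <= 599:
        return "assume-disallow-all"     # "unreachable": treat everything as disallowed
    if 400 <= status_code <= 499:
        return "may-crawl-everything"    # "unavailable": no rules apply
    if 200 <= status_code <= 299:
        return "parse-and-obey-rules"    # normal case
    return "other"                       # e.g. redirects, which crawlers follow first

print(crawler_policy_for_robots_fetch(403))   # may-crawl-everything
print(crawler_policy_for_robots_fetch(None))  # assume-disallow-all
```

A firewall that answers a vendor's IP range with 403 therefore lands in the case where your Disallow rules are never read.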
§5 · Complete Reference robots.txt
A full, deployable file — copy, adapt paths, re-verify versions.
The file below is a complete, runnable robots.txt for an agent-friendly commerce site. Every user-agent token in the file is documented in §3. Adapt the Sitemap URLs and any path-specific rules to match your actual store structure. Re-verify version numbers in user-agent strings before launch — vendors update them without announcement.
robots.txt — Full Allow for Buyers + Indexers, Block for Training Crawlers
# ================================================================
# robots.txt — Agent-Friendly Commerce Configuration
# Last structural review: see Source metadata in your CMS
# Cross-references:
# /llms.txt — AI-friendly content layout (see Agent Sitemap spoke)
# /agents.json — Agent capability manifest (see /agents page spoke)
# ================================================================
# ----------------------------------------------------------------
# CATEGORY 1: AGENT BUYERS
# These bots fetch pages on a live user's behalf.
# Block them = you disappear from AI-assisted shopping sessions.
# ----------------------------------------------------------------
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Amzn-User
Allow: /
# ----------------------------------------------------------------
# CATEGORY 2: AGENT INDEXERS
# These bots build the indexes that AI surfaces draw from.
# Block them = you disappear from AI answer layers.
# Google-Extended and Applebot-Extended are control tokens,
# not independent crawlers. Allow them to stay in Gemini and Apple Intelligence.
# ----------------------------------------------------------------
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot
Allow: /
# Applebot-Extended: control token for Apple AI training.
# Allow here = your content can improve Apple Intelligence.
# Change to Disallow: / to opt out of Apple model training
# while keeping Spotlight/Siri search inclusion.
User-agent: Applebot-Extended
Allow: /
User-agent: Amzn-SearchBot
Allow: /
User-agent: Diffbot
Allow: /
# ----------------------------------------------------------------
# CATEGORY 3: AI TRAINING CRAWLERS
# These bots train generative models.
# Commerce note: blocking does NOT remove you from AI search.
# See the SEO-vs-Training section below — consider allowing
# /products/ paths if you want your product data in future
# training sets. Current config: full block.
# ----------------------------------------------------------------
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: Amazonbot
Disallow: /
# Bytespider has a documented history of ignoring robots.txt.
# Also add a firewall rule at your CDN as a backup layer.
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
# Legacy/alternate Anthropic tokens — add if seen in logs
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
# ----------------------------------------------------------------
# CATCH-ALL: Standard crawlers (Googlebot, Bingbot, etc.)
# Allow everything not covered above.
# ----------------------------------------------------------------
User-agent: *
Allow: /
# ----------------------------------------------------------------
# SITEMAP (see Agent Sitemap spoke for full sitemap strategy)
# ----------------------------------------------------------------
Sitemap: https://www.yourstore.com/sitemap.xml
Sitemap: https://www.yourstore.com/sitemap-products.xml
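Before deploying, it can help to check that each token resolves to the policy you intended. The sketch below uses Python's standard-library `urllib.robotparser` on a shortened excerpt of the file above; the `EXPECTED` table and the `check_policy` helper are illustrative names, not part of any tool. One caveat: as far as I know, `robotparser` applies rules in file order rather than by RFC 9309 longest match, which makes no difference for single-rule groups like these.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the reference file: one bot per category plus the catch-all.
ROBOTS_EXCERPT = """\
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Intended outcome per token (True = may fetch). Googlebot has no group of its
# own here, so it exercises the catch-all.
EXPECTED = {
    "ChatGPT-User": True,
    "OAI-SearchBot": True,
    "GPTBot": False,
    "Googlebot": True,
}

def check_policy(robots_text, expected, url):
    """Return {token: actual} for every token whose result differs from expected."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    mismatches = {}
    for token, want in expected.items():
        got = parser.can_fetch(token, url)
        if got != want:
            mismatches[token] = got
    return mismatches

problems = check_policy(
    ROBOTS_EXCERPT, EXPECTED, "https://www.yourstore.com/products/widget-pro"
)
print(problems or "all tokens resolve as intended")
```

An empty result means the excerpt behaves as intended for these four tokens; extend `EXPECTED` to cover every token you deploy.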
Partial-allow variant — open product pages to training crawlers
If you want training crawlers to learn your product catalog (normalizing your brand, products, and prices into future model knowledge) while protecting blog content, the longest-match rule makes this precise. The /products/ Allow (10 characters) beats the / Disallow (1 character), so product pages are crawled and your blog is not.
Partial-Allow Variant — Products Open, Everything Else Blocked
User-agent: GPTBot
Allow: /products/
Allow: /collections/
Disallow: /
User-agent: ClaudeBot
Allow: /products/
Allow: /collections/
Disallow: /
User-agent: meta-externalagent
Allow: /products/
Allow: /collections/
Disallow: /
Tip · Match Your Platform's URL Structure
The paths /products/ and /collections/ match Shopify's default URL structure. WooCommerce typically uses /product/ (singular) and /product-category/. BigCommerce uses /products/ and /categories/. Verify your store's actual paths before deploying — case-sensitive matching applies. See platform-specific retrofit guides: Shopify, WooCommerce, BigCommerce, Headless, Etsy.
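Because the open paths differ by platform, the partial-allow groups can be generated rather than hand-edited. The sketch below is one way to do that, using only the paths named in the tip above; `PLATFORM_PATHS`, `TRAINING_TOKENS`, and `partial_allow_block` are hypothetical names, and the path table should be checked against your real store URLs.

```python
# Default catalog paths per platform, taken from the tip above (verify on your store).
PLATFORM_PATHS = {
    "shopify": ["/products/", "/collections/"],
    "woocommerce": ["/product/", "/product-category/"],
    "bigcommerce": ["/products/", "/categories/"],
}

# Training crawlers used in the partial-allow variant.
TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "meta-externalagent"]

def partial_allow_block(platform, tokens=TRAINING_TOKENS):
    """Build one group per token: Allow the catalog paths, Disallow the rest."""
    paths = PLATFORM_PATHS[platform]
    groups = []
    for token in tokens:
        lines = [f"User-agent: {token}"]
        lines += [f"Allow: {path}" for path in paths]
        lines.append("Disallow: /")
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"

print(partial_allow_block("woocommerce"))
```

Generating the block keeps the three groups identical, which avoids the common slip of updating one crawler's paths and forgetting the others.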
§6 · The Discovery Triad
robots.txt + llms.txt + /agents.json — three files, one discovery layer.
robots.txt is the outermost gate, but it is only one of three files that form the complete discovery layer for an agent-ready commerce site. The three specifications are additive, not competing — each answers a different question for an AI agent arriving at your domain.
| File | Question It Answers | Current Status | Deploy Priority |
| /robots.txt | Can you access this path at all? Bot access policy: which crawlers can fetch which URL patterns. | Published standard: RFC 9309 (IETF, September 2022). Universally implemented. | Deploy now (required) |
| /llms.txt | What is worth reading? AI-friendly content layout: curated Markdown index of key pages for inference-time context. | Community proposal by Jeremy Howard at llmstxt.org, September 2024. No formal standards body adoption. Confirmed live 200 at: Anthropic developer docs, Stripe, Cloudflare developer docs, Cursor. (Re-verify before launch.) | Deploy now (high value) |
| /agents.json (or /agent-manifest.txt) | What can you do here? Agent capability manifest: what actions agents can take, what APIs/MCP servers exist, how to authenticate. | Multiple competing proposals; no adopted standard. Covered in full on the /agents page spoke. | Deploy as experimental (optional) |
agents.txt — honest status
Two parallel proposals exist under the "agents.txt" label, and they are not the same specification.
IETF Internet-Draft draft-srijal-agents-policy-00 (filed October 2025 by Srijal Dutta): A strict plaintext policy file at /agents.txt with mandatory SHA-256 hash verification of the file's canonical content. Files with missing or mismatched hashes are treated as fully restrictive (all access denied). The draft is an individual submission — not IETF-endorsed, no assigned working group, and set to expire April 2026. Zero crawlers currently implement it. (Status volatile; re-verify before launch.)
agent-manifest.txt (originally agents.txt, renamed March 2026 by Jasper van Veen): A richer capability manifest covering action permissions, API/MCP server discovery, agent identity tiers, and authentication methods. Currently at Draft v0.3.0, licensed CC BY 4.0, hosted at github.com/jaspervanveen/agents-txt. No vendor implementation. (Status volatile; re-verify before launch.)
| Proposal | File Path | Scope | Vendor Implementation |
| IETF draft-srijal-agents-policy-00 | /agents.txt | Access policy + SHA-256 hash verification | Zero (individual IETF draft, no working group) |
| agent-manifest.txt (van Veen, Draft v0.3.0) | /agent-manifest.txt | Action permissions + API/MCP discovery + auth methods | Zero (community proposal) |
| robots.txt (RFC 9309) | /robots.txt | Path-level crawl access | Universal (all major AI vendors) |
Practical Advice for Commerce Operators
Do not block the path /agents.txt in robots.txt — you may interfere with future spec adoption. If you want to experiment with capability declaration before a standard is adopted, place a static JSON manifest at /agents.json and reference it from your /llms.txt file. This is compatible with any future standard that emerges. The /agents page spoke covers the full capability manifest pattern including MCP endpoint advertising and OAuth flow declaration.
llms.txt — brief frame for commerce
Jeremy Howard published the llms.txt proposal in September 2024 at llmstxt.org. The proposal asks sites to place a Markdown file at /llms.txt that provides LLM-friendly content: an H1 title, a blockquote summary, and organized links to key pages, optionally with .md versions of those pages for clean parsing. Deep coverage of llms.txt structure, hosting, and verification belongs in the Agent-Readable Sitemap spoke. The one robots.txt implication: make sure your robots.txt does not block AI indexers from reading /llms.txt itself — it would be self-defeating to publish an AI-friendly content guide and then block the bots that would use it.
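To make the shape of the file concrete, the sketch below renders a minimal llms.txt from a title, a summary, and grouped links, following the H1, blockquote, and link-list layout described above. The `render_llms_txt` helper, the store name, and the URLs are all placeholders for illustration; the Agent-Readable Sitemap spoke remains the reference for the full format.

```python
def render_llms_txt(title, summary, sections):
    """Render an llms.txt body.

    sections maps a heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"

# Placeholder content: replace with your real store name, summary, and URLs.
text = render_llms_txt(
    "Your Store",
    "Outdoor gear with structured product data for AI agents.",
    {
        "Products": [
            ("Widget Pro", "https://www.yourstore.com/products/widget-pro", "flagship widget"),
        ],
        "Policies": [
            ("Shipping", "https://www.yourstore.com/policies/shipping", "rates and delivery times"),
        ],
    },
)
print(text)
```

Whatever generates the file, confirm afterwards that no Disallow rule in robots.txt covers /llms.txt for the indexers you allow.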
§7 · SEO-vs-AI-Training Tension
Why commerce should not blanket-block training crawlers.
News publishers block training crawlers because their product is the article — letting GPTBot train on it means OpenAI can summarize the article without the reader visiting the publisher's site, and revenue evaporates. Commerce stores sell physical or digital goods. The product is the SKU. Training data has a fundamentally different effect on a commerce site.
Benefit · Allow Training
Brand normalization in model weights
Your brand, product names, prices, and attributes get encoded into model weights. Future AI users who ask "what is [your brand]?" get accurate answers. Product schema in training sets helps models understand your catalog structure.
Benefit · Allow Training
Open-source model ecosystem
Common Crawl (CCBot) data feeds hundreds of open-source models and research projects. Blocking CCBot removes your store from that entire ecosystem — including models that power future agent runtimes you cannot predict today.
Benefit · Block Training
Content control
Your product descriptions, pricing strategy, and editorial content are not feeding competitors' training sets. Reduces server load from large-scale crawls. Defensible if your product copy is a genuine differentiator.
The commerce-specific rule of thumb
Allow AGENT INDEXERS unconditionally (they surface you in AI answers). Allow AGENT BUYERS unconditionally (they complete purchases on users' behalf). Make a deliberate choice on training crawlers — not a reflexive block. If your product descriptions are generic (standard SKUs, commodity items), blocking training crawlers costs more than it saves. If your product copy is a genuine differentiator, a selective block using the partial-allow variant in §5 is defensible.
| Operator Situation | Recommended Training Crawler Posture | Rationale |
| Commodity SKUs, standard product descriptions | Allow training crawlers (or use partial-allow) | Brand normalization value exceeds content protection value; descriptions are not a competitive differentiator |
| Unique editorial product copy, proprietary descriptions | Partial-allow: Allow: /products/, Disallow: / for training crawlers | Allow product data for brand normalization; block blog/editorial content that is a differentiator |
| Aggressive content strategy (buyer guides, reviews) | Full block for training crawlers; allow all indexers | Content is core IP; blocking training does not affect AI search since indexers are allowed |
| Early-stage store building brand awareness | Allow all training crawlers | Brand normalization in early training sets pays compounding dividends; content protection is lower priority at this stage |
Critical · Blocking GPTBot ≠ Blocking ChatGPT Shopping
This is the single most common misconception in AI bot policy. GPTBot trains the underlying model. OAI-SearchBot builds the search index ChatGPT answers draw from. ChatGPT-User fetches your page live during a shopping session. These are three separate bots with three separate user-agent tokens. Blocking only GPTBot leaves OAI-SearchBot and ChatGPT-User fully active. Configure them independently — a robots.txt rule for one has zero effect on the others.
§8 · Verification
Confirming your rules work — curl, IP allowlists, log analysis.
Writing a robots.txt file is half the job. Verification confirms that the file is served correctly, that your server is not blocking bots at the CDN or firewall layer before they reach robots.txt, and that the right bot user-agent tokens are appearing in your logs on the right paths.
curl Tests
Simulate a specific bot's visit to see how your server, CDN, or firewall responds to that user-agent. Note: curl -A only sets the User-Agent header on the request; it does not evaluate robots.txt. Compliance with robots.txt is decided by the bot itself, so use these tests to detect server-side blocking, not to prove bot obedience.
curl Test Patterns
# Fetch robots.txt with response headers: expect HTTP 200 and Content-Type: text/plain
curl -si https://www.yourstore.com/robots.txt
# Check how the server/CDN answers GPTBot on the homepage (headers only)
curl -A "GPTBot" -I https://www.yourstore.com/
# Check that ChatGPT-User receives a product page rather than a block page
curl -A "ChatGPT-User" https://www.yourstore.com/products/widget-pro
# Check that OAI-SearchBot can fetch your sitemap
curl -A "OAI-SearchBot" https://www.yourstore.com/sitemap.xml
# Check Bytespider's status code: 403 if a firewall rule is active, otherwise 200
curl -A "Bytespider" -I https://www.yourstore.com/products/
Vendor IP Allowlists — Verifying Bot Identity
User-agent strings can be spoofed. Major vendors publish the IP ranges their bots crawl from, and some (Google, for example) also support reverse DNS verification. Use either method to confirm that traffic claiming to be a legitimate bot actually comes from that vendor.
Reverse DNS Verification (Googlebot)
# Step 1: Reverse DNS lookup of the suspicious IP
host <IP-FROM-LOG>
# Should resolve to *.googlebot.com or *.google.com
# Step 2: Forward-verify — resolved hostname should map back to original IP
host <resolved-hostname>
# Result should = original IP
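The two-step check above can also be scripted for batches of log IPs. The sketch below is a minimal version using Python's `socket` module; `is_verified_crawler` and `GOOGLE_SUFFIXES` are hypothetical names, the forward lookup via `gethostbyname_ex` covers IPv4 only, and the lookup functions are injectable so the logic can be exercised without network access.

```python
import socket

# Hostname suffixes named in the steps above.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_crawler(ip, allowed_suffixes=GOOGLE_SUFFIXES,
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm."""
    try:
        hostname = reverse_lookup(ip)[0]                 # step 1: IP -> hostname
        if not hostname.lower().endswith(allowed_suffixes):
            return False                                 # wrong domain: likely spoofed
        return ip in forward_lookup(hostname)[2]         # step 2: hostname -> IPs
    except OSError:                                      # lookup failed (herror/gaierror)
        return False

# Offline demonstration with stand-in resolvers (example values, not real DNS data).
def fake_reverse(ip):
    return ("crawl-66-249-66-1.googlebot.com", [], [ip])

def fake_forward(hostname):
    return (hostname, [], ["66.249.66.1"])

print(is_verified_crawler("66.249.66.1", reverse_lookup=fake_reverse,
                          forward_lookup=fake_forward))   # True: both steps agree
print(is_verified_crawler("203.0.113.9", reverse_lookup=fake_reverse,
                          forward_lookup=fake_forward))   # False: forward lookup differs
```

Called with its defaults, the function performs live DNS queries, so cache results if you run it across a large log.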
Google Search Console
The robots.txt report in Google Search Console shows parse errors and which rules Google applied. The URL Inspection tool shows whether a specific URL is blocked by robots.txt. Submit your robots.txt for immediate re-parse after any change.
Cloudflare AI Bot Controls
Cloudflare's managed robots.txt (Security → Bots → Configure Bot Fight Mode → Instruct bot traffic with robots.txt) automatically prepends directives blocking known AI training crawlers. Useful as a base layer, but generates a blanket block — review the generated output carefully to ensure it does not block your AGENT BUYER and AGENT INDEXER categories. The Cloudflare-generated directives may include tokens that overlap with indexers depending on the version of the managed list. (Re-verify Cloudflare directive set before launch.)
Log Analysis Pattern
Parse your access logs for AI bot activity. Most declared AI bots include a recognizable substring in the user-agent header. Run this grep against your Nginx or Apache access log:
grep Pattern — All Known AI Bot Tokens
grep -iE "GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Amazonbot|Amzn-SearchBot|Amzn-User|meta-externalagent|Bytespider|CCBot|Applebot|Diffbot" /var/log/nginx/access.log
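Pipe to a frequency count by user-agent string to see which bots dominate your traffic. In the default combined log format the user-agent is the sixth double-quote-delimited field; splitting on whitespace and printing field 12 would return only the first word of the string (usually "Mozilla/5.0):
grep -iE "GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Amazonbot|Amzn-SearchBot|Amzn-User|meta-externalagent|Bytespider|CCBot|Applebot|Diffbot" /var/log/nginx/access.log \
| awk -F'"' '{print $6}' \
| sort | uniq -c | sort -rn \
| head -20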
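To roll the counts up by token and by the three categories from §2, a short script is easier to maintain than a longer pipeline. The sketch below assumes the default combined log format, where the user-agent is the last quoted field; the `CATEGORY` map and `count_bots` function are illustrative names, and placing Diffbot with the indexers simply mirrors its position in the reference file.

```python
import re
from collections import Counter

# Token -> category, following the §2 taxonomy. Control tokens (Google-Extended,
# Applebot-Extended) are omitted because they never appear in access logs.
CATEGORY = {
    "chatgpt-user": "buyer", "claude-user": "buyer",
    "perplexity-user": "buyer", "amzn-user": "buyer",
    "oai-searchbot": "indexer", "perplexitybot": "indexer",
    "claude-searchbot": "indexer", "applebot": "indexer",
    "amzn-searchbot": "indexer", "diffbot": "indexer",
    "gptbot": "training", "claudebot": "training",
    "meta-externalagent": "training", "amazonbot": "training",
    "bytespider": "training", "ccbot": "training",
}

TOKEN_RE = re.compile("|".join(re.escape(token) for token in CATEGORY), re.IGNORECASE)

def count_bots(lines):
    """Return (hits per token, hits per category) for combined-format log lines."""
    by_token, by_category = Counter(), Counter()
    for line in lines:
        parts = line.rstrip().rsplit('"', 2)    # [..., user-agent, ""]
        if len(parts) < 3:
            continue                            # not combined format
        match = TOKEN_RE.search(parts[1])       # first token found in the user-agent
        if match:
            token = match.group(0).lower()
            by_token[token] += 1
            by_category[CATEGORY[token]] += 1
    return by_token, by_category

# Usage (path is an example):
# with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
#     tokens, categories = count_bots(log)
#     print(tokens.most_common(20), dict(categories))
```

Each request is counted once even when the token appears twice in the string, as it does for GPTBot, whose user-agent also contains the URL openai.com/gptbot.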
§9 · Discovery Stack + Platform Context
Every layer of the AgentMall guide — cross-linked.
robots.txt is the outermost gate of a multi-layer agent-readiness stack. The three other discovery spokes, the platform retrofit guides, and the core protocol spokes are all linked below. Each one builds on top of what you configure here.
Discovery Siblings — the other three corners
Discovery · Spoke 2
Agent-Readable Sitemap
sitemap.xml extensions, the full llms.txt format and adoption table, /agents.json overview, and per-crawler crawl-behavior differences. Once robots.txt is open, this spoke covers the URL inventory layer.
Discovery · Spoke 3
Agent SEO
The ranking signals AI agents use once they've found you, labeled VERIFIED, INFERRED, or SPECULATION per the confidence-label discipline. GTIN is the single highest-leverage fix for most stores.
Discovery · Spoke 4
The /agents page
Capability manifest pattern: the 8 canonical fields, format choices (JSON/Markdown/hybrid), URL placement options, MCP endpoint advertising, and OAuth flow declaration. The full action surface declaration.
Platform Retrofit Guides — your specific stack
Platform
Shopify
Schema.org via Liquid, native MCP at /api/mcp, UCP catalog enrollment, Cart API checkout path.
Platform
WooCommerce
WooCommerce REST API, Schema.org extensions, custom MCP wrapper, agent-ready checkout flow.
Platform
BigCommerce
BigCommerce Catalog API, GraphQL Storefront API, Schema.org via Stencil, agent checkout handoff.
Platform
Headless
Headless CMS agent readiness: structured content, custom API surface, MCP server patterns for headless stacks.
Core Protocol Spokes
§10 · Common Mistakes
Eight robots.txt errors that cut you off from AI commerce.
1. Wrong user-agent token for ChatGPT browsing
Using ChatGPT as the user-agent token — it is not valid. The bot that fetches on a user's behalf uses the token ChatGPT-User. The training crawler uses GPTBot. These are two entirely separate bots with separate tokens, separate functions, and separate robots.txt rules. A rule written for ChatGPT matches nothing. A rule written for GPTBot has zero effect on ChatGPT-User.
2. Blanket-blocking all AI bots and losing buyer-agent traffic
Adding a User-agent: * / Disallow: / block or applying Cloudflare's managed robots.txt without review catches all AI bots including Agent Buyers and Agent Indexers. The correct structure: explicit Allow: / rules for all AGENT BUYERS and AGENT INDEXERS by their specific user-agent tokens, then explicit Disallow: / for training crawlers you want to exclude, then User-agent: * / Allow: / for everything else. The specific rules override the wildcard for those bots.
3. Assuming blocking GPTBot stops ChatGPT from finding you
GPTBot trains the model weights. OAI-SearchBot builds the search index ChatGPT answers draw from. ChatGPT-User fetches your page live during a shopping session. Blocking only GPTBot leaves OAI-SearchBot and ChatGPT-User fully active — your products still surface in ChatGPT shopping. If you want to block all OpenAI access, you need separate Disallow: / rules for all three tokens independently.
4. Using meta robots noai/noimageai tags as the primary control
The HTML meta directives <meta name="robots" content="noai, noimageai"> are not part of RFC 9309 and are not universally respected. Meta (the company) explicitly uses robots.txt, not NoAI tags. Anthropic, Google, and OpenAI do not document support for these tags. Use robots.txt user-agent blocks as your primary control. HTML meta tags may evolve, but robots.txt is the standard today.
5. Blocking by IP instead of user-agent token
Blocking a vendor's published IP ranges at the firewall prevents the bot from reading your robots.txt at all — Anthropic explicitly warns that IP blocking "may impede Anthropic's ability to read robots.txt." A bot that cannot read robots.txt may treat your site as unconstrained. Use IP blocking only as a supplementary layer for confirmed bad-actor crawlers (Bytespider with documented compliance issues) after setting the correct user-agent token rule in robots.txt first.
6. Forgetting the Sitemap directive
Without a Sitemap: directive, AI indexers may still crawl your site by following links, but they start discovery from whatever links they already know. The Sitemap directive is not part of the formal RFC 9309 user-agent/rule grammar, but it is widely supported as an extension. It is independent of any user-agent group, so it can go anywhere in the file; placing it at the end keeps it visually separate from the rule groups. Adding it gives every crawler, AI and traditional, a systematic URL inventory. For the full sitemap strategy, see the Agent-Readable Sitemap spoke.
7. Case-sensitive path errors in Disallow rules
RFC 9309 specifies that paths in robots.txt are case-sensitive. Disallow: /Private/ does not match /private/. Disallow: /Admin/ does not match /admin/. Verify the actual URL casing your server uses before writing Disallow rules for sensitive paths — especially on case-insensitive file systems (most Windows servers) where the URL paths may look equivalent but robots.txt matching is case-sensitive regardless.
8. Misconfiguring Google-Extended and losing Gemini grounding
Google-Extended does not send HTTP requests. It is a robots.txt control token that instructs Google's existing crawlers whether already-crawled content may be used for Gemini model training and grounding. Disallowing it does not protect you from Googlebot, does not affect Search rankings, and does not stop Google from knowing your content exists. It only removes your content from Gemini Apps responses and Vertex AI grounding. Search rankings are unaffected either way, so the only question is Gemini: if you want your products to stay eligible for Gemini grounding, keep Google-Extended allowed.
§11 · FAQ
Frequently asked questions.
Will blocking GPTBot stop ChatGPT shopping from finding me?
No. GPTBot trains OpenAI's foundation models. The bots that surface your products to ChatGPT users are OAI-SearchBot (which builds ChatGPT's search index) and ChatGPT-User (which fetches your page live when a user asks a question). All three have different user-agent tokens. A Disallow: / for GPTBot only affects training data collection. To stop ChatGPT shopping traffic entirely, you would need separate Disallow: / rules for OAI-SearchBot and ChatGPT-User as well.
Do AI bots actually respect robots.txt?
Major commercial vendors (OpenAI, Anthropic, Perplexity, Google, Amazon, Apple, Meta) state in their documentation that their bots respect robots.txt. Empirically, the compliance record is mixed. Bytespider has a documented history of ignoring robots.txt directives. A 2024 investigation found that Perplexity-User was sending generic Chrome user-agent strings instead of its declared Perplexity-User token, effectively bypassing robots.txt rules. Common Crawl (CCBot) reports compliance. For crawlers with a poor compliance record, layer server-side user-agent blocking or firewall rules on top of robots.txt.
Should I block Bytespider?
Yes, with robots.txt and a firewall rule. Bytespider is operated by ByteDance (TikTok's parent company). Stack Overflow users and independent researchers have documented that Bytespider ignores robots.txt and does not read the file before crawling. Adding a Disallow: / rule for Bytespider in robots.txt is the correct signal; however, given the compliance history, also add a user-agent string firewall block at your CDN or server to catch requests matching Bytespider in the user-agent header.
What's the difference between GPTBot and ChatGPT-User?
GPTBot is an automated crawler that harvests web content to train OpenAI's foundation models. It runs continuously and at scale. ChatGPT-User is a user-triggered fetcher — when a ChatGPT user (or a Custom GPT) asks a question that requires visiting a specific URL, that single request is sent with the ChatGPT-User token. GPTBot is the training pipeline; ChatGPT-User is the live shopping/browsing pipeline. Block one, allow the other — or configure them completely independently.
Does Google-Extended affect my Google Search ranking?
No. Google's documentation is explicit: "Google-Extended does not impact a site's inclusion in Google Search nor is it used as a ranking signal in Google Search." Setting Disallow: / for Google-Extended only removes your content from Gemini model training datasets and from grounding responses in Gemini Apps and Vertex AI. Your organic rankings in Google Search are determined by Googlebot's standard crawl, which Google-Extended does not control.
Can I allow AI crawlers only to product pages and block them from my blog?
Yes. The longest-match rule in RFC 9309 makes this precise. For example: User-agent: GPTBot / Allow: /products/ / Allow: /collections/ / Disallow: /. The /products/ Allow (10 characters) is more specific than the / Disallow (1 character), so product pages are crawled and blog posts are not. Test with: curl -A "GPTBot" -I https://yourstore.com/products/widget-pro to verify server-side behavior (curl shows how your server responds to that user-agent string; it does not evaluate robots.txt rules), then confirm in Google Search Console that your robots.txt is parsed as intended.
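The longest-match evaluation can be sketched in a few lines. This is a simplified model of the RFC 9309 rule, not a full parser: it handles plain path prefixes only (the * and $ wildcards are left out), the function name is_allowed is hypothetical, and the blog path is a made-up example.

```python
def is_allowed(rules, path):
    """Simplified RFC 9309 matching: longest matching prefix wins, Allow wins ties.

    rules: list of ("allow" | "disallow", path_prefix) for one user-agent group.
    Wildcards (* and $) are deliberately not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        # Prefix comparison is case-sensitive; an empty prefix matches nothing.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


GPTBOT_RULES = [
    ("allow", "/products/"),
    ("allow", "/collections/"),
    ("disallow", "/"),
]

print(is_allowed(GPTBOT_RULES, "/products/widget-pro"))  # product page
print(is_allowed(GPTBOT_RULES, "/blog/launch-post"))     # hypothetical blog path
print(is_allowed(GPTBOT_RULES, "/Products/widget-pro"))  # different casing
```

The third call shows the case-sensitivity trap: /Products/ does not match the /products/ Allow, so the request falls through to the / Disallow.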
How do I verify a bot is really from OpenAI (or Anthropic, Perplexity, etc.)?
User-agent strings can be spoofed. Verification requires checking the source IP against published IP ranges:
1. Find the IP address of the request in your access logs.
2. Check it against the vendor's published JSON list (e.g., https://openai.com/gptbot.json for GPTBot).
3. For Google: perform a reverse DNS lookup (host IP_ADDRESS). The result should resolve to *.googlebot.com or *.google.com. Then forward-verify that the resolved hostname points back to the original IP.
If the IP is not in the published range and does not pass reverse DNS verification, the user-agent string is spoofed. Treat it as an unauthorized crawler and block at the firewall level.
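Both checks can be sketched with the Python standard library. The function names here are hypothetical, the CIDR block is a documentation placeholder and not a real vendor range, and the resolver arguments are injectable so the logic can be exercised without network access. In practice you would load the CIDR list from the vendor's published file, whose exact JSON layout is not assumed here.

```python
import ipaddress
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")
EXAMPLE_RANGES = ["192.0.2.0/24"]  # placeholder, not a real vendor range


def ip_in_ranges(ip, cidrs):
    """True if ip falls inside any CIDR block from a vendor's published list."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def passes_reverse_dns(ip, allowed_suffixes,
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Reverse lookup, suffix check, then forward-confirm back to the same IP.

    The default forward resolver (gethostbyname_ex) handles IPv4 only.
    """
    try:
        hostname = reverse(ip)[0]
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        return ip in forward(hostname)[2]
    except OSError:  # lookup failed: treat the request as unverified
        return False


print(ip_in_ranges("192.0.2.17", EXAMPLE_RANGES))
print(ip_in_ranges("198.51.100.5", EXAMPLE_RANGES))
```

The forward-confirmation step matters because anyone who controls reverse DNS for their own IP block can make it return a hostname ending in a trusted suffix; only the forward lookup ties that hostname back to the vendor.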
What about agents.txt — is it real yet?
Not yet, in any implemented form. As of this writing, two competing proposals use the "agents.txt" label. An IETF individual Internet-Draft (draft-srijal-agents-policy-00, October 2025) proposes a strict policy file at /agents.txt with SHA-256 hash verification and a fail-closed approach (malformed file = fully restricted). A separate community proposal by Jasper van Veen (originally agents.txt, renamed to agent-manifest.txt in March 2026) covers richer capability declaration including API endpoints, action permissions, and MCP server discovery. Neither proposal has been adopted by any major AI vendor. Zero crawlers currently implement either spec. Deploy robots.txt as your primary bot policy control. Place an experimental manifest at /agents.json if you want to signal capability early. Monitor https://datatracker.ietf.org/doc/draft-srijal-agents-policy/ for standards progression.
§12 · Step-by-Step
Configuring robots.txt for agent commerce, in five steps.
Each step mirrors the HowTo JSON-LD at the top of this page word for word. Execute in order — the output of each step is the input to the next.
Step 1 — Inventory bots in your logs
Pull the last 30 days of access logs and extract all user-agent strings that include known AI bot tokens. Use the grep pattern from the Verification section. Create a spreadsheet with columns: token, request count, paths accessed, source IPs. This baseline tells you which bots are actually hitting your site today and whether any paths you expected to be accessible are getting 403s or redirects.
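As one way to build that baseline, the sketch below tallies requests per token from access-log lines. It assumes the Combined Log Format; the token list is a partial example from this guide, the function name inventory is hypothetical, and the sample lines (including their simplified user-agent strings) are made up for illustration. Adjust the regular expression to your server's actual log format.

```python
import re
from collections import Counter, defaultdict

# Partial example list; extend it from the vendor documentation you trust.
AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "Claude-User",
             "PerplexityBot", "Perplexity-User", "Bytespider", "CCBot"]

# Combined Log Format: ip - - [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def inventory(lines, tokens=AI_TOKENS):
    """Return {token: {"count", "paths", "ips", "statuses"}} for matching requests."""
    report = defaultdict(
        lambda: {"count": 0, "paths": set(), "ips": set(), "statuses": Counter()}
    )
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        ua = m["ua"].lower()
        for token in tokens:
            if token.lower() in ua:
                row = report[token]
                row["count"] += 1
                row["paths"].add(m["path"])
                row["ips"].add(m["ip"])
                row["statuses"][int(m["status"])] += 1
                break
    return dict(report)


# Made-up sample lines with simplified user-agent strings.
SAMPLE = [
    '192.0.2.17 - - [10/Jan/2026:12:00:00 +0000] "GET /products/widget-pro HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot)"',
    '192.0.2.17 - - [10/Jan/2026:12:00:05 +0000] "GET /collections/sale HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; GPTBot)"',
    '198.51.100.5 - - [10/Jan/2026:12:01:00 +0000] "GET /products/widget-pro HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ChatGPT-User)"',
    '203.0.113.9 - - [10/Jan/2026:12:02:00 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

report = inventory(SAMPLE)
for token, row in report.items():
    print(token, row["count"], sorted(row["paths"]), dict(row["statuses"]))
```

The per-token status counts surface unexpected 403s or redirects, which is the signal Step 1 asks you to look for before changing any rules.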
Step 2 — Classify each bot into buyer / indexer / training
Using the Bot Directory table and the taxonomy summary, assign each active bot to one of the three categories. Flag any unrecognized tokens for manual research — look them up against the vendor's official documentation, not community blocklists. Pay attention to bots where the classification is ambiguous (e.g., meta-externalagent is described by Meta as covering both AI indexing and training).
Step 3 — Draft your allow/block policy
For each category, apply the commerce-default rules from this document: Allow all buyers, Allow all indexers, make a deliberate choice on training crawlers (all-block, selective path allow, or full allow). Document the business rationale for each training crawler decision — you will revisit this as AI commerce matures. Decide whether you want Google-Extended and Applebot-Extended Allowed (your content may improve future Gemini/Apple Intelligence responses) or Disallowed (your content stays out of those training pipelines but you remain in Google Search and Spotlight).
Step 4 — Deploy robots.txt + cross-references
Replace or update your robots.txt with the complete runnable file from the Reference robots.txt section, adjusted for your actual paths. Add the Sitemap: directive with your product sitemap URL. Publish /llms.txt linking to your key product pages. Publish /agents.json with your store's API surface. Verify robots.txt is accessible at exactly https://www.yourstore.com/robots.txt with Content-Type: text/plain and UTF-8 encoding. In Google Search Console, request a recrawl of the updated robots.txt so Google picks up the change without waiting for its next scheduled fetch.
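The accessibility check at the end of this step can be scripted. The sketch below is one possible shape, with hypothetical function names: check_robots_response inspects a status code, Content-Type header, and body, and fetch_and_check wires it to a live URL using the standard library. The checks mirror the requirements in this step and are not an exhaustive validator.

```python
from urllib.request import urlopen


def check_robots_response(status, content_type, body):
    """Return a list of problems found in a robots.txt HTTP response."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    # Ignore parameters such as "; charset=utf-8" when comparing the media type.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"expected Content-Type text/plain, got {media_type or 'none'}")
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
        text = ""
    if not any(l.strip().lower().startswith("sitemap:") for l in text.splitlines()):
        problems.append("no Sitemap: directive found")
    return problems


def fetch_and_check(url):
    """Fetch a live robots.txt and check it. urlopen raises HTTPError on 4xx/5xx."""
    with urlopen(url, timeout=10) as resp:
        return check_robots_response(
            resp.status, resp.headers.get("Content-Type", ""), resp.read()
        )


good = check_robots_response(
    200,
    "text/plain; charset=utf-8",
    b"User-agent: *\nAllow: /\nSitemap: https://www.yourstore.com/sitemap.xml\n",
)
print(good)
```

Calling fetch_and_check("https://www.yourstore.com/robots.txt") after deployment, with your real hostname substituted, returns an empty list when all four checks pass.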
Step 5 — Verify with curl, logs, and vendor tools
Run curl tests for each major bot on representative URLs: your homepage, a product page, a collection page, and any admin or private paths that should be blocked. Keep in mind what curl can and cannot show: it reveals how your server, CDN, or firewall responds to a given user-agent string (for example, 200 versus 403), but it does not evaluate robots.txt rules, which compliant bots apply on their own side. Check Google Search Console's robots.txt report for parse errors. Pull logs 48–72 hours after deployment to confirm that AGENT BUYER tokens are appearing on product pages and that training crawler tokens are not. Set a calendar reminder to re-check vendor bot documentation quarterly, because user-agent strings and version numbers change and new bots are introduced without notice.
§13 · Continue the Guide
Next stops in the Discovery quadrant.
The Window
The window for getting agent discovery right is now.
Every quarter, the floor moves up. User-agent tokens once ignored by bots are now enforced. Google-Extended is now a real signal for Gemini grounding eligibility. OAI-SearchBot is indexing at scale. The merchants who configure their bot policy correctly now — allowing buyers and indexers unconditionally, making deliberate choices on training crawlers, and deploying the full three-file discovery triad — get compounding benefit from every new AI commerce channel that launches. The merchants who ship a blanket block or leave their default robots.txt will spend the back half of 2026 debugging why their products don't surface in AI shopping results.
Open the AgentMall Roadmap →