SPOKE · DISCOVERY — HOW AGENTS FIND YOU

The Agent-Readable Sitemap — Extending sitemap.xml + llms.txt + OFA

Discovery has two halves: the bot policy layer (covered in the robots.txt spoke) and the exposure layer you control here. An agent-readable sitemap is not one file; it is a stack. sitemap.xml is your URL inventory. llms.txt is your curated agent navigation guide. /agents.json is the forward-looking capability manifest. "OFA" as a discrete named spec does not currently exist; what does exist are AAIF, A2A, DNS-AID, and the draft Agent Name Service, none of which maps to an HTTP sitemap extension. What you can act on today: a complete sitemap.xml with accurate lastmod, an llms.txt built for agent inference, and cross-links across all four discovery files. AI crawlers appear to depend on sitemaps more than Googlebot does because they lack 25 years of link-graph history, so getting this layer right pays off in ways that on-page SEO alone cannot.

82%: crawl rate for pages in sitemap (re-verify)
4: discovery stack layers
50k: URL limit per sitemap file
Sep 2024: llms.txt proposal date

§1 · The Root Problem

Discovery is Problem #1 — agents can't buy what they can't find.

The AgentMall roadmap names three ways the internet fails AI agents. The first — and the one that invalidates everything downstream — is discovery: agents can't find products. The second is comprehension: product pages aren't machine-readable. The third is transaction: no agent-native checkout exists yet. This spoke addresses Problem #1 entirely. The product data spoke handles Problem #2. Layer 4 and the UCP spoke attack Problem #3.

The discovery quadrant has four spokes, each owning one responsibility. You are reading the second, which covers sitemap.xml and llms.txt, plus an honest assessment of the "OFA" capability hint layer.

| File | Discovery Spoke | What It Controls | Status |
| --- | --- | --- | --- |
| robots.txt | Spoke 1: robots.txt + agents.txt | Who is allowed to crawl what | RFC 9309 Standard |
| sitemap.xml + llms.txt | This spoke (Spoke 2) | What URLs exist + which are worth reading | Established / Community proposal |
| Agent SEO signals | Spoke 3: Agent SEO | Which results get cited and recommended | Vendor-confirmed + inferred |
| /agents page | Spoke 4: /agents page | Capability manifest: MCP, auth, endpoints | Draft / no single standard |

The three-failure framing matters for how you prioritize: there is no point optimizing agent SEO ranking signals if your sitemap excludes half your product catalog. There is no point building an MCP server if ClaudeBot can't discover your product pages. Fix discovery first, then work outward.

Key Insight

AI crawlers are significantly more dependent on sitemaps than Googlebot. Googlebot has decades of link-graph knowledge and domain trust models. Smaller AI crawlers lack this history and rely on sitemaps as their primary discovery mechanism. A stale or incomplete sitemap hurts AI discovery disproportionately more than it hurts traditional SEO.

§2 · The Protocol

sitemap.xml Protocol 0.9 — what the spec actually says.

The sitemaps.org Protocol 0.9 defines a single XML namespace (http://www.sitemaps.org/schemas/sitemap/0.9) and four fields per <url> entry. Before you optimize for agents, understand what the standard actually gives you.

| Field | Required | Format | Google behavior | Bing behavior |
| --- | --- | --- | --- | --- |
| <loc> | Yes | Absolute URL, < 2,048 chars, URL-escaped | Honored | Honored |
| <lastmod> | No | W3C Datetime (YYYY-MM-DD or full ISO 8601) | Used when consistently accurate; ignored if stale | Must reflect true content modification; recommends full ISO 8601 with timezone |
| <changefreq> | No | always / hourly / daily / weekly / monthly / yearly / never | Ignored entirely | Ignored |
| <priority> | No | 0.0–1.0 (default 0.5) | Explicitly ignored | Ignored |

Google Search Central documentation is blunt: Google ignores <priority> and <changefreq> entirely. <lastmod> is used only "if it is consistently and verifiably accurate" — Google validates it against actual page modification timestamps and builds a per-domain trust score. If your sitemap stamps every URL with today's date on every regeneration, Google will eventually stop trusting the field. Bing's guidance (July 2025) mirrors this: Bing recommends ISO 8601 with time component (2025-11-14T10:30:00+00:00), ignores changefreq and priority, and says lastmod must reflect the true last modification of the page content — not when the sitemap was regenerated. (Re-verify before launch.)

Size limits and sitemap index files

A single sitemap file: 50,000 URLs maximum, 50 MB uncompressed. Hit either limit and you need a sitemap index file (<sitemapindex>), which can reference up to 50,000 child sitemaps and theoretically cover billions of URLs. Google Search Console allows up to 500 sitemap index files per verified property. For most Shopify and WooCommerce stores with under 50,000 SKUs, a single sitemap is fine. Larger catalogs — marketplaces, multi-vendor platforms — need the index pattern immediately.
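The index pattern can be sketched in a few lines of Python. This is a minimal illustration, not a production generator: the child file names (sitemap-products-N.xml) and the 120,001-URL catalog are hypothetical, and the sketch enforces only the 50,000-URL limit, not the 50 MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(child_sitemap_urls):
    """Return a <sitemapindex> document (as a string) listing child sitemaps."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child_url in child_sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = child_url
    return ET.tostring(root, encoding="unicode")


# Hypothetical catalog that is too large for a single sitemap file.
urls = [f"https://examplestore.com/products/item-{n}" for n in range(120_001)]
chunks = chunk_urls(urls)
children = [
    f"https://examplestore.com/sitemap-products-{i + 1}.xml"
    for i in range(len(chunks))
]
index_xml = build_sitemap_index(children)
```

Each chunk would then be written out as its own <urlset> file, with the index served at the URL you declare in robots.txt.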

Why sitemaps matter more for AI crawlers than for Google

Site log analysis across multiple stores shows pages in the sitemap receive approximately 82% crawl coverage from AI bots; pages not in the sitemap receive around 12%. One case study: 47 blog posts added to a sitemap resulted in 31 being indexed by at least one AI crawler within three weeks, with 8 appearing in Perplexity answers. (Re-verify before launch — source: Reddit/GEO_optimization community analysis.) Sitemaps function as a crawl budget allocation tool for agents: they signal which URLs exist, and lastmod signals which URLs changed recently and deserve a revisit.

Runnable sitemap.xml — 5 representative entries, correct format
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

  <url>
    <loc>https://examplestore.com/</loc>
    <lastmod>2025-11-01T09:00:00+00:00</lastmod>
  </url>

  <url>
    <loc>https://examplestore.com/collections/hiking</loc>
    <lastmod>2025-10-28T14:30:00+00:00</lastmod>
  </url>

  <url>
    <loc>https://examplestore.com/products/osprey-atmos-65</loc>
    <lastmod>2025-11-10T08:15:00+00:00</lastmod>
  </url>

  <url>
    <loc>https://examplestore.com/products/merrell-moab-3-gtx</loc>
    <lastmod>2025-11-05T11:00:00+00:00</lastmod>
  </url>

  <url>
    <loc>https://examplestore.com/pages/returns</loc>
    <lastmod>2025-09-15T10:00:00+00:00</lastmod>
  </url>

</urlset>

Notes: changefreq and priority are omitted — both are ignored by Google and Bing. lastmod uses full ISO 8601 with timezone offset, per Bing's recommendation.
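A quick way to check a sitemap against the field rules in the table above is a small audit script. The sketch below is an assumption-laden illustration: audit_sitemap is a hypothetical helper, and it relies on datetime.fromisoformat, which accepts YYYY-MM-DD and full offsets but rejects the year-only and year-month forms that W3C Datetime also permits (and accepts a trailing Z only on Python 3.11 or later).

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def audit_sitemap(xml_text):
    """Return (loc, problem) tuples for <url> entries that break the field rules."""
    problems = []
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        parsed = urlparse(loc)
        if not (parsed.scheme in ("http", "https") and parsed.netloc):
            problems.append((loc, "loc is not an absolute http(s) URL"))
        if len(loc) >= 2048:
            problems.append((loc, "loc is 2,048 characters or longer"))
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        if lastmod is not None:
            try:
                datetime.fromisoformat(lastmod.strip())
            except ValueError:
                problems.append((loc, f"lastmod is not a parseable date: {lastmod!r}"))
    return problems
```

An empty result means every entry passed these three checks; it says nothing about whether the lastmod values are truthful.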

§3 · The Gaps

Where sitemap.xml falls short for agents.

The core limitation: sitemap.xml is a URL inventory, not a semantic map. A sitemap containing 5,000 URLs tells an agent crawler those URLs exist. It tells the agent nothing about what any of them are or can do. None of the six gaps below are solvable by adding more <url> entries — they require a separate layer.

Gap 1 · Semantic Role

No content type signal

Is /p/12345 a product, a policy page, a blog post, an API endpoint, or a checkout confirmation? The sitemap cannot tell an agent. It treats all URLs identically.

Gap 2 · Capability Signals

No action declarations

Does this store accept orders via MCP? Does it expose a SearchAction endpoint? sitemap.xml has no field for capability hints — agents must infer everything from page content.

Gap 3 · Schema Indication

No structured data preview

Which URLs have Product structured data? Which have FAQPage? Which have Event? The sitemap cannot signal this — an agent must fetch each page to find out.

Gap 4 · Render Type

HTML vs SPA vs PDF

Is the page server-rendered HTML, a JavaScript SPA that returns empty markup to headless fetchers, or a PDF? Agent crawlers that don't execute JS will receive blank pages with no warning.

Gap 5 · Language Variants

hreflang often missing

hreflang annotations live in the sitemap or page headers but are often omitted for international stores, leaving agents with US-EN content only — surfacing wrong pricing and shipping to non-US users.

Gap 6 · Pagination Noise

Near-duplicate URL flood

Faceted navigation or query parameters (?sort=price&page=3) flood the sitemap with near-duplicate URLs that dilute agent crawl budget and confuse which URL is canonical.

The Solution

A separate layer is required

llms.txt, /agents.json, and Schema.org action types each address specific gaps. No single extension to the sitemap spec solves all six — you need the full stack described in the next section.

§4 · The Full Stack

The agent-readable sitemap stack — four layers compared.

Three files work together to give agent crawlers what sitemap.xml alone cannot. A fourth file — the capability manifest — is covered in the /agents page spoke. The table below shows their current status honestly.

| Layer | File | Purpose | Format | Status | Example Location |
| --- | --- | --- | --- | --- | --- |
| 1: Inventory | sitemap.xml | Full URL inventory with freshness signals via lastmod | XML (sitemaps.org Protocol 0.9) | Established standard | /sitemap.xml |
| 2: Navigation | llms.txt | Curated agent navigation: which URLs are worth reading and why | Markdown (llmstxt.org proposal) | Community proposal; no IETF/W3C RFC | /llms.txt |
| 2b: Bulk Ingest | llms-full.txt | Concatenated full content for single-fetch ingestion by agents | Markdown (same proposal) | Community proposal | /llms-full.txt |
| 3: Capability | /agents.json or /agents | Capability declaration: MCP endpoints, actions, auth scheme | JSON (no single dominant spec yet) | Draft / vendor-specific; see /agents spoke | /agents.json |

The fourth slot — a machine-readable capability manifest — is cross-referenced to the /agents page spoke, which covers /agents.json format and content in depth. This spoke owns the discovery and navigation layers; that spoke owns capability declaration. The robots.txt spoke owns the access control layer that governs who is allowed to fetch any of these files.

Platform Cross-Link

Shopify's native /sitemap.xml covers products, collections, pages, and blog posts automatically. WooCommerce sites get a basic sitemap from WordPress core (at /wp-sitemap.xml since WordPress 5.5); Yoast SEO or Rank Math replace it with one that offers finer control over what is included. BigCommerce has built-in XML sitemap generation. Headless stacks (Sanity, Contentful, Strapi) need a build-time sitemap generator. See the Shopify, WooCommerce, BigCommerce, and Headless spokes for platform-specific steps.

§5 · The Navigation Layer

llms.txt Deep Dive — origin, format, and live adoption.

Origin and status

Jeremy Howard (co-founder of Answer.AI and fast.ai) published the initial llms.txt proposal at llmstxt.org on September 3, 2024. The proposal is community-driven and hosted on GitHub. There is no IETF Internet-Draft, no W3C Working Note, and no formal standards body backing as of this writing. Adoption is bottom-up: Mintlify rolled out auto-generation for all docs sites it hosts in late 2024, which alone bootstrapped thousands of sites including Anthropic developer docs and Cursor. (Re-verify current adoption status before launch — this landscape changes rapidly.)

Exact format specification

The llms.txt file must be located at /llms.txt in the root path. It follows strict Markdown structure in a defined order:

| Section | Required / Optional | Purpose | Example |
| --- | --- | --- | --- |
| # H1 site title | Required | Identifies the site or project | # ExampleStore |
| > blockquote summary | Optional | One-paragraph context for the agent | > An independent outdoor gear retailer… |
| Free Markdown (no headings) | Optional | Additional background, caveats, behavioral notes | Notes about live pricing, no dropshipping, etc. |
| ## Section heading + list | Optional, repeatable | Groups of curated links with descriptions | ## Products, ## Policies, ## API |
| ## Optional (special name) | Optional | Links skippable for shorter context windows | Blog archive, press kit, secondary docs |

Each list item inside a section follows this exact pattern:

- [Link title](https://full-url): One-sentence description of what's on this page.

The description is not decorative. Agents use it as standalone context — they may decide whether to fetch the URL based on the description alone. Specifics beat generics: "Returns policy: 30-day free returns, no restocking fee, prepaid label included" is more useful than "Our returns page."

llms.txt vs llms-full.txt

/llms.txt: The curated navigation map. Small file. Points to the important URLs. A human reading it with a text editor would find it readable. IDE agents (Cursor, Claude Code, Windsurf) fetch this to locate relevant pages before diving deeper.

/llms-full.txt: The concatenated full content of every linked page, in a single Markdown file. Intended for agents that want to ingest the entire relevant content of a site in one request without following individual links. FastHTML ships both llms-ctx.txt (without Optional section links) and llms-ctx-full.txt (with all links expanded) using the llms_txt2ctx CLI tool. For a Shopify or WooCommerce merchant: llms.txt alone is usually sufficient. llms-full.txt becomes valuable for SaaS products, developer documentation, or API-heavy sites where agents need full reference material in one pull.

Verified live adoption — fetch results

| Site | URL | Status | Notes |
| --- | --- | --- | --- |
| Anthropic developer docs | https://docs.anthropic.com/llms.txt | 200 confirmed | Full structured file with 1,557 English pages listed. (Re-verify before launch.) |
| Anthropic main site | https://anthropic.com/llms.txt | 404 | Not present at root domain. |
| Stripe | https://stripe.com/llms.txt | 200 confirmed | Full product hierarchy in Markdown sections. (Re-verify before launch.) |
| Cloudflare Developer Docs | https://developers.cloudflare.com/llms.txt | 200 confirmed | Points to per-product sub-llms.txt files (nested structure). (Re-verify before launch.) |
| Cursor | https://cursor.com/llms.txt | 200 confirmed | Docs sections listed; some malformed URLs present (double-domain bug noted). (Re-verify before launch.) |
| Perplexity | https://perplexity.ai/llms.txt | 404 | Not present. |

Complete runnable llms.txt example for a commerce site

# ExampleStore — Outdoor Gear & Apparel

> ExampleStore is an independent US-based retailer of outdoor gear, hiking apparel,
> and camping equipment. We ship to all 50 states with free shipping on orders over $75.
> All products are stocked in our Portland, OR warehouse. We do not dropship.

Important notes for AI assistants:
- Product availability and pricing are live; always reference the product URL, not cached data.
- We accept Visa, Mastercard, Amex, PayPal, and Shop Pay.
- Returns are accepted within 30 days; free prepaid label provided.

## Core Store Pages

- [Homepage](https://examplestore.com/): Main landing page with featured collections and current promotions.
- [All Products](https://examplestore.com/collections/all): Full catalog browsable by category, brand, and activity.
- [Hiking Gear](https://examplestore.com/collections/hiking): Boots, trekking poles, packs, and trail clothing. 400+ SKUs.
- [Camping Equipment](https://examplestore.com/collections/camping): Tents, sleeping bags, stoves, and cookware.
- [Sale](https://examplestore.com/collections/sale): Currently discounted items; updated daily.

## Policies & Customer Service

- [Shipping Policy](https://examplestore.com/pages/shipping): Free shipping over $75 (US only), 2–5 business days standard, next-day available at checkout.
- [Returns & Exchanges](https://examplestore.com/pages/returns): 30-day return window, free prepaid label, exchange or store credit issued within 3 business days.
- [FAQ](https://examplestore.com/pages/faq): Answers to sizing, gift cards, order tracking, and wholesale inquiries.
- [Contact](https://examplestore.com/pages/contact): Email support@examplestore.com; phone 1-800-555-0192 (Mon–Fri 9am–5pm PT).

## Product Highlights

- [Osprey Atmos 65 Pack](https://examplestore.com/products/osprey-atmos-65): Top-selling backpacking pack; sizes XS–XL; currently in stock in all colors.
- [Merrell Moab 3 GTX Hiking Boot](https://examplestore.com/products/merrell-moab-3-gtx): Gore-Tex waterproof; men's and women's versions; sizes 6–14.
- [MSR WhisperLite Stove](https://examplestore.com/products/msr-whisperlite): Multi-fuel backpacking stove; includes pump and maintenance kit.

## Technical / Agent Integration

- [Sitemap](https://examplestore.com/sitemap.xml): Full URL inventory in sitemaps.org Protocol 0.9 format.
- [Agents Manifest](https://examplestore.com/agents.json): Capability endpoints, MCP server location, supported actions, and auth scheme.
- [Structured Data Overview](https://examplestore.com/pages/schema-info): Notes on Product, Offer, and SearchAction JSON-LD present on product pages.

## Optional

- [Brand Story](https://examplestore.com/pages/about): History, mission, and team behind ExampleStore. Skip if context window is tight.
- [Press Kit](https://examplestore.com/pages/press): Brand assets and media contact. Skip for product or shipping queries.
- [Blog](https://examplestore.com/blogs/news): Gear reviews, trail guides, and seasonal picks. Large archive; skim index only unless topic-specific.

§6 · Honest Assessment

OFA — what actually exists in the agent-discovery standards space.

Critical — Read Before Acting

"Open Foundation Agents" or "OFA" as a discrete, named web-discovery specification does not appear in any current standards body documentation, IETF Internet-Draft index, or major vendor's published roadmap as of this writing. Operators do not need to implement anything called "OFA" today. The term may refer to a concept in early-stage discussion that has not yet coalesced into a named, published specification. (Re-verify before launch — the standards landscape evolves quarterly.)

What does exist in the agent-discovery standards space are four active initiatives, none of which maps to an HTTP sitemap extension:

| Initiative | Org | Date | Scope | Relevance to sitemap/page discovery |
| --- | --- | --- | --- | --- |
| AAIF (Agentic AI Foundation) | Linux Foundation | December 2025 | Governs MCP (Anthropic), AGENTS.md (OpenAI), Goose (Block). Agent-to-tool protocol interoperability. | None: operates at tool protocol layer, not web page discovery |
| Agent2Agent (A2A) | Google | April 2025 | Agent-to-agent communication and capability discovery via "Agent Cards" in JSON. 50+ partners. | None: agent-to-agent, not crawler-to-sitemap |
| DNS-AID | Linux Foundation / Infoblox | May 2026 | Enables AI agents and MCP servers to use DNS as a vendor-neutral discovery directory. | None: DNS infrastructure layer, not HTTP/sitemap layer |
| IETF draft-narajala-ans | IETF | Active draft | Agent Name Service (ANS): A Universal Directory for Secure AI Agent Discovery. DNS-based registry for agent identities. | None: not a web sitemap or llms.txt complement. (Re-verify draft status before launch.) |

The closest actionable analog to what an agent-discovery extension might eventually become is the /agents.json manifest discussed in the /agents page spoke. Watch aaif.io for evolving standards. Your actionable stack today: complete sitemap.xml with accurate lastmod, a curated llms.txt, and a forward reference to /agents.json. Revisit OFA/AAIF standards quarterly.

§7 · Capability Hints

Schema.org Action Types — SearchAction, BuyAction, and Offer.

Schema.org's potentialAction property lets you attach machine-readable action descriptors directly to pages. For commerce, these types are operationally relevant as forward-looking capability signals. No current AI shopping agent executes BuyAction autonomously against arbitrary merchants today, but publishing this structured data costs nothing and positions your catalog as agent-commerce matures.

| Schema.org Type | Hierarchy | Commerce Use Case | Where to Place |
| --- | --- | --- | --- |
| SearchAction | Thing > Action > SearchAction | Declares your site search endpoint; agent can query your catalog directly | WebSite object on homepage |
| BuyAction | Thing > Action > TradeAction > BuyAction | Declares that a purchase can be initiated at a specific endpoint | Offer object on product pages |
| ReserveAction | Thing > Action > OrganizeAction > PlanAction > ReserveAction | Reservation or hold on a product or service | Offer on product pages where holds are supported |
| OrderAction | Thing > Action > TradeAction > OrderAction | Placing a complete order | Offer on product pages with full order API |

SearchAction on the WebSite object (sitewide)

This JSON-LD, placed inside a <script type="application/ld+json"> element in the <head> of your homepage, tells agents where your search endpoint is. An agent that parses this block knows it can query your catalog by issuing a GET request, with no human interaction required.

SearchAction — WebSite object, full runnable JSON-LD
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "ExampleStore",
  "url": "https://examplestore.com",
  "potentialAction": {
    "@type": "SearchAction",
    "target": {
      "@type": "EntryPoint",
      "urlTemplate": "https://examplestore.com/search?q={search_term_string}"
    },
    "query-input": "required name=search_term_string"
  }
}
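
From the agent's side, turning that block into a request is a template substitution. The sketch below shows one plausible way to do it; build_search_url is a hypothetical helper, and it assumes the single-action, string-valued query-input shape used in the example above.

```python
import json
from urllib.parse import quote_plus


def build_search_url(jsonld_text, query):
    """Fill the SearchAction urlTemplate from a WebSite JSON-LD block with `query`."""
    data = json.loads(jsonld_text)
    action = data["potentialAction"]
    template = action["target"]["urlTemplate"]
    # "required name=search_term_string" -> "search_term_string"
    placeholder = action["query-input"].split("name=", 1)[1].strip()
    return template.replace("{" + placeholder + "}", quote_plus(query))
```

Sites that publish several potentialAction entries as a list would need an extra step to pick out the SearchAction first.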

BuyAction + Offer on a product page

BuyAction — Product + Offer, full runnable JSON-LD
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Osprey Atmos 65 Backpack",
  "description": "Multi-day backpacking pack with anti-gravity suspension. Available in sizes XS–XL.",
  "sku": "OSP-ATMOS65-GRN-M",
  "gtin14": "00070159040031",
  "brand": {
    "@type": "Brand",
    "name": "Osprey"
  },
  "offers": {
    "@type": "Offer",
    "url": "https://examplestore.com/products/osprey-atmos-65",
    "priceCurrency": "USD",
    "price": "289.95",
    "availability": "https://schema.org/InStock",
    "seller": {
      "@type": "Organization",
      "name": "ExampleStore"
    },
    "potentialAction": {
      "@type": "BuyAction",
      "target": "https://examplestore.com/cart/add?id=osprey-atmos-65"
    }
  }
}

For deeper structured data coverage — including AggregateRating, hasMerchantReturnPolicy, shippingDetails, and GTIN — see the product data spoke. For the full agent-ready API that these action types point to, see the API spoke. Platform-specific schema implementation is covered in the Etsy, Shopify, WooCommerce, and BigCommerce spokes.

§8 · Crawler Behavior

Per-agent crawler behavior — what each bot actually does with your sitemap.

Based on vendor documentation and observed log data. Treat specific crawl frequencies as directional, not guaranteed — OpenAI, Anthropic, and Perplexity have not published formal documentation about how their crawlers consume sitemaps. (Re-verify before launch — all log study findings are based on third-party analysis and may not reflect current behavior.)

| Bot | Operator | Category | Sitemap Behavior | Notes |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | AI Training Crawler | Reads sitemap.xml; behavior confirmed in logs as of March 2026. Approximately 4,200 hits/day on active sites. | Blocking GPTBot does NOT block ChatGPT shopping; that is OAI-SearchBot + ChatGPT-User. See robots.txt spoke for policy guidance. |
| OAI-SearchBot | OpenAI | Agent Indexer | Purpose: surfacing sites in ChatGPT Search. Checks robots.txt 3–6×/day; sitemap behavior not explicitly documented by OpenAI. | Routes through Bing's index for ChatGPT Search results. Verify Bing Webmaster Tools as the proxy for OAI-SearchBot coverage. Allow always for commerce. |
| ClaudeBot | Anthropic | AI Training Crawler | Began reading sitemap.xml March 18, 2026 (confirmed across multiple site log studies). Previously ignored it. Approximately 1,800 hits/day on active sites. 85% of traffic historically on robots.txt. Not documented by Anthropic. | Operator's call on whether to allow training crawlers. See robots.txt spoke for the SEO-vs-training tension. |
| Claude-SearchBot | Anthropic | Agent Indexer | Active sitemap consumer; became 2nd most active sitemap reader after Bingbot in at least one log study. (Re-verify before launch.) | Allow always for commerce; this is the indexer that surfaces your products in Claude queries. |
| PerplexityBot | Perplexity | Agent Indexer | Primarily on-demand; fetches when user query references the domain. Burst pattern rather than scheduled crawl. Some logs show it does not request sitemap.xml regularly. Approximately 980 hits/day on active sites. | Allow always. Perplexity treats products without GTIN as effectively invisible in shopping results; see agent SEO spoke for the GTIN fix. |
| Google-Extended | Google | robots.txt control token | No separate sitemap submission; relies on the same sitemap infrastructure as Googlebot. Log studies report a quiet baseline with a 14-day revisit cadence and approximately 540 hits/day, but because Google-Extended sends no user-agent string of its own, those figures necessarily describe Googlebot-family traffic. | Google-Extended is a robots.txt permission token, not an independent crawler. It controls Gemini training and grounding data. Disallowing it does NOT affect Google Search rankings or Googlebot indexing. |
| Bytespider | ByteDance | AI Training Crawler | Crawls commercial pages aggressively; 1.8-day revisit cadence on retail sites observed in logs. Sitemap behavior confirmed in log analysis. | Operator's call. Some commerce operators block Bytespider as a training crawler. Confirm it is not also used as a buyer or indexer for any TikTok Shop integration before blocking. (Re-verify before launch.) |
| Applebot-Extended | Apple | robots.txt control token | Not covered in current log study literature at the depth of the above. | Controls whether content crawled by Applebot may be used for Apple Intelligence training; it does not crawl pages itself. Operator's call on whether to allow. |

Verification — curl, Search Console, Bing Webmaster, IndexNow, and logs

Test what agent crawlers actually receive from your sitemap. These commands simulate the user-agent string each bot sends:

Manual verification with curl — simulate each major bot
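# Check your sitemap returns 200 (headers only; this does not validate the XML)
curl -I https://examplestore.com/sitemap.xml

# Optional: confirm the body is well-formed XML (assumes xmllint from libxml2 is installed)
curl -s https://examplestore.com/sitemap.xml | xmllint --noout -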

# Simulate GPTBot fetching your sitemap
curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot" \
  https://examplestore.com/sitemap.xml

# Simulate ClaudeBot
curl -A "ClaudeBot/1.0 (+claudebot@anthropic.com)" \
  https://examplestore.com/sitemap.xml

# Simulate OAI-SearchBot
curl -A "OAI-SearchBot/1.0; +https://openai.com/searchbot" \
  https://examplestore.com/sitemap.xml

# Check llms.txt is live
curl -I https://examplestore.com/llms.txt

# Check llms-full.txt (if deployed)
curl -I https://examplestore.com/llms-full.txt

If any of these return a 403 or redirect, your server or WAF may be blocking the user-agent string. Cross-reference with your bot policy to ensure you are not unintentionally blocking crawlers you want. (Re-verify user-agent strings before launch — vendors add and rename bots.)

Log analysis — grep for AI bot sitemap hits
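# Find AI bot hits to sitemap and llms.txt files in nginx/Apache combined log format.
# Google-Extended is omitted: it is a robots.txt token with no user-agent string of its own,
# so it never appears in access logs.
grep -iE "(GPTBot|ClaudeBot|OAI-SearchBot|PerplexityBot|Claude-SearchBot)" /var/log/nginx/access.log \
  | grep -E "sitemap|llms" \
  | awk '{print $1, $7, $9}' \
  | sort | uniq -c | sort -rn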

Watch for: 200 on sitemap.xml — crawler received the file. 200 on llms.txt — crawler received your agent navigation layer. 404 or 403 on either — fix immediately. Sudden spike in sitemap hits from a bot you haven't seen before — new crawler rollout (the March 2026 simultaneous ClaudeBot/GPTBot sitemap adoption is a precedent).
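
If you prefer counts grouped by bot name rather than by client IP, the same analysis can be sketched in Python. This is a rough illustration that assumes combined log format; count_discovery_hits, the BOTS tuple, and the regex are hypothetical and would need adjusting for custom log formats.

```python
import re
from collections import Counter

# Bots that send their own user-agent string (robots.txt-only tokens are excluded).
BOTS = ("GPTBot", "ClaudeBot", "Claude-SearchBot", "OAI-SearchBot", "PerplexityBot")

# Captures the request path, status code, and user agent from a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_discovery_hits(lines):
    """Count (bot, path, status) for requests to sitemap or llms.txt files."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        path = match["path"]
        if "sitemap" not in path and "llms" not in path:
            continue
        user_agent = match["ua"].lower()
        for bot in BOTS:
            if bot.lower() in user_agent:
                counts[(bot, path, match["status"])] += 1
                break
    return counts
```

Feed it an open log file, for example count_discovery_hits(open("/var/log/nginx/access.log")), and sort the result with most_common() to see which bots hit which discovery files and with what status.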

Google Search Console

Google Search Console lets you submit sitemaps, inspect coverage, and see which URLs were indexed. Google-Extended does not have a dedicated GSC report — it piggybacks on Googlebot's infrastructure. Submit your sitemap via GSC and confirm 0 processing errors before any launch.

Bing Webmaster Tools

Bing Webmaster Tools accepts direct sitemap submission and provides submission status, last-read date (confirms Bing fetched the sitemap), and processing errors. Since OAI-SearchBot routes through Bing's index for ChatGPT Search results, verifying Bing indexing is the proxy for OAI-SearchBot coverage.

IndexNow for real-time URL push

Bing and Yandex support IndexNow — a real-time URL-level push protocol. When you update a product page, push the URL to IndexNow immediately rather than waiting for Bing's scheduled sitemap crawl. Shopify and many WooCommerce plugins support IndexNow natively. (Re-verify availability per platform before launch.)
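
For stacks without a native integration, an IndexNow push is a small JSON POST. The sketch below builds the request body only; build_indexnow_payload and the key value are hypothetical, and it assumes you have already generated a key and hosted the matching key file on your domain as the protocol requires.

```python
import json
from urllib.parse import urlparse


def build_indexnow_payload(urls, key, key_location=None):
    """Build the JSON body for an IndexNow bulk submission (all URLs on one host)."""
    hosts = {urlparse(u).netloc for u in urls}
    if len(hosts) != 1:
        raise ValueError("every URL in one IndexNow submission must share a host")
    payload = {"host": hosts.pop(), "key": key, "urlList": list(urls)}
    if key_location:
        payload["keyLocation"] = key_location
    return json.dumps(payload)


# Hypothetical key; real keys are generated by you and served from your site.
body = build_indexnow_payload(
    ["https://examplestore.com/products/osprey-atmos-65"],
    key="0123456789abcdef0123456789abcdef",
)
# To submit, POST `body` to https://api.indexnow.org/indexnow with the header
# "Content-Type: application/json; charset=utf-8".
```

Triggering this from the same hook that updates a product's modification timestamp keeps the push and the sitemap's lastmod in agreement.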

§9 · Common Mistakes

Eight ways sitemap and llms.txt configurations fail in production.

1. Inaccurate lastmod values

Regenerating your sitemap daily and stamping every URL with today's date destroys Google's and Bing's trust in your freshness signals. They stop honoring the field, and AI crawlers that rely on lastmod to decide revisit priority can no longer tell which pages actually changed.

<!-- Wrong: sitemap generation timestamp -->
<lastmod>2025-11-14</lastmod>

<!-- Right: actual page content modification timestamp -->
<lastmod>2025-09-22T11:30:00+00:00</lastmod>

Set lastmod programmatically from your CMS's updated_at field, not from Date.now() in your sitemap generator.
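
In a Python-based generator, the equivalent is a small formatting helper fed from the stored modification time. This is a sketch under one loud assumption: that naive timestamps coming out of your CMS are stored in UTC. The name format_lastmod is hypothetical.

```python
from datetime import datetime, timezone


def format_lastmod(updated_at):
    """Format a CMS modification time as W3C Datetime with an explicit UTC offset."""
    if updated_at.tzinfo is None:
        # Assumption: naive timestamps from the CMS are stored in UTC.
        updated_at = updated_at.replace(tzinfo=timezone.utc)
    return updated_at.astimezone(timezone.utc).isoformat(timespec="seconds")


# The value comes from the page record, never from datetime.now().
lastmod = format_lastmod(datetime(2025, 9, 22, 11, 30))
```

Here lastmod is "2025-09-22T11:30:00+00:00", matching the full ISO 8601 form recommended earlier.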

2. Paginated product URLs flooding the sitemap

Shopify and WooCommerce collection pages generate /collections/hiking?page=2, /collections/hiking?sort_by=price, etc. Listing these in the sitemap wastes agent crawl budget on near-duplicate content and confuses which URL is canonical. Fix: Include only canonical collection landing pages. Exclude query-string variants with ?page=, ?sort=, ?filter=. In Shopify, use a custom sitemap template or a sitemap app that respects canonical logic. In WooCommerce, Yoast SEO excludes paginated variants by default when configured correctly.
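
If you generate the sitemap yourself, the exclusion can be applied as a filter before writing the file. The sketch assumes the canonical form of every listed page is its URL with no query string or fragment, which holds for the collection examples above but not for sites whose canonical URLs carry meaningful parameters; canonical_sitemap_urls is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_sitemap_urls(urls):
    """Drop query strings and fragments, then de-duplicate while keeping order."""
    seen = set()
    result = []
    for url in urls:
        parts = urlsplit(url)
        canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result
```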

3. One giant sitemap exceeding 50,000 URLs or 50 MB

Google and Bing stop processing a sitemap once it hits either limit. The fix is a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://examplestore.com/sitemap-products.xml</loc>
    <lastmod>2025-11-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://examplestore.com/sitemap-pages.xml</loc>
    <lastmod>2025-11-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://examplestore.com/sitemap-collections.xml</loc>
    <lastmod>2025-11-14</lastmod>
  </sitemap>
</sitemapindex>

4. llms.txt pointing to 404 URLs

An llms.txt pointing to 404 URLs is worse than no llms.txt — agents waste fetch budget on dead links and may penalize the file's reliability. Run a link checker as part of your deployment pipeline:

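# Extract every http(s) URL from llms.txt, de-duplicate, and print any that do not return 200.
# Redirects (301/302) are listed too, because curl is not following them here.
curl -s https://examplestore.com/llms.txt | grep -oE 'https?://[^)[:space:]]+' | sort -u | while read -r url; do
  status=$(curl -o /dev/null -s -w "%{http_code}" "$url")
  echo "$status $url"
done | grep -v "^200"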

5. Not declaring the sitemap in robots.txt

Some AI crawlers (checking robots.txt before crawling) will only discover your sitemap if it's declared there. If you only submitted it through Google Search Console, agents that don't use GSC (which is all of them except Googlebot) won't find it automatically. Fix: Add to every site's robots.txt:

Sitemap: https://examplestore.com/sitemap.xml
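
You can confirm the directive is picked up the way a standards-following crawler would read it, using Python's standard-library robots.txt parser (site_maps() requires Python 3.8 or later). The wrapper name declared_sitemaps is hypothetical; this is a sketch, and real crawlers may parse robots.txt differently.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []
```

Run it against the live file's text in your deployment checks and fail the build if the list comes back empty.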

6. Serving a gzipped sitemap without declaring it properly

Sitemaps.org supports gzip compression, but the file must be named correctly and the Sitemap directive in robots.txt must point to the .gz URL, not the uncompressed one. Verify with curl -I that the .gz URL returns 200, then download it and confirm it unpacks as a valid gzip archive containing well-formed XML. A Content-Encoding: gzip header on the plain sitemap.xml URL is transport compression, a separate mechanism that needs no special declaration.

# In robots.txt — point to the actual file served
Sitemap: https://examplestore.com/sitemap.xml.gz
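
The archive check itself is easy to automate. The sketch below takes the downloaded bytes and reports whether they unpack to well-formed XML; is_valid_gzipped_sitemap is a hypothetical helper, and fetching the bytes (for example with urllib.request) is left out so the check stays testable offline.

```python
import gzip
import xml.etree.ElementTree as ET
import zlib


def is_valid_gzipped_sitemap(raw_bytes):
    """Return True if raw_bytes is a gzip archive containing well-formed XML."""
    try:
        xml_bytes = gzip.decompress(raw_bytes)
        ET.fromstring(xml_bytes)
    except (OSError, EOFError, zlib.error, ET.ParseError):
        # OSError covers "not a gzip file"; EOFError covers truncated archives.
        return False
    return True
```

A False result on a file that opens fine in a browser usually means the server is returning the uncompressed XML at the .gz URL.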

7. llms.txt written as marketing copy instead of agent-targeted navigation

"We're the #1 choice for outdoor adventurers who demand quality" tells an agent nothing useful. The file should be navigation and context, not branding. Every line in llms.txt should answer: "What is on this page and why would an agent need it?"

# Wrong — marketing copy
- [Products](https://examplestore.com/collections/all): Explore our incredible selection of top-quality outdoor gear!

# Right — agent navigation
- [All Products](https://examplestore.com/collections/all): Full catalog; 400+ SKUs across hiking, camping, and apparel. Filterable by brand, activity, and price.

8. Forgetting hreflang annotations for international stores

If you sell to Canada, UK, or Australia with localized URLs, omitting hreflang in your sitemap means agent crawlers may only find and index your US-EN pages, surfacing US pricing and shipping terms to non-US users. Fix: add hreflang links to your sitemap entries and include the xmlns:xhtml namespace declaration in your <urlset> tag. Each localized URL needs its own <url> entry that lists every alternate, including itself; the excerpt below shows only the US entry:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://examplestore.com/products/osprey-atmos-65</loc>
    <lastmod>2025-11-10T08:15:00+00:00</lastmod>
    <xhtml:link rel="alternate" hreflang="en-us" href="https://examplestore.com/products/osprey-atmos-65"/>
    <xhtml:link rel="alternate" hreflang="en-ca" href="https://examplestore.ca/products/osprey-atmos-65"/>
  </url>
</urlset>
§10 · FAQ

Frequently asked questions.

Do AI agents actually use my sitemap?

Yes, and more than you might expect. Site log analyses indicate that AI crawlers use sitemaps as a primary discovery mechanism, more so than Googlebot, which has 25+ years of link-graph data to fall back on. Pages in the sitemap receive approximately 82% crawl coverage from AI bots; pages not in the sitemap receive around 12%. ClaudeBot and GPTBot were both reported to start reading sitemap.xml in March 2026 after years of ignoring it, which suggests that how AI crawlers use the file is still evolving. (Re-verify these figures before launch; crawler behavior changes.)

What is the difference between sitemap.xml and llms.txt?

sitemap.xml is an inventory — it lists every URL on your site that should be indexed, with freshness signals. It tells agents that URLs exist and when they were last changed. It is entirely URL-focused, with no semantic meaning attached to any individual URL. llms.txt is a navigation guide — it tells agents which of those URLs are worth reading, in what order, and why. It is curated (not exhaustive), written in Markdown, and designed for agent inference at the moment a user asks a question. The two complement each other: sitemap for coverage, llms.txt for signal.

Do I need llms-full.txt or is llms.txt enough?

For most merchants: llms.txt alone is enough. llms-full.txt is valuable when your most important content is documentation or API reference — places where an agent needs to ingest large amounts of technical detail in a single fetch. If you run a documentation-heavy site, developer tool, or SaaS product, ship both. Mintlify, Fern, GitBook, and Vercel Docs generate both automatically for all hosted sites.

How often should lastmod update?

Update lastmod only when the actual content of the page changes. If you update a product's price, description, or availability, update lastmod. Do not update it when you regenerate the sitemap file itself, change navigation, or make backend-only changes that don't affect page content. Google and Bing both validate lastmod against actual page modification history; consistent accuracy builds trust and causes the field to be honored. Inconsistent stamping causes it to be ignored.
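
A minimal sketch of that rule, assuming your build step can write each page's content fields (price, description, availability) to a file: compare a checksum against the one stored on the previous build, and bump lastmod only when it differs. content_changed and both file arguments are hypothetical names; cksum is the standard POSIX utility.

```shell
# content_changed PAGE_FILE HASH_FILE: exit 0 when PAGE_FILE differs from the
# checksum stored in HASH_FILE (and store the new checksum), exit 1 otherwise.
content_changed() {
  page=$1
  store=$2
  new=$(cksum < "$page")                    # prints "CRC SIZE" for the content
  old=$(cat "$store" 2>/dev/null || true)   # empty on the first run
  if [ "$new" = "$old" ]; then
    return 1          # unchanged: keep the existing lastmod
  fi
  printf '%s\n' "$new" > "$store"
  return 0            # changed: lastmod should be bumped
}
```

Regenerating the sitemap or changing navigation leaves the content file untouched, so neither would trigger a new lastmod under this scheme.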

Will agents see products if they're behind JavaScript rendering?

Generally not. Agent crawlers are overwhelmingly plain HTTP fetchers that do not execute JavaScript. A Shopify store that server-renders product pages is fine. A React SPA that returns an empty <div id="app"></div> to non-JS fetchers is invisible to agent crawlers. Headless commerce implementations using Next.js or Nuxt with server-side rendering are generally safe. Your llms.txt should link to URLs that return clean, parseable HTML or Markdown to a plain GET request, not SPA routes that require JS to populate. Test with curl -A "ClaudeBot/1.0" https://yourstore.com/products/example to see what the agent actually receives.
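
To turn that curl test into a pass/fail check, one option is to look for a string you know should be on the page, such as the product title, in the raw response. The sketch below is an illustration only; page_mentions is a hypothetical helper name and the product title is an example value.

```shell
# page_mentions NEEDLE: read an HTML response on stdin and succeed only if
# the literal string NEEDLE (for example a product title) appears in it.
page_mentions() {
  needle=$1
  # -F treats NEEDLE as a fixed string, -i ignores case.
  if grep -Fi -- "$needle" >/dev/null; then
    echo "present in raw HTML: $needle"
  else
    echo "missing from raw HTML (possibly rendered client-side): $needle"
    return 1
  fi
}
```

Usage: `curl -s -A "ClaudeBot/1.0" https://yourstore.com/products/example | page_mentions "Osprey Atmos 65"`. A failure here is a hint rather than proof, since the title could also be missing for other reasons, such as a redirect or an error page.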

Does Google-Extended use the same sitemap as Googlebot?

Yes. Google-Extended is a user-agent token — a permission layer, not a separate crawler infrastructure. It operates on top of Googlebot's existing crawl. Web publishers use the Google-Extended robots.txt token to control whether Google can use their content for Gemini model training and grounding. There is no separate sitemap submission for Google-Extended, no separate Search Console view, and no additional sitemap configuration needed. Your existing sitemap infrastructure covers it.

How big can a sitemap get before I need a sitemap index?

A single sitemap file has two limits: 50,000 URLs and 50 MB uncompressed. Hit either limit and you need a sitemap index. For a Shopify store with 500 products, a standard single sitemap is fine. For a marketplace or large catalog with tens of thousands of SKUs, a sitemap index with separate files per content type (products, collections, pages, blog posts) is the right architecture. Google Search Console supports up to 500 sitemap index files per verified property.
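
The URL limit reduces to a ceiling division. A small sketch, with sitemap_files_needed as a hypothetical helper name; it covers only the 50,000-URL limit, so a file of very long URLs could still hit the 50 MB limit first.

```shell
# sitemap_files_needed URL_COUNT: print how many sitemap files are required
# when each file may hold at most 50,000 URLs (ceiling division).
sitemap_files_needed() {
  count=$1
  limit=50000
  echo $(( (count + limit - 1) / limit ))
}
```

For a 500-product store `sitemap_files_needed 500` prints 1, while a 120,000-URL catalog gives 3, which is the point where a sitemap index becomes necessary.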

Is OFA something I need to implement now?

No. As documented in Section 6, "Open Foundation Agents" or "OFA" as a discrete, named web-discovery specification does not appear in any current standards body documentation or major vendor's published roadmap. The active agent-interoperability standards (AAIF/MCP, Agent2Agent, DNS-AID) operate at the agent-to-agent and agent-to-tool protocol layer, not the page-discovery layer. Your actionable stack today is: complete sitemap.xml with accurate lastmod, a curated llms.txt, and a forward reference to /agents.json (covered in the /agents page spoke). Revisit OFA/AAIF standards quarterly as they evolve.

§11 · Step-by-Step

The sitemap build, in five steps.

Each step mirrors the HowTo JSON-LD at the top of this page word for word. Execute in order. Most operators can complete all five steps in a single focused afternoon.

Step 1 — Audit current sitemap.xml coverage

Pull your existing sitemap and cross-check it against your actual URL inventory. Count the URLs it lists, then spot-check a random sample for HTTP 200. Confirm: every published product page is listed; paginated and filtered variants are excluded; 404s and redirect chains are cleaned up; the file stays under both limits (50,000 URLs and 50 MB uncompressed).

# Count URLs in current sitemap (counts <loc> tags, so a minified one-line file is still counted correctly)
curl -s https://examplestore.com/sitemap.xml | grep -o '<loc>' | wc -l

# Check for HTTP 200 on a random sample of 20 sitemap URLs (needs GNU grep -P and shuf)
curl -s https://examplestore.com/sitemap.xml | grep -oP '(?<=<loc>)[^<]+' | shuf | head -20 | while read -r url; do
  echo "$(curl -o /dev/null -s -w '%{http_code}' "$url") $url"
done

Step 2 — Add accurate lastmod and ensure product URLs are present

For Shopify: use the native sitemap at /sitemap.xml or a sitemap app (for example Sitemap by Slayback), and verify that lastmod maps to updated_at from the Shopify Storefront API. For WooCommerce: Yoast SEO and Rank Math generate compliant sitemaps by default; verify the lastmod source in the plugin settings. For headless: generate the sitemap server-side with a build-time generator that reads updated_at from your commerce platform's API.
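
For the headless case, the generator can be very small. The sketch below assumes an earlier export step (not shown, and specific to your platform) has already written one "URL, tab, updated_at timestamp" line per page; emit_urlset is a hypothetical helper name.

```shell
# emit_urlset: read "URL<TAB>ISO-8601 timestamp" lines on stdin and print a
# sitemap. The timestamp is copied through unchanged as <lastmod>, so it
# should be the content-change time (e.g. updated_at), not the build time.
emit_urlset() {
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  awk -F '\t' 'NF >= 2 {
    url = $1
    gsub(/&/, "\\&amp;", url)   # escape ampersands so the XML stays valid
    printf "  <url>\n    <loc>%s</loc>\n    <lastmod>%s</lastmod>\n  </url>\n", url, $2
  }'
  echo '</urlset>'
}
```

Usage: `emit_urlset < pages.tsv > sitemap.xml`, where pages.tsv is the hypothetical export file. Because lastmod comes straight from the platform's timestamp, rebuilding the sitemap does not change it.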

Step 3 — Author /llms.txt with curated agent navigation

Create the file following the exact format in Section 5. Include: site title (H1), one-paragraph context blockquote, important behavioral notes (pricing is live, no dropshipping, etc.), sections for core pages, policies, product highlights, and agent/technical links. Use specific, factual descriptions for every link. Deploy at https://yourstore.com/llms.txt. Verify that curl -I https://yourstore.com/llms.txt returns 200.
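
Before deploying, a quick structural check can catch a file that lost its title or links. This is a rough sketch, not a validator for the llms.txt proposal: lint_llms_txt is a hypothetical helper name, and it checks only the three parts named in this step (H1 title, blockquote, link list items).

```shell
# lint_llms_txt: read llms.txt on stdin and report which expected parts are
# missing. Checks: an H1 title as the first non-blank line, a "> " blockquote,
# and at least one "- [name](url)" link line. Exit status = number of misses.
lint_llms_txt() {
  awk '
    !seen_first && NF { seen_first = 1; if ($0 ~ /^# /) h1 = 1 }
    /^> /                     { quote = 1 }
    /^- \[[^]]+\]\(https?:/   { links++ }
    END {
      if (!h1)    { print "missing: H1 title on first line"; bad++ }
      if (!quote) { print "missing: blockquote summary"; bad++ }
      if (!links) { print "missing: link list items"; bad++ }
      if (!bad)   print "ok: " links " links"
      exit bad + 0
    }'
}
```

Usage: `lint_llms_txt < llms.txt` locally, or `curl -s https://yourstore.com/llms.txt | lint_llms_txt` against the deployed file.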

Step 4 — Declare sitemap in robots.txt and cross-link from /agents.json

Add or confirm the line Sitemap: https://yourstore.com/sitemap.xml in your robots.txt. Add a Technical section in llms.txt pointing to your sitemap and to /agents.json. The agents manifest (covered in the /agents page spoke) should in turn reference llms.txt. Together, robots.txt, llms.txt, and /agents.json form a cross-linked discovery graph: an agent that lands on llms.txt or /agents.json can reach every other discovery file, and one that starts at robots.txt can at least reach the sitemap.
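
The cross-links can be spot-checked on local copies of the three files. The sketch below is a plain substring check and makes no assumption about the /agents.json schema; check_crosslinks and the directory layout are hypothetical.

```shell
# check_crosslinks DIR: DIR holds local copies of robots.txt, llms.txt and
# agents.json. Prints one line per expected reference and returns non-zero
# if any reference is missing.
check_crosslinks() {
  dir=$1
  missing=0
  # Each entry is "file-to-search:string-that-should-appear-in-it".
  for pair in robots.txt:sitemap.xml llms.txt:sitemap.xml \
              llms.txt:agents.json agents.json:llms.txt; do
    file=${pair%%:*}
    target=${pair#*:}
    if grep -F -- "$target" "$dir/$file" >/dev/null 2>&1; then
      echo "ok: $file references $target"
    else
      echo "missing: $file does not reference $target"
      missing=1
    fi
  done
  return "$missing"
}
```

Because it only looks for the file name as text, a passing result shows that a reference exists, not that the URL in it is correct.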

Step 5 — Verify with curl + Search Console + log analysis

Request sitemap.xml and llms.txt with each bot's user-agent string, using the curl commands in Section 8. Submit the sitemap to Google Search Console; confirm 0 processing errors. Submit the sitemap to Bing Webmaster Tools; confirm the last-read date updates within 24 hours. Set up a weekly log grep and record a baseline of AI bot sitemap hit rates. If ClaudeBot or GPTBot drops to zero for 14+ consecutive days, something has likely regressed in robots.txt or sitemap formatting.
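
The weekly log check could look something like the sketch below. It assumes an access log in the common "combined" format, where the request line and user-agent appear in double quotes; bot_sitemap_hits is a hypothetical helper name, and the log path in the usage line is an example.

```shell
# bot_sitemap_hits LOGFILE: count GET/HEAD requests for /sitemap*.xml(.gz)
# or /llms.txt made by GPTBot and ClaudeBot. The bot name is matched
# anywhere in the log line, which in combined format is the user-agent field.
bot_sitemap_hits() {
  awk '
    /"(GET|HEAD) \/(sitemap[^ ]*\.xml(\.gz)?|llms\.txt)[ ?]/ {
      if (index($0, "GPTBot"))    hits["GPTBot"]++
      if (index($0, "ClaudeBot")) hits["ClaudeBot"]++
    }
    END {
      printf "GPTBot %d\nClaudeBot %d\n", hits["GPTBot"], hits["ClaudeBot"]
    }' "$1"
}
```

Usage: `bot_sitemap_hits /var/log/nginx/access.log`. Run it on each week's log and keep the numbers; a bot that reads zero for two weeks running is the regression signal described above.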

§12 · Continue the Guide

Next stops in the AgentMall guide.

The Window

The agents that can't find your store can't buy from it.

Every day your sitemap is incomplete, your llms.txt is missing, or your lastmod values are wrong, AI crawlers are spending their budget on your competitors' pages instead of yours. Googlebot had decades to build a link-graph model of the web. The new generation of AI crawlers is relying on sitemaps and llms.txt right now — because they don't have that history yet. The merchants who get these files right first build an early crawl-budget advantage that compounds as agent traffic grows. This is not a long build. It is a focused afternoon. Start with the sitemap audit in Step 1.

Open the AgentMall Roadmap →