The Agent-Readable Sitemap — Extending sitemap.xml + llms.txt + OFA

§1 · The Root Problem

Discovery is Problem #1 — agents can't buy what they can't find.

The AgentMall roadmap names three ways the internet fails AI agents. The first, and the one that invalidates everything downstream, is discovery: agents can't find products. The second is comprehension: product pages aren't machine-readable. The third is transaction: no agent-native checkout exists yet. This spoke addresses the sitemap.xml and llms.txt part of Problem #1; the other three discovery spokes cover the rest. The product data spoke handles Problem #2. Layer 4 and the UCP spoke attack Problem #3.

Live Demo — We Run This

This site runs what this guide teaches: a live llms.txt covering all 16 content clusters, alongside a standard sitemap.xml. Open both and compare — one is for search crawlers, one is for AI agents.

The discovery quadrant has four spokes. Each owns one responsibility. You are reading the second, which covers sitemap.xml and llms.txt, plus an honest assessment of the "OFA" capability hint layer.

File	Discovery Spoke	What It Controls	Status
`robots.txt`	Spoke 1 — robots.txt + agents.txt	Who is allowed to crawl what	RFC 9309 Standard
`sitemap.xml` + `llms.txt`	This spoke (Spoke 2)	What URLs exist + which are worth reading	Established / Community proposal
Agent SEO signals	Spoke 3 — Agent SEO	Which results get cited and recommended	Vendor-confirmed + inferred
`/agents` page	Spoke 4 — /agents page	Capability manifest: MCP, auth, endpoints	Draft / no single standard

The three-failure framing matters for how you prioritize: there is no point optimizing agent SEO ranking signals if your sitemap excludes half your product catalog. There is no point building an MCP server if ClaudeBot can't discover your product pages. Fix discovery first, then work outward.

Key Insight

AI crawlers are significantly more dependent on sitemaps than Googlebot. Googlebot has decades of link-graph knowledge and domain trust models. Smaller AI crawlers lack this history and rely on sitemaps as their primary discovery mechanism. A stale or incomplete sitemap hurts AI discovery disproportionately more than it hurts traditional SEO.

§2 · The Protocol

sitemap.xml Protocol 0.9 — what the spec actually says.

The sitemaps.org Protocol 0.9 defines a single XML namespace (http://www.sitemaps.org/schemas/sitemap/0.9) and four fields per <url> entry. Before you optimize for agents, understand what the standard actually gives you.

Field	Required	Format	Google behavior	Bing behavior
`<loc>`	Yes	Absolute URL, < 2,048 chars, URL-escaped	Honored	Honored
`<lastmod>`	No	W3C Datetime (YYYY-MM-DD or full ISO 8601)	Used when consistently accurate; ignored if stale	Must reflect true content modification; recommends full ISO 8601 with timezone
`<changefreq>`	No	always / hourly / daily / weekly / monthly / yearly / never	Ignored entirely	Ignored
`<priority>`	No	0.0–1.0 (default 0.5)	Explicitly ignored	Ignored

Google Search Central documentation is blunt: Google ignores <priority> and <changefreq> entirely. <lastmod> is used only "if it is consistently and verifiably accurate" — Google validates it against actual page modification timestamps and builds a per-domain trust score. If your sitemap stamps every URL with today's date on every regeneration, Google will eventually stop trusting the field. Bing's guidance (July 2025) mirrors this: Bing recommends ISO 8601 with time component (2025-11-14T10:30:00+00:00), ignores changefreq and priority, and says lastmod must reflect the true last modification of the page content — not when the sitemap was regenerated. (Re-verify before launch.)

Size limits and sitemap index files

A single sitemap file: 50,000 URLs maximum, 50 MB uncompressed. Hit either limit and you need a sitemap index file (<sitemapindex>), which can reference up to 50,000 child sitemaps and theoretically cover billions of URLs. Google Search Console allows up to 500 sitemap index files per verified property. For most Shopify and WooCommerce stores with under 50,000 SKUs, a single sitemap is fine. Larger catalogs — marketplaces, multi-vendor platforms — need the index pattern immediately.
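The limits above reduce to simple arithmetic. The sketch below is one way to plan the split; `plan_sitemaps` and `chunk_urls` are hypothetical helper names, and the sketch counts URLs only, so it does not check the 50 MB uncompressed size limit.

```python
import math

MAX_URLS_PER_SITEMAP = 50_000     # per-file URL limit from the protocol
MAX_SITEMAPS_PER_INDEX = 50_000   # child sitemaps one index file may reference


def plan_sitemaps(total_urls: int) -> dict:
    """Return how many child sitemaps and index files a catalog needs."""
    if total_urls < 0:
        raise ValueError("total_urls must be non-negative")
    child_files = math.ceil(total_urls / MAX_URLS_PER_SITEMAP)
    needs_index = child_files > 1
    index_files = math.ceil(child_files / MAX_SITEMAPS_PER_INDEX) if needs_index else 0
    return {
        "child_sitemaps": child_files,
        "needs_index": needs_index,
        "index_files": index_files,
    }


def chunk_urls(urls: list, size: int = MAX_URLS_PER_SITEMAP) -> list:
    """Split a URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

Under these limits a 120,000-URL catalog needs three child sitemaps behind one index file, and a single index tops out at 50,000 × 50,000 = 2.5 billion URLs.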

Why sitemaps matter more for AI crawlers than for Google

Site log analysis across multiple stores shows pages in the sitemap receive approximately 82% crawl coverage from AI bots; pages not in the sitemap receive around 12%. One case study: 47 blog posts added to a sitemap resulted in 31 being indexed by at least one AI crawler within three weeks, with 8 appearing in Perplexity answers. (Re-verify before launch — source: Reddit/GEO_optimization community analysis.) Sitemaps function as a crawl budget allocation tool for agents: they signal which URLs exist, and lastmod signals which URLs changed recently and deserve a revisit.

Runnable sitemap.xml — 5 representative entries, correct format

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

  <url>
    <loc>https://examplestore.com/</loc>
    <lastmod>2025-11-01T09:00:00+00:00</lastmod>
  </url>

  <url>
    <loc>https://examplestore.com/collections/hiking</loc>
    <lastmod>2025-10-28T14:30:00+00:00</lastmod>
  </url>

  <url>
    <loc>https://examplestore.com/products/osprey-atmos-65</loc>
    <lastmod>2025-11-10T08:15:00+00:00</lastmod>
  </url>

  <url>
    <loc>https://examplestore.com/products/merrell-moab-3-gtx</loc>
    <lastmod>2025-11-05T11:00:00+00:00</lastmod>
  </url>

  <url>
    <loc>https://examplestore.com/pages/returns</loc>
    <lastmod>2025-09-15T10:00:00+00:00</lastmod>
  </url>

</urlset>

Notes: changefreq and priority are omitted — both are ignored by Google and Bing. lastmod uses full ISO 8601 with timezone offset, per Bing's recommendation.
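Before submitting a file like the one above, it can help to lint it. The following is a minimal sketch using only the Python standard library; `check_sitemap` is a hypothetical helper, and its date check is deliberately simplified: it accepts what `datetime.fromisoformat` accepts, which on Python versions before 3.11 rejects a trailing `Z` and on all versions rejects year-only W3C Datetime values.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def check_sitemap(xml_text: str) -> list:
    """Return a list of human-readable problems found in a <urlset> document."""
    problems = []
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        if not loc.startswith(("http://", "https://")):
            problems.append(f"loc is not an absolute URL: {loc!r}")
        if len(loc) >= 2048:
            problems.append(f"loc is 2,048 characters or longer: {loc[:60]}...")
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        if lastmod is not None:
            try:
                datetime.fromisoformat(lastmod.strip())
            except ValueError:
                problems.append(f"lastmod is not a parseable date for {loc}: {lastmod!r}")
    return problems
```

An empty list means the structural checks passed. It does not mean the lastmod values are truthful, which is the part search engines actually verify.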

§3 · The Gaps

Where sitemap.xml falls short for agents.

The core limitation: sitemap.xml is a URL inventory, not a semantic map. A sitemap containing 5,000 URLs tells an agent crawler those URLs exist. It tells the agent nothing about what any of them are or can do. None of the six gaps below are solvable by adding more <url> entries — they require a separate layer.

Gap 1 · Semantic Role

No content type signal

Is /p/12345 a product, a policy page, a blog post, an API endpoint, or a checkout confirmation? The sitemap cannot tell an agent. It treats all URLs identically.

Gap 2 · Capability Signals

No action declarations

Does this store accept orders via MCP? Does it expose a SearchAction endpoint? sitemap.xml has no field for capability hints — agents must infer everything from page content.

Gap 3 · Schema Indication

No structured data preview

Which URLs have Product structured data? Which have FAQPage? Which have Event? The sitemap cannot signal this — an agent must fetch each page to find out.

Gap 4 · Render Type

HTML vs SPA vs PDF

Is the page server-rendered HTML, a JavaScript SPA that returns empty markup to headless fetchers, or a PDF? Agent crawlers that don't execute JS will receive blank pages with no warning.

Gap 5 · Language Variants

hreflang often missing

hreflang annotations live in the sitemap or page headers but are often omitted for international stores, leaving agents with US-EN content only — surfacing wrong pricing and shipping to non-US users.

Gap 6 · Pagination Noise

Near-duplicate URL flood

Faceted navigation or query parameters (?sort=price&page=3) flood the sitemap with near-duplicate URLs that dilute agent crawl budget and confuse which URL is canonical.

The Solution

A separate layer is required

llms.txt, /agents.json, and Schema.org action types each address specific gaps. No single extension to the sitemap spec solves all six — you need the full stack described in the next section.

§4 · The Full Stack

The agent-readable sitemap stack — four layers compared.

Three files work together to give agent crawlers what sitemap.xml alone cannot. A fourth file — the capability manifest — is covered in the /agents page spoke. The table below shows their current status honestly.

Layer	File	Purpose	Format	Status	Example Location
1 — Inventory	`sitemap.xml`	Full URL inventory with freshness signals via `lastmod`	XML (sitemaps.org Protocol 0.9)	Established standard	`/sitemap.xml`
2 — Navigation	`llms.txt`	Curated agent navigation: which URLs are worth reading and why	Markdown (llmstxt.org proposal)	Community proposal — no IETF/W3C RFC	`/llms.txt`
2b — Bulk Ingest	`llms-full.txt`	Concatenated full content for single-fetch ingestion by agents	Markdown (same proposal)	Community proposal	`/llms-full.txt`
3 — Capability	`/agents.json` or `/agents`	Capability declaration: MCP endpoints, actions, auth scheme	JSON (no single dominant spec yet)	Draft / vendor-specific — see /agents spoke	`/agents.json`

The fourth slot — a machine-readable capability manifest — is cross-referenced to the /agents page spoke, which covers /agents.json format and content in depth. This spoke owns the discovery and navigation layers; that spoke owns capability declaration. The robots.txt spoke owns the access control layer that governs who is allowed to fetch any of these files.

Platform Cross-Link

Shopify's native /sitemap.xml covers products, collections, pages, and blog posts automatically. WooCommerce requires Yoast SEO or RankMath to generate a compliant sitemap. BigCommerce has built-in XML sitemap generation. Headless stacks (Sanity, Contentful, Strapi) need a build-time sitemap generator. See the Shopify, WooCommerce, BigCommerce, and Headless spokes for platform-specific steps.

§5 · The Navigation Layer

llms.txt Deep Dive — origin, format, and live adoption.

Origin and status

Jeremy Howard (co-founder of Answer.AI and fast.ai) published the initial llms.txt proposal at llmstxt.org on September 3, 2024. The proposal is community-driven and hosted on GitHub. There is no IETF Internet-Draft, no W3C Working Note, and no formal standards body backing as of this writing. Adoption is bottom-up: Mintlify rolled out auto-generation for all docs sites it hosts in late 2024, which alone bootstrapped thousands of sites including Anthropic developer docs and Cursor. (Re-verify current adoption status before launch — this landscape changes rapidly.)

Exact format specification

The llms.txt file must be located at /llms.txt in the root path. It follows strict Markdown structure in a defined order:

Section	Required / Optional	Purpose	Example
`# H1` site title	Required	Identifies the site or project	`# ExampleStore`
`> blockquote` summary	Optional	One-paragraph context for the agent	`> An independent outdoor gear retailer…`
Free Markdown (no headings)	Optional	Additional background, caveats, behavioral notes	Notes about live pricing, no dropshipping, etc.
`## Section heading` + list	Optional, repeatable	Groups of curated links with descriptions	`## Products`, `## Policies`, `## API`
`## Optional` (special name)	Optional	Links skippable for shorter context windows	Blog archive, press kit, secondary docs

Each list item inside a section follows this exact pattern:

- [Link title](https://full-url): One-sentence description of what's on this page.

The description is not decorative. Agents use it as standalone context — they may decide whether to fetch the URL based on the description alone. Specifics beat generics: "Returns policy: 30-day free returns, no restocking fee, prepaid label included" is more useful than "Our returns page."
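To make the list-item pattern concrete, here is a rough sketch of how a consumer might parse one line. The regular expression and the `parse_link_line` name are assumptions for illustration; the llms.txt proposal does not prescribe a parser, and real agents may be more lenient.

```python
import re
from typing import NamedTuple, Optional

# "- [Title](https://url): Description" with the description part optional.
LINK_LINE = re.compile(
    r"^-\s+\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


class LinkEntry(NamedTuple):
    title: str
    url: str
    description: str


def parse_link_line(line: str) -> Optional[LinkEntry]:
    """Parse one llms.txt list item; return None if the line is not a link item."""
    match = LINK_LINE.match(line.strip())
    if match is None:
        return None
    return LinkEntry(match["title"], match["url"], (match["desc"] or "").strip())
```

An entry whose description comes back empty is the case to fix first: the agent then has only the link title to decide whether the page is worth fetching.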

llms.txt vs llms-full.txt

/llms.txt: The curated navigation map. Small file. Points to the important URLs. A human reading it with a text editor would find it readable. IDE agents (Cursor, Claude Code, Windsurf) fetch this to locate relevant pages before diving deeper.

/llms-full.txt: The concatenated full content of every linked page, in a single Markdown file. It is intended for agents that want to ingest the entire relevant content of a site in one request without following individual links. FastHTML publishes a closely related pair of expanded files, llms-ctx.txt (without Optional section links) and llms-ctx-full.txt (with all links expanded), generated with the llms_txt2ctx CLI tool. For a Shopify or WooCommerce merchant, llms.txt alone is usually sufficient. llms-full.txt becomes valuable for SaaS products, developer documentation, or API-heavy sites where agents need full reference material in one pull.

Verified live adoption — fetch results

Site	URL	Status	Notes
Anthropic developer docs	`https://docs.anthropic.com/llms.txt`	200 confirmed	Full structured file with 1,557 English pages listed. (Re-verify before launch.)
Anthropic main site	`https://anthropic.com/llms.txt`	404	Not present at root domain.
Stripe	`https://stripe.com/llms.txt`	200 confirmed	Full product hierarchy in Markdown sections. (Re-verify before launch.)
Cloudflare Developer Docs	`https://developers.cloudflare.com/llms.txt`	200 confirmed	Points to per-product sub-llms.txt files (nested structure). (Re-verify before launch.)
Cursor	`https://cursor.com/llms.txt`	200 confirmed	Docs sections listed; some malformed URLs present (double-domain bug noted). (Re-verify before launch.)
Perplexity	`https://perplexity.ai/llms.txt`	404	Not present.

Complete runnable llms.txt example for a commerce site

# ExampleStore — Outdoor Gear & Apparel

> ExampleStore is an independent US-based retailer of outdoor gear, hiking apparel,
> and camping equipment. We ship to all 50 states with free shipping on orders over $75.
> All products are stocked in our Portland, OR warehouse. We do not dropship.

Important notes for AI assistants:
- Product availability and pricing are live; always reference the product URL, not cached data.
- We accept Visa, Mastercard, Amex, PayPal, and Shop Pay.
- Returns are accepted within 30 days; free prepaid label provided.

## Core Store Pages

- [Homepage](https://examplestore.com/): Main landing page with featured collections and current promotions.
- [All Products](https://examplestore.com/collections/all): Full catalog browsable by category, brand, and activity.
- [Hiking Gear](https://examplestore.com/collections/hiking): Boots, trekking poles, packs, and trail clothing. 400+ SKUs.
- [Camping Equipment](https://examplestore.com/collections/camping): Tents, sleeping bags, stoves, and cookware.
- [Sale](https://examplestore.com/collections/sale): Currently discounted items; updated daily.

## Policies & Customer Service

- [Shipping Policy](https://examplestore.com/pages/shipping): Free shipping over $75 (US only), 2–5 business days standard, next-day available at checkout.
- [Returns & Exchanges](https://examplestore.com/pages/returns): 30-day return window, free prepaid label, exchange or store credit issued within 3 business days.
- [FAQ](https://examplestore.com/pages/faq): Answers to sizing, gift cards, order tracking, and wholesale inquiries.
- [Contact](https://examplestore.com/pages/contact): Email support@examplestore.com; phone 1-800-555-0192 (Mon–Fri 9am–5pm PT).

## Product Highlights

- [Osprey Atmos 65 Pack](https://examplestore.com/products/osprey-atmos-65): Top-selling backpacking pack; sizes XS–XL; currently in stock in all colors.
- [Merrell Moab 3 GTX Hiking Boot](https://examplestore.com/products/merrell-moab-3-gtx): Gore-Tex waterproof; men's and women's versions; sizes 6–14.
- [MSR WhisperLite Stove](https://examplestore.com/products/msr-whisperlite): Multi-fuel backpacking stove; includes pump and maintenance kit.

## Technical / Agent Integration

- [Sitemap](https://examplestore.com/sitemap.xml): Full URL inventory in sitemaps.org Protocol 0.9 format.
- [Agents Manifest](https://examplestore.com/agents.json): Capability endpoints, MCP server location, supported actions, and auth scheme.
- [Structured Data Overview](https://examplestore.com/pages/schema-info): Notes on Product, Offer, and SearchAction JSON-LD present on product pages.

## Optional

- [Brand Story](https://examplestore.com/pages/about): History, mission, and team behind ExampleStore. Skip if context window is tight.
- [Press Kit](https://examplestore.com/pages/press): Brand assets and media contact. Skip for product or shipping queries.
- [Blog](https://examplestore.com/blogs/news): Gear reviews, trail guides, and seasonal picks. Large archive; skim index only unless topic-specific.

§6 · Honest Assessment

OFA — what actually exists in the agent-discovery standards space.

Critical — Read Before Acting

"Open Foundation Agents" or "OFA" as a discrete, named web-discovery specification does not appear in any current standards body documentation, IETF Internet-Draft index, or major vendor's published roadmap as of this writing. Operators do not need to implement anything called "OFA" today. The term may refer to a concept in early-stage discussion that has not yet coalesced into a named, published specification. (Re-verify before launch — the standards landscape evolves quarterly.)

What does exist in the agent-discovery standards space is four active initiatives, none of which map to an HTTP sitemap extension:

Initiative	Org	Date	Scope	Relevance to sitemap/page discovery
AAIF (Agentic AI Foundation)	Linux Foundation	December 2025	Governs MCP (Anthropic), AGENTS.md (OpenAI), Goose (Block). Agent-to-tool protocol interoperability.	None — operates at tool protocol layer, not web page discovery
Agent2Agent (A2A)	Google	April 2025	Agent-to-agent communication and capability discovery via "Agent Cards" in JSON. 50+ partners.	None — agent-to-agent, not crawler-to-sitemap
DNS-AID	Linux Foundation / Infoblox	May 2026	Enables AI agents and MCP servers to use DNS as a vendor-neutral discovery directory.	None — DNS infrastructure layer, not HTTP/sitemap layer
IETF draft-narajala-ans	IETF	Active draft	Agent Name Service (ANS): A Universal Directory for Secure AI Agent Discovery. DNS-based registry for agent identities.	None — not a web sitemap or llms.txt complement. (Re-verify draft status before launch.)

The closest actionable analog to what an agent-discovery extension might eventually become is the /agents.json manifest discussed in the /agents page spoke. Watch aaif.io for evolving standards. Your actionable stack today: complete sitemap.xml with accurate lastmod, a curated llms.txt, and a forward reference to /agents.json. Revisit OFA/AAIF standards quarterly.

§7 · Capability Hints

Schema.org Action Types — SearchAction, BuyAction, and Offer.

Schema.org's potentialAction property lets you attach machine-readable action descriptors directly to pages. For commerce, these types are operationally relevant as forward-looking capability signals. No current AI shopping agent executes BuyAction autonomously against arbitrary merchants today, but publishing this structured data costs nothing and positions your catalog as agent-commerce matures.

Schema.org Type	Hierarchy	Commerce Use Case	Where to Place
`SearchAction`	Thing > Action > SearchAction	Declares your site search endpoint; agent can query your catalog directly	`WebSite` object on homepage
`BuyAction`	Thing > Action > TradeAction > BuyAction	Declares that a purchase can be initiated at a specific endpoint	`Offer` object on product pages
`ReserveAction`	Thing > Action > PlanAction > ReserveAction	Reservation or hold on a product or service	`Offer` on product pages where holds are supported
`OrderAction`	Thing > Action > TradeAction > OrderAction	Placing a complete order	`Offer` on product pages with full order API

SearchAction on the WebSite object (sitewide)

This JSON-LD, placed inside a <script type="application/ld+json"> element in the <head> of your homepage, tells agents where your search endpoint is. An agent that parses this block knows it can query your catalog by issuing a GET request, with no human interaction required.

SearchAction — WebSite object, full runnable JSON-LD

{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "ExampleStore",
  "url": "https://examplestore.com",
  "potentialAction": {
    "@type": "SearchAction",
    "target": {
      "@type": "EntryPoint",
      "urlTemplate": "https://examplestore.com/search?q={search_term_string}"
    },
    "query-input": "required name=search_term_string"
  }
}
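To illustrate what an agent could do with the block above, here is a minimal sketch that turns the urlTemplate into a concrete search URL. `build_search_url` is a hypothetical helper, and it assumes potentialAction is a single object with an EntryPoint target, as in this example; real pages may use arrays or a plain string target.

```python
import json
from urllib.parse import quote


def build_search_url(jsonld_text: str, query: str) -> str:
    """Fill the SearchAction urlTemplate with a percent-encoded query string."""
    data = json.loads(jsonld_text)
    action = data["potentialAction"]
    template = action["target"]["urlTemplate"]
    # "query-input" looks like "required name=search_term_string"
    placeholder = action["query-input"].split("name=", 1)[1].strip()
    return template.replace("{" + placeholder + "}", quote(query, safe=""))
```

For the ExampleStore block, the query "trail running shoes" yields https://examplestore.com/search?q=trail%20running%20shoes.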

BuyAction + Offer on a product page

BuyAction — Product + Offer, full runnable JSON-LD

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Osprey Atmos 65 Backpack",
  "description": "Multi-day backpacking pack with anti-gravity suspension. Available in sizes XS–XL.",
  "sku": "OSP-ATMOS65-GRN-M",
  "gtin14": "00070159040031",
  "brand": {
    "@type": "Brand",
    "name": "Osprey"
  },
  "offers": {
    "@type": "Offer",
    "url": "https://examplestore.com/products/osprey-atmos-65",
    "priceCurrency": "USD",
    "price": "289.95",
    "availability": "https://schema.org/InStock",
    "seller": {
      "@type": "Organization",
      "name": "ExampleStore"
    },
    "potentialAction": {
      "@type": "BuyAction",
      "target": "https://examplestore.com/cart/add?id=osprey-atmos-65"
    }
  }
}
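As a rough illustration of how a consumer might read the Product block above, the sketch below pulls out the fields an agent would need before acting. `summarize_offer` is a hypothetical helper and assumes "offers" is a single Offer object, as in this example, rather than a list.

```python
import json
from decimal import Decimal


def summarize_offer(jsonld_text: str) -> dict:
    """Extract price, stock status, and any BuyAction target from Product JSON-LD."""
    product = json.loads(jsonld_text)
    offer = product["offers"]
    action = offer.get("potentialAction", {})
    is_buy = action.get("@type") == "BuyAction"
    return {
        "name": product["name"],
        "price": Decimal(offer["price"]),  # Decimal avoids float rounding on money
        "currency": offer["priceCurrency"],
        "in_stock": offer["availability"].endswith("/InStock"),
        "buy_target": action.get("target") if is_buy else None,
    }
```

A missing BuyAction simply produces buy_target of None, which matches how the markup works today: it is a hint, not a guarantee that an endpoint accepts agent orders.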

For deeper structured data coverage — including AggregateRating, hasMerchantReturnPolicy, shippingDetails, and GTIN — see the product data spoke. For the full agent-ready API that these action types point to, see the API spoke. Platform-specific schema implementation is covered in the Etsy, Shopify, WooCommerce, and BigCommerce spokes.

§8 · Crawler Behavior

Per-agent crawler behavior — what each bot actually does with your sitemap.

Based on vendor documentation and observed log data. Treat specific crawl frequencies as directional, not guaranteed — OpenAI, Anthropic, and Perplexity have not published formal documentation about how their crawlers consume sitemaps. (Re-verify before launch — all log study findings are based on third-party analysis and may not reflect current behavior.)

Bot	Operator	Category	Sitemap Behavior	Notes
`GPTBot`	OpenAI	AI Training Crawler	Reads `sitemap.xml`; behavior confirmed in logs as of March 2026. Approximately 4,200 hits/day on active sites.	Blocking GPTBot does NOT block ChatGPT shopping — that is OAI-SearchBot + ChatGPT-User. See robots.txt spoke for policy guidance.
`OAI-SearchBot`	OpenAI	Agent Indexer	Purpose: surfacing sites in ChatGPT Search. Checks robots.txt 3–6×/day; sitemap behavior not explicitly documented by OpenAI.	Routes through Bing's index for ChatGPT Search results. Verify Bing Webmaster Tools as the proxy for OAI-SearchBot coverage. Allow always for commerce.
`ClaudeBot`	Anthropic	AI Training Crawler	Began reading `sitemap.xml` March 18, 2026 (confirmed across multiple site log studies). Previously ignored it. Approximately 1,800 hits/day on active sites. 85% of traffic historically on robots.txt.	Not documented by Anthropic. Operator's call on whether to allow training crawlers. See robots.txt spoke for the SEO-vs-training tension.
`Claude-SearchBot`	Anthropic	Agent Indexer	Active sitemap consumer; became 2nd most active sitemap reader after Bingbot in at least one log study. (Re-verify before launch.)	Allow always for commerce — this is the indexer that surfaces your products in Claude queries.
`PerplexityBot`	Perplexity	Agent Indexer	Primarily on-demand; fetches when user query references the domain. Burst pattern rather than scheduled crawl. Some logs show it does not request `sitemap.xml` regularly. Approximately 980 hits/day on active sites.	Allow always. Perplexity treats products without GTIN as effectively invisible in shopping results — see agent SEO spoke for the GTIN fix.
`Google-Extended`	Google	robots.txt control token	No crawler of its own and no separate sitemap submission; Google's existing crawlers do the fetching, using the same sitemap infrastructure as Googlebot. Because the token sends no user-agent string, it cannot be isolated in access logs, so treat the 14-day revisit cadence and approximately 540 hits/day reported in log studies as Googlebot-family traffic. (Re-verify before launch.)	Google-Extended is a robots.txt permission token, not an independent crawler. It controls Gemini training and grounding data. Disallowing it does NOT affect Google Search rankings or Googlebot indexing.
`Bytespider`	ByteDance	AI Training Crawler	Crawls commercial pages aggressively — 1.8-day revisit cadence on retail sites observed in logs. Sitemap behavior confirmed in log analysis.	Operator's call. Some commerce operators block Bytespider as a training crawler. Confirm it is not also used as a buyer or indexer for any TikTok Shop integration before blocking. (Re-verify before launch.)
`Applebot-Extended`	Apple	AI Training Crawler	Not covered in current log study literature at the depth of the above.	Controls Apple Intelligence training data. Operator's call on whether to allow.

Verification — curl, Search Console, Bing Webmaster, IndexNow, and logs

Test what agent crawlers actually receive from your sitemap. These commands simulate the user-agent string each bot sends:

Manual verification with curl — simulate each major bot

# Check your sitemap returns 200 (headers only; this does not validate the XML)
curl -I https://examplestore.com/sitemap.xml

# Simulate GPTBot fetching your sitemap
curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot" \
  https://examplestore.com/sitemap.xml

# Simulate ClaudeBot
curl -A "ClaudeBot/1.0 (+claudebot@anthropic.com)" \
  https://examplestore.com/sitemap.xml

# Simulate OAI-SearchBot
curl -A "OAI-SearchBot/1.0; +https://openai.com/searchbot" \
  https://examplestore.com/sitemap.xml

# Check llms.txt is live
curl -I https://examplestore.com/llms.txt

# Check llms-full.txt (if deployed)
curl -I https://examplestore.com/llms-full.txt

If any of these return a 403 or redirect, your server or WAF may be blocking the user-agent string. Cross-reference with your bot policy to ensure you are not unintentionally blocking crawlers you want. (Re-verify user-agent strings before launch — vendors add and rename bots.)

Log analysis — grep for AI bot sitemap hits

# Find all AI bot hits to sitemap.xml in nginx/Apache combined log format
# (Google-Extended is omitted: it is a robots.txt token and never appears as a user-agent)
grep -iE "(GPTBot|ClaudeBot|OAI-SearchBot|PerplexityBot|Claude-SearchBot)" /var/log/nginx/access.log \
  | grep "sitemap" \
  | awk '{print $1, $7, $9}' \
  | sort | uniq -c | sort -rn

Watch for: 200 on sitemap.xml — crawler received the file. 200 on llms.txt — crawler received your agent navigation layer. 404 or 403 on either — fix immediately. Sudden spike in sitemap hits from a bot you haven't seen before — new crawler rollout (the March 2026 simultaneous ClaudeBot/GPTBot sitemap adoption is a precedent).
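If you want the bot name in the output rather than just IP, path, and status, a small script can do the same job. This is a sketch that assumes the standard combined log format; the bot list, the `count_bot_hits` name, and the path prefixes are illustrative choices, not a vendor-supplied list.

```python
import re
from collections import Counter

BOTS = ("GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "Claude-SearchBot")
TARGETS = ("/sitemap", "/llms.txt", "/llms-full.txt")

# combined format: ip - user [time] "METHOD path proto" status size "referer" "user-agent"
LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_bot_hits(lines) -> Counter:
    """Count hits per (bot, path, status) on sitemap and llms.txt URLs."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match or not match["path"].startswith(TARGETS):
            continue
        user_agent = match["ua"].lower()
        for bot in BOTS:
            if bot.lower() in user_agent:
                counts[(bot, match["path"], match["status"])] += 1
                break
    return counts
```

Feed it an open log file, for example `count_bot_hits(open("/var/log/nginx/access.log"))`, and look first at any key whose status is not "200".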

Google Search Console

Google Search Console lets you submit sitemaps, inspect coverage, and see which URLs were indexed. Google-Extended does not have a dedicated GSC report — it piggybacks on Googlebot's infrastructure. Submit your sitemap via GSC and confirm 0 processing errors before any launch.

Bing Webmaster Tools

Bing Webmaster Tools accepts direct sitemap submission and provides submission status, last-read date (confirms Bing fetched the sitemap), and processing errors. Since OAI-SearchBot routes through Bing's index for ChatGPT Search results, verifying Bing indexing is the proxy for OAI-SearchBot coverage.

IndexNow for real-time URL push

Bing and Yandex support IndexNow — a real-time URL-level push protocol. When you update a product page, push the URL to IndexNow immediately rather than waiting for Bing's scheduled sitemap crawl. Shopify and many WooCommerce plugins support IndexNow natively. (Re-verify availability per platform before launch.)
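For stacks without a native integration, the push is a single JSON POST. The sketch below only builds the request so you can inspect it before sending. The endpoint and the field names (host, key, urlList) follow the public IndexNow documentation as best understood here and should be confirmed against it; `build_indexnow_request` is a hypothetical helper, and the key is assumed to be hosted as a text file at your site root as the protocol requires.

```python
import json
from urllib.parse import urlsplit
from urllib.request import Request

# Assumption: shared IndexNow endpoint; participating engines also expose their own.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host: str, key: str, urls: list) -> Request:
    """Build (but do not send) an IndexNow POST for URLs on a single host."""
    foreign = [u for u in urls if urlsplit(u).hostname != host]
    if foreign:
        raise ValueError(f"URLs not on {host}: {foreign}")
    body = json.dumps({"host": host, "key": key, "urlList": list(urls)}).encode("utf-8")
    return Request(
        INDEXNOW_ENDPOINT,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
```

Sending is then one call to `urllib.request.urlopen(request)`. Keeping build and send separate makes it easy to log exactly which URLs were pushed after a product update.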

§9 · Common Mistakes

Eight ways sitemap and llms.txt configurations fail in production.

1. Inaccurate lastmod values

Regenerating your sitemap daily and stamping every URL with today's date destroys Google's and Bing's trust in your freshness signals. They stop honoring the field, and AI crawlers that rely on lastmod to decide revisit priority can no longer tell which pages actually changed, because the whole catalog looks equally fresh.

<!-- Wrong: sitemap generation timestamp -->
<lastmod>2025-11-14</lastmod>

<!-- Right: actual page content modification timestamp -->
<lastmod>2025-09-22T11:30:00+00:00</lastmod>

Set lastmod programmatically from your CMS's updated_at field, not from Date.now() in your sitemap generator.
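In a Python-based generator the same rule might look like the sketch below. `lastmod_from_updated_at` is a hypothetical helper, and treating a naive timestamp as UTC is an assumption you should check against how your CMS stores dates.

```python
from datetime import datetime, timezone


def lastmod_from_updated_at(updated_at: datetime) -> str:
    """Format a record's own modification time as ISO 8601 with a UTC offset."""
    if updated_at.tzinfo is None:
        # Assumption: naive timestamps from the CMS are already in UTC.
        updated_at = updated_at.replace(tzinfo=timezone.utc)
    return updated_at.astimezone(timezone.utc).replace(microsecond=0).isoformat()
```

A record last edited at 11:30 UTC on 22 September 2025 becomes 2025-09-22T11:30:00+00:00, no matter when the sitemap itself is rebuilt.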

2. Paginated product URLs flooding the sitemap

Shopify and WooCommerce collection pages generate /collections/hiking?page=2, /collections/hiking?sort_by=price, etc. Listing these in the sitemap wastes agent crawl budget on near-duplicate content and confuses which URL is canonical. Fix: Include only canonical collection landing pages. Exclude query-string variants with ?page=, ?sort=, ?filter=. In Shopify, use a custom sitemap template or a sitemap app that respects canonical logic. In WooCommerce, Yoast SEO excludes paginated variants by default when configured correctly.
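If you generate the sitemap yourself, the exclusion rule can be a simple filter. This sketch assumes the noise parameters named above plus Shopify's sort_by; the parameter set and function names are illustrative and should be adapted to your own URL scheme.

```python
from urllib.parse import parse_qs, urlsplit

# Query parameters that only re-order, paginate, or facet an existing page.
NOISE_PARAMS = {"page", "sort", "sort_by", "filter"}


def is_canonical_candidate(url: str) -> bool:
    """True when the URL carries none of the pagination/sort/filter parameters."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return not (NOISE_PARAMS & set(query))


def filter_sitemap_urls(urls: list) -> list:
    """Keep only URLs that are safe to list in the sitemap."""
    return [u for u in urls if is_canonical_candidate(u)]
```

This is a blocklist, so unknown parameters pass through. A stricter variant would drop any URL with a query string at all.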

3. One giant sitemap exceeding 50,000 URLs or 50 MB

Google and Bing stop processing a sitemap once it hits either limit. The fix is a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://examplestore.com/sitemap-products.xml</loc>
    <lastmod>2025-11-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://examplestore.com/sitemap-pages.xml</loc>
    <lastmod>2025-11-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://examplestore.com/sitemap-collections.xml</loc>
    <lastmod>2025-11-14</lastmod>
  </sitemap>
</sitemapindex>

4. llms.txt pointing to 404 URLs

An llms.txt pointing to 404 URLs is worse than no llms.txt — agents waste fetch budget on dead links and may penalize the file's reliability. Run a link checker as part of your deployment pipeline:

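# Extract every http(s) URL from llms.txt and print any that do not return 200
curl -s https://examplestore.com/llms.txt | grep -oE 'https?://[^)[:space:]]+' | sort -u | while read -r url; do
  status=$(curl -o /dev/null -s -w "%{http_code}" "$url")
  echo "$status $url"
done | grep -v "^200"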

5. Not declaring the sitemap in robots.txt

Some AI crawlers (checking robots.txt before crawling) will only discover your sitemap if it's declared there. If you only submitted it through Google Search Console, agents that don't use GSC (which is all of them except Googlebot) won't find it automatically. Fix: Add to every site's robots.txt:

Sitemap: https://examplestore.com/sitemap.xml
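You can confirm the declaration is machine-readable with the standard library's robots.txt parser. A minimal sketch, assuming Python 3.8 or later (where `site_maps()` was added); `declared_sitemaps` is a hypothetical wrapper name.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt: str) -> list:
    """Return the Sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []
```

An empty result for your live robots.txt means crawlers that rely on that file have no pointer to your sitemap.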

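6. Serving a gzipped sitemap without declaring it properly

Sitemaps.org supports gzip compression, but the 50 MB limit applies to the uncompressed size, and the Sitemap directive in robots.txt must point to the URL you actually serve. If you publish a pre-compressed file, name it sitemap.xml.gz, declare that URL, and confirm that it downloads as a valid gzip archive. A Content-Encoding: gzip header on a plain sitemap.xml response is a different mechanism (transport compression applied by the server) and does not require a .gz URL.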

# In robots.txt — point to the actual file served
Sitemap: https://examplestore.com/sitemap.xml.gz
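To check that a downloaded file really is a gzip archive and still fits the uncompressed limit, something like the following sketch works on the raw response bytes. `inspect_sitemap_body` is a hypothetical helper; fetching the bytes is left to your HTTP client of choice.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"              # first two bytes of every gzip stream
MAX_UNCOMPRESSED = 50 * 1024 * 1024   # 50 MB limit applies after decompression


def inspect_sitemap_body(body: bytes) -> dict:
    """Report whether the body is gzip and whether it fits the size limit."""
    is_gzip = body[:2] == GZIP_MAGIC
    raw = gzip.decompress(body) if is_gzip else body
    return {
        "is_gzip": is_gzip,
        "uncompressed_bytes": len(raw),
        "within_limit": len(raw) <= MAX_UNCOMPRESSED,
    }
```

If a URL ending in .gz reports is_gzip as False, the server is most likely serving the plain XML under the wrong name.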

7. llms.txt written as marketing copy instead of agent-targeted navigation

"We're the #1 choice for outdoor adventurers who demand quality" tells an agent nothing useful. The file should be navigation and context, not branding. Every line in llms.txt should answer: "What is on this page and why would an agent need it?"

# Wrong — marketing copy
- [Products](https://examplestore.com/collections/all): Explore our incredible selection of top-quality outdoor gear!

# Right — agent navigation
- [All Products](https://examplestore.com/collections/all): Full catalog; 400+ SKUs across hiking, camping, and apparel. Filterable by brand, activity, and price.

8. Forgetting hreflang annotations for international stores

If you sell to Canada, the UK, or Australia with localized URLs, omitting hreflang in your sitemap means agent crawlers may only find and index your US-EN pages, surfacing US pricing and shipping terms to non-US users. Fix: add hreflang links to your sitemap entries and include the xmlns:xhtml namespace declaration in your <urlset> tag. Every localized URL needs its own <url> entry that lists all variants, itself included; the example shows the US entry only:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://examplestore.com/products/osprey-atmos-65</loc>
    <lastmod>2025-11-10T08:15:00+00:00</lastmod>
    <xhtml:link rel="alternate" hreflang="en-us" href="https://examplestore.com/products/osprey-atmos-65"/>
    <xhtml:link rel="alternate" hreflang="en-ca" href="https://examplestore.ca/products/osprey-atmos-65"/>
  </url>
</urlset>
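
Writing these entries by hand gets error-prone once several locales are involved. The sketch below generates them with Python's standard library, assuming you can supply one dict per page that maps each hreflang code to its localized URL (that input shape and the function name build_hreflang_urlset are assumptions). It emits one <url> entry per localized URL, each carrying the full set of alternates.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", SM)          # sitemap elements use the default namespace
ET.register_namespace("xhtml", XHTML)  # hreflang links use the xhtml: prefix


def build_hreflang_urlset(pages: list) -> str:
    """pages: one dict per page, mapping hreflang code -> localized URL."""
    urlset = ET.Element(f"{{{SM}}}urlset")
    for alternates in pages:
        for url in alternates.values():
            entry = ET.SubElement(urlset, f"{{{SM}}}url")
            ET.SubElement(entry, f"{{{SM}}}loc").text = url
            # Every entry repeats all alternates, including a link to itself
            for lang, href in alternates.items():
                ET.SubElement(
                    entry,
                    f"{{{XHTML}}}link",
                    {"rel": "alternate", "hreflang": lang, "href": href},
                )
    return ET.tostring(urlset, encoding="unicode")
```

Calling it with {"en-us": "https://examplestore.com/products/osprey-atmos-65", "en-ca": "https://examplestore.ca/products/osprey-atmos-65"} yields two <url> entries with two links each. lastmod is left out here for brevity.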

§10 · FAQ

Frequently asked questions.

Do AI agents actually use my sitemap?

Yes, and more than you might expect. Site log analyses indicate that AI crawlers lean on sitemaps as a primary discovery mechanism, more so than Googlebot, which has 25+ years of link-graph data to fall back on. In those analyses, pages in the sitemap receive approximately 82% crawl coverage from AI bots, while pages not in the sitemap receive around 12%. This behavior is recent and still shifting: ClaudeBot and GPTBot both began reading sitemap.xml in March 2026, after years of ignoring it. (Re-verify before launch; crawler behavior evolves.)

What is the difference between sitemap.xml and llms.txt?

sitemap.xml is an inventory — it lists every URL on your site that should be indexed, with freshness signals. It tells agents that URLs exist and when they were last changed. It is entirely URL-focused, with no semantic meaning attached to any individual URL. llms.txt is a navigation guide — it tells agents which of those URLs are worth reading, in what order, and why. It is curated (not exhaustive), written in Markdown, and designed for agent inference at the moment a user asks a question. The two complement each other: sitemap for coverage, llms.txt for signal.

Do I need llms-full.txt or is llms.txt enough?

For most merchants: llms.txt alone is enough. llms-full.txt is valuable when your most important content is documentation or API reference — places where an agent needs to ingest large amounts of technical detail in a single fetch. If you run a documentation-heavy site, developer tool, or SaaS product, ship both. Mintlify, Fern, GitBook, and Vercel Docs generate both automatically for all hosted sites.

How often should lastmod update?

Update lastmod only when the actual content of the page changes. If you update a product's price, description, or availability, update lastmod. Do not update it when you regenerate the sitemap file itself, change navigation, or make backend-only changes that don't affect page content. Google and Bing both validate lastmod against actual page modification history; consistent accuracy builds trust and causes the field to be honored. Inconsistent stamping causes it to be ignored.
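
One way to enforce this rule is to fingerprint the fields that count as page content and advance lastmod only when the fingerprint changes. The sketch below assumes a product is a plain dict and that title, description, price, and availability are the content-bearing fields; both the field list and the function names are illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone

# Fields treated as "page content" (assumption; adjust to your catalog)
CONTENT_FIELDS = ("title", "description", "price", "availability")


def content_fingerprint(product: dict) -> str:
    """Hash only the fields whose changes should move lastmod."""
    relevant = {field: product.get(field) for field in CONTENT_FIELDS}
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def next_lastmod(product: dict, previous_fingerprint: str, previous_lastmod: str, now=None):
    """Return (lastmod, fingerprint); lastmod advances only if content changed."""
    fingerprint = content_fingerprint(product)
    if fingerprint == previous_fingerprint:
        return previous_lastmod, fingerprint
    now = now or datetime.now(timezone.utc)
    return now.isoformat(timespec="seconds"), fingerprint
```

Store the fingerprint next to each URL between sitemap builds. Regenerating the file, or changing a field outside CONTENT_FIELDS, then leaves lastmod untouched.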

Will agents see products if they're behind JavaScript rendering?

No. Agent crawlers are overwhelmingly plain HTTP fetchers that do not execute JavaScript. A Shopify store that server-renders product pages is fine. A React SPA that returns an empty <div id="app"></div> to non-JS fetchers is invisible to agent crawlers. Headless commerce implementations using Next.js or Nuxt with server-side rendering are generally safe. Your llms.txt should link to URLs that return clean, parseable HTML or Markdown to a plain GET request, not SPA routes that require JS to populate. Test with curl -A "ClaudeBot/1.0" https://yourstore.com/products/example to see what the agent actually receives.
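
To automate the curl check, measure how much visible text the raw HTML contains before any script runs. This is a rough sketch: the 200-character threshold is an arbitrary assumption, and looks_like_empty_shell is a hypothetical helper rather than a standard tool.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """True when a non-JS fetcher would see almost no readable text."""
    return visible_text_length(html) < min_chars
```

Pass it the body returned by a plain GET request. A True result on a product page suggests the content only appears after client-side rendering.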

Does Google-Extended use the same sitemap as Googlebot?

Yes. Google-Extended is a user-agent token — a permission layer, not a separate crawler infrastructure. It operates on top of Googlebot's existing crawl. Web publishers use the Google-Extended robots.txt token to control whether Google can use their content for Gemini model training and grounding. There is no separate sitemap submission for Google-Extended, no separate Search Console view, and no additional sitemap configuration needed. Your existing sitemap infrastructure covers it.

How big can a sitemap get before I need a sitemap index?

A single sitemap file has two limits: 50,000 URLs and 50 MB uncompressed. Hit either limit and you need a sitemap index. For a Shopify store with 500 products, a standard single sitemap is fine. For a marketplace or large catalog with tens of thousands of SKUs, a sitemap index with separate files per content type (products, collections, pages, blog posts) is the right architecture. Google Search Console supports up to 500 sitemap index files per verified property.
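
The split itself is mechanical. The sketch below chunks a URL list at the 50,000-entry limit and renders the index file; it does not check the 50 MB size limit, and the child file naming (sitemap-products-N.xml) is an assumption for illustration.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000  # protocol limit per sitemap file


def chunk_urls(urls: list, size: int = MAX_URLS_PER_FILE) -> list:
    """Split a URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def render_sitemap_index(child_sitemap_urls: list) -> str:
    """Render a sitemap index that points at each child sitemap file."""
    entries = "\n".join(
        f"  <sitemap><loc>{escape(url)}</loc></sitemap>" for url in child_sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>\n"
    )
```

A catalog of 120,001 product URLs would produce three child files (50,000, 50,000, and 20,001 entries) and an index with three <sitemap> entries.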

Is OFA something I need to implement now?

No. As documented in Section 6, "Open Foundation Agents" or "OFA" as a discrete, named web-discovery specification does not appear in any current standards body documentation or major vendor's published roadmap. The active agent-interoperability standards (AAIF/MCP, Agent2Agent, DNS-AID) operate at the agent-to-agent and agent-to-tool protocol layer, not the page-discovery layer. Your actionable stack today is: complete sitemap.xml with accurate lastmod, a curated llms.txt, and a forward reference to /agents.json (covered in the /agents page spoke). Revisit OFA/AAIF standards quarterly as they evolve.

§11 · Step-by-Step

The sitemap build, in five steps.

Each step mirrors the HowTo JSON-LD at the top of this page word for word. Execute in order. Most operators can complete all five steps in a single focused afternoon.

Step 1 — Audit current sitemap.xml coverage

Pull your existing sitemap and cross-check it against your actual URL inventory. Count the URLs in the current sitemap, then check for HTTP 200 on a sample of product URLs, using the commands below. Confirm: every published product page is listed; paginated and filtered variants are excluded; 404s and redirect chains are cleaned up; the file doesn't exceed 50,000 URLs.

# Count <loc> entries in the current sitemap (grep -o counts every match, even in a minified one-line file)
curl -s https://examplestore.com/sitemap.xml | grep -o '<loc>' | wc -l

# Check for HTTP 200 on a random sample of 20 URLs (grep -P requires GNU grep)
curl -s https://examplestore.com/sitemap.xml | grep -oP '(?<=<loc>)[^<]+' | shuf -n 20 | while read -r url; do
  echo "$(curl -o /dev/null -s -w '%{http_code}' "$url") $url"
done
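
The cross-check against your URL inventory can be sketched in Python as well. This assumes you can export the set of published product URLs from your commerce platform (the published_urls argument) and treats any URL with a query string as a likely paginated or filtered variant; both are assumptions, and audit_sitemap is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def audit_sitemap(sitemap_xml, published_urls: set) -> dict:
    """Compare the URLs listed in a sitemap against the expected inventory."""
    root = ET.fromstring(sitemap_xml)
    listed = {
        loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text
    }
    return {
        "count": len(listed),
        "missing": sorted(published_urls - listed),             # published but not listed
        "with_query_string": sorted(u for u in listed if urlsplit(u).query),
        "over_limit": len(listed) > 50_000,
    }
```

An empty "missing" list and an empty "with_query_string" list cover the first two confirmation points in this step.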

Step 2 — Add accurate lastmod and ensure product URLs are present

For Shopify: use a sitemap app (Sitemap by Slayback, or native Shopify sitemap at /sitemap.xml) and verify lastmod maps to updated_at from the Shopify Storefront API. For WooCommerce: Yoast SEO or RankMath generate compliant sitemaps by default; verify the lastmod source in the plugin settings. For headless: generate sitemap server-side using a build-time sitemap generator that reads updated_at from your commerce platform's API.

Step 3 — Author /llms.txt with curated agent navigation

Create the file following the exact format in Section 5. Include: site title (H1), one-paragraph context blockquote, important behavioral notes (pricing is live, no dropshipping, etc.), sections for core pages, policies, product highlights, and agent/technical links. Use specific, factual descriptions for every link. Deploy at https://yourstore.com/llms.txt. Verify with curl -I https://yourstore.com/llms.txt returns 200.
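
If you keep the link list in structured form, the file can be rendered rather than hand-edited. The sketch below follows the general layout described in this step (H1 title, blockquote summary, notes, then H2 sections of link lists); the function name and input shape are assumptions, and Section 5 remains the reference for the exact format.

```python
def render_llms_txt(title: str, summary: str, notes: list, sections: dict) -> str:
    """Render llms.txt from a title, a one-paragraph summary, notes, and link sections.

    sections maps a heading to a list of (link_title, url, description) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for note in notes:
        lines.append(f"- {note}")
    if notes:
        lines.append("")
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link_title, url, description in links:
            lines.append(f"- [{link_title}]({url}): {description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Write the returned string to the file served at /llms.txt, then run the curl -I check above.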

Step 4 — Declare sitemap in robots.txt and cross-link from /agents.json

Add or confirm: Sitemap: https://yourstore.com/sitemap.xml in your robots.txt. Add a Technical section in llms.txt pointing to your sitemap and to /agents.json. The agents manifest (covered in the /agents page spoke) should in turn reference llms.txt. These three files form a self-referencing discovery graph that any agent can navigate starting from any entry point.
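
The self-referencing graph can be checked with a few string tests once the three files are fetched. Because the agents manifest has no single standard format, the sketch below treats it as opaque text and only looks for the llms.txt URL inside it; that simplification, and the helper names, are assumptions.

```python
def sitemap_directives(robots_txt: str) -> list:
    """Return the URLs declared by Sitemap: lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def missing_cross_links(robots_txt: str, llms_txt: str, agents_manifest: str, origin: str) -> list:
    """List the cross-links that are absent; an empty list means the graph is closed."""
    problems = []
    sitemap_url = f"{origin}/sitemap.xml"
    if sitemap_url not in sitemap_directives(robots_txt):
        problems.append("robots.txt does not declare the sitemap")
    for target in (sitemap_url, f"{origin}/agents.json"):
        if target not in llms_txt:
            problems.append(f"llms.txt does not link to {target}")
    if f"{origin}/llms.txt" not in agents_manifest:
        problems.append("agents manifest does not reference llms.txt")
    return problems
```

Call it with origin set to your store's base URL, for example "https://yourstore.com", and the bodies of the three files.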

Step 5 — Verify with curl + Search Console + log analysis

Curl each bot user-agent against sitemap.xml and llms.txt using the commands in Section 8. Submit sitemap to Google Search Console; confirm 0 processing errors. Submit sitemap to Bing Webmaster Tools; confirm last-read date updates within 24 hours. Set up a weekly log grep and baseline AI bot sitemap hit rates. If ClaudeBot or GPTBot drops to zero for 14+ consecutive days, something regressed in robots.txt or sitemap formatting.
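
The weekly log check can be sketched as below. It assumes access logs in the common "combined" format, matches bots by a substring of the user-agent, and watches any path starting with /sitemap or /llms.txt; the regular expression and helper names are illustrative and may need adjusting for your server's log layout.

```python
import re
from datetime import date, datetime

# Combined log format: ... [day/Mon/year:time zone] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]+\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
BOTS = ("ClaudeBot", "GPTBot")
WATCHED_PREFIXES = ("/sitemap", "/llms.txt")


def last_seen(log_lines) -> dict:
    """Map each bot to the most recent day it fetched a watched file."""
    seen = {}
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or not match.group("path").startswith(WATCHED_PREFIXES):
            continue
        # %b parsing assumes English month abbreviations (C/English locale)
        day = datetime.strptime(match.group("day"), "%d/%b/%Y").date()
        for bot in BOTS:
            if bot in match.group("ua"):
                seen[bot] = max(seen.get(bot, day), day)
    return seen


def silent_bots(log_lines, today: date, threshold_days: int = 14) -> list:
    """Bots with no watched-file hit in the last `threshold_days` days."""
    seen = last_seen(log_lines)
    return [
        bot for bot in BOTS
        if bot not in seen or (today - seen[bot]).days >= threshold_days
    ]
```

Run silent_bots over the lines of your access log with today's date. Any bot it returns has crossed the 14-day threshold described above.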

§12 · Continue the Guide

Next stops in the AgentMall guide.

Discovery Spoke 1

robots.txt + agents.txt

The bot policy layer: allow the agent buyers and indexers, make an informed decision on training crawlers, and get the full 18-bot directory with user-agent strings.

Discovery Spoke 4

Add an /agents Page

The capability manifest that tells agents what you do, what endpoints you expose, how to authenticate, and what actions are available — the forward reference your llms.txt Technical section points to.

Pillar

The Full AgentMall Roadmap

The pillar page that ties all four discovery spokes, the four-layer agent-ready model, and every platform spoke together into one 30-day operator plan.

The Window

The agents that can't find your store can't buy from it.

Every day your sitemap is incomplete, your llms.txt is missing, or your lastmod values are wrong, AI crawlers are spending their budget on your competitors' pages instead of yours. Googlebot had decades to build a link-graph model of the web. The new generation of AI crawlers is relying on sitemaps and llms.txt right now — because they don't have that history yet. The merchants who get these files right first build an early crawl-budget advantage that compounds as agent traffic grows. This is not a long build. It is a focused afternoon. Start with the sitemap audit in Step 1.

Open the AgentMall Roadmap →
