AI Crawler Access

An AuditSpark.io whitepaper on how robots.txt, bot policy, and preview controls decide whether answer engines can see you at all.

Website intelligence that sparks action.

TL;DR

AI visibility begins before content quality. If an answer engine cannot reach your pages, nothing else you do to your content matters.
The major AI providers run several crawlers each, and they do different jobs. One handles model training, another surfaces you in the engine's search features, another fetches a page when a live user asks. Blocking the wrong one has uneven and often surprising effects.
A blanket "block all bots" rule, frequently added years ago to stop scrapers, can quietly remove you from the search-oriented crawlers that put you into AI answers while doing nothing about training.
Access controls are advisory, not a firewall. robots.txt asks well-behaved crawlers to comply. Some user-triggered fetchers behave like a browser and generally ignore it, and compliance overall depends on the operator.
Crawler policy is now a business decision with three distinct levers: what should be findable, what may be used for training, and what should be protected. Google-Extended and Cloudflare's Content Signals exist precisely to separate those levers.
This is auditable today. Most sites have never checked it. AuditSpark.io reviews access across the relevant crawlers and flags the rules that are costing you visibility.

Start here: run a free AuditSpark.io AI and GEO Readiness audit to see whether your site is reachable by the crawlers that feed AI answers.

Executive summary

The first paper in this series argued that websites now compete for inclusion, citation, and recommendation inside generated answers, and that a site can rank well and still be absent from the answer. This paper goes one layer deeper, into the most mechanical and most overlooked of those failure modes: access.

Before an answer engine can understand you, trust you, or quote you, it has to be allowed to fetch your pages. That permission is granted or denied by a small set of controls that most teams set once and forget: robots.txt rules, content-delivery-network and firewall bot policies, and snippet directives like noindex and nosnippet. In the link era these controls mattered mainly for Googlebot. In the AI era they govern a growing roster of crawlers from OpenAI, Anthropic, Google, Perplexity, and others, each with a specific job.

The critical insight is that these crawlers are not interchangeable. OpenAI separates GPTBot, which gathers training data, from OAI-SearchBot, which surfaces sites in ChatGPT's search features, from ChatGPT-User, which fetches a page when a person asks. Anthropic separates ClaudeBot, Claude-User, and Claude-SearchBot along similar lines. Google separates Googlebot, which governs Search indexing, from Google-Extended, which governs whether your content trains Gemini and is used for grounding, and which Google states does not affect Search rankings. The practical consequence is that a single rule applied to all bots almost never expresses what a business actually wants, because the business usually wants to be findable, may or may not want to allow training, and wants to keep some things private. Those are three different decisions, and the tooling now lets you make them separately.

There is one more nuance that matters for honest guidance. These controls are advisory. robots.txt is a request that well-behaved crawlers honor, not a lock. Some user-triggered fetchers behave like a browser and generally ignore robots.txt, and overall compliance depends on the crawler operator. So access auditing is partly about making sure you are not accidentally blocking yourself, and partly about understanding that "blocked" and "absent" are not guaranteed to be the same thing.

This paper gives you the crawler matrix, the audit checks for Layer 1 of the AuditSpark.io framework, the common misconfigurations that silently cost visibility, a new advisory service agencies can build from this, and a ninety-day plan to get access right.

Why it matters now

Three shifts make access a present-tense problem rather than a housekeeping chore.

The crawler landscape multiplied. Two years ago, robots.txt strategy meant Googlebot and a short list of others. Today a single site is visited by training crawlers, search crawlers, and user-triggered fetchers from several providers, each independently controllable. The number of distinct decisions a site owner can make about access has grown faster than most teams have updated their robots.txt, which means a large share of files in production reflect a world that no longer exists.

Defensive blocks from the scraping panic are still live. When generative AI broke into the mainstream, many sites and agencies added blanket disallows to keep their content out of training data. That was a reasonable reaction at the time. The problem is that a blunt block often caught the search-oriented crawlers too, so some of those sites opted themselves out of AI answer visibility as a side effect, without ever intending to. Those rules are still sitting in production, doing damage nobody is watching.

Governance tooling arrived, which means inaction is now a choice. Google-Extended gives publishers a training-specific control that does not touch Search. Cloudflare introduced AI Crawl Control and a Content Signals policy that expresses intent in robots.txt across distinct categories. The existence of these tools changes the framing. When the only available control was a crude all-or-nothing block, a messy policy was understandable. Now that granular controls exist, a misconfigured access policy is a decision you are making by default, and it is one you can fix deliberately.

The urgency is real but it is not dramatic. Nobody needs to panic. The point is that access is the cheapest layer to get right and the most expensive to ignore, because a single wrong line can negate every other improvement you make.

Technical validation

This section lays out the documented behavior of the major crawlers. It is the most factual part of the series, and every claim here traces to provider documentation reviewed on 2026-06-26. Specifics can change, so treat the documentation as the source of truth at the time you act.

The crawler matrix

Crawler / token	Operator	Documented purpose	If you block it
Googlebot	Google	Crawls and indexes for Google Search	You can lose Search indexing, which also underlies eligibility for AI Overviews and AI Mode
Google-Extended	Google	Controls use of your content to train Gemini and for grounding	Training and grounding use is curtailed; Google states Search ranking and inclusion are not affected
GPTBot	OpenAI	Crawls content that may be used for model training	Training use is curtailed; this is not the bot that surfaces you in ChatGPT search
OAI-SearchBot	OpenAI	Surfaces websites in ChatGPT search features	You can lose eligibility to be surfaced in ChatGPT search results
ChatGPT-User	OpenAI	Fetches a page when a user action in ChatGPT requests it	User-initiated fetches may be refused; behaves like a user-triggered request, not bulk crawling
ClaudeBot	Anthropic	Crawls publicly available content for model training	Training use is curtailed
Claude-User	Anthropic	Fetches pages when a Claude user directs it to	User-directed retrieval may be reduced
Claude-SearchBot	Anthropic	Crawls to improve Claude search result quality	Search quality and visibility within Claude may be reduced
PerplexityBot	Perplexity	Builds and updates the index behind cited answers	You can lose discoverability in Perplexity's cited answers
Perplexity-User	Perplexity	Fetches a page when a live user asks	A robots.txt block has limited effect: it behaves like a browser request and generally does not obey robots.txt

The single most important reading of this table: the training crawlers and the search crawlers are different. Blocking GPTBot does not remove you from ChatGPT search, because that is OAI-SearchBot's job. Blocking ClaudeBot does not govern Claude's user-directed retrieval, because that is Claude-User. A team that wants to keep its content out of training while remaining visible in AI answers can express exactly that, but only if it controls the bots individually instead of with one sweeping rule.
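
A minimal sketch of that per-bot policy, using Python's standard-library robots.txt parser. The robots.txt content, the example.com URL, and the access_matrix helper are illustrative assumptions, and the parser only approximates how a compliant crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training, stay reachable for search crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

AGENTS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot",
    "Google-Extended", "Googlebot", "PerplexityBot",
]


def access_matrix(robots_txt, agents, url):
    """Return {agent: allowed} as the stdlib parser interprets the file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


matrix = access_matrix(ROBOTS_TXT, AGENTS, "https://example.com/pricing")
for agent, allowed in matrix.items():
    print(f"{agent:18} {'allowed' if allowed else 'blocked'}")
```

Under these assumptions the three training tokens come back blocked while the search crawlers stay allowed, which is the separation the table describes.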

The user-triggered exception

A second pattern cuts across providers. The user-triggered fetchers (ChatGPT-User, Claude-User, and Perplexity-User) behave like a person clicking a link rather than like an automated crawler. Documentation indicates these fetch a page when a live user asks, and at least Perplexity-User is described as generally not following robots.txt, since it is treated as a user request. This means a robots.txt rule may govern bulk crawling and indexing while having limited or no effect on a one-off, user-initiated fetch. It is a key reason the simple mental model of "blocked equals invisible" does not hold cleanly.

Snippet and preview controls

Beyond reachability, Google documents controls over how much of a page can be shown: noindex removes a page from the index entirely, nosnippet suppresses a text snippet, data-nosnippet marks specific passages to withhold, and max-snippet limits snippet length. Because eligibility to appear in Google's AI features depends on being indexable and eligible to show with a snippet, an overly aggressive nosnippet or max-snippet setting can reduce your eligibility for those surfaces even when the page is technically crawlable. Access is therefore not only "can they fetch it" but also "how much are they permitted to show."
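
One way to inventory these directives on a single page is sketched below. It reads only meta tags named robots or googlebot, so directives sent in an X-Robots-Tag response header and data-nosnippet attributes are out of scope. The snippet_risks helper and its min_chars threshold of 50 are hypothetical choices for illustration, not values from Google's documentation.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() not in {"robots", "googlebot"}:
            return
        for part in (attr.get("content") or "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def snippet_risks(html, min_chars=50):
    """Return directives that may limit indexing or snippet eligibility."""
    parser = RobotsMetaParser()
    parser.feed(html)
    risks = []
    for directive in sorted(parser.directives):
        if directive in {"noindex", "nosnippet"}:
            risks.append(directive)
        elif directive.startswith("max-snippet:"):
            try:
                limit = int(directive.split(":", 1)[1].strip())
            except ValueError:
                continue
            # -1 means no limit; 0 is documented as equivalent to nosnippet.
            if 0 <= limit < min_chars:
                risks.append(directive)
    return risks


page = (
    '<html><head><meta name="robots" content="index, max-snippet:0">'
    '<meta name="googlebot" content="NOSNIPPET"></head></html>'
)
print(snippet_risks(page))
```

Running this across page templates rather than single URLs is usually the faster way to find a global default someone set long ago.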

Access controls are advisory, not enforcement

robots.txt and content signals are requests, honored by cooperative crawlers, not technical barriers. Cloudflare's Content Signals policy makes site intent machine-readable across categories such as search, ai-input (real-time generative answers), and ai-train (model training). Its default expresses search as yes and ai-train as no, and leaves ai-input neutral. Cloudflare itself notes these signals are advisory and not technically enforceable. There has also been public reporting and dispute about whether some crawlers fully respect these directives. The honest position for an audit is twofold. First, make sure your own configuration is not blocking the visibility you want, which you fully control. Second, recognize that true content protection from non-compliant actors requires enforcement at the network or firewall level, not a polite line in a text file.
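
To make the content-signal categories concrete, here is a rough sketch that reads them out of a robots.txt file. It assumes the signals appear as a `Content-Signal:` line of comma-separated name=value pairs, which should be checked against Cloudflare's current documentation before relying on it. The parse_content_signals helper is hypothetical and ignores which User-agent group a signal belongs to.

```python
def parse_content_signals(robots_txt):
    """Return {signal_name: value} from Content-Signal lines (assumed syntax)."""
    signals = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            name, eq, setting = pair.partition("=")
            if eq:
                signals[name.strip().lower()] = setting.strip().lower()
    return signals


# Illustrative file matching the default described above:
# search yes, ai-train no, ai-input left unstated.
EXAMPLE = """\
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
"""

signals = parse_content_signals(EXAMPLE)
for category in ("search", "ai-input", "ai-train"):
    print(category, signals.get(category, "not stated"))
```

A category that comes back as "not stated" is neutral rather than denied, which is worth recording explicitly in an audit.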

Business impact

Access problems are uniquely costly because they are silent, they are upstream of everything, and they are usually accidental.

They negate every downstream investment. If you spend a quarter restructuring content for citation-worthiness and improving trust signals, but OAI-SearchBot is blocked at the firewall, the work cannot show up in ChatGPT search. Access sits above understandability, trust, and citation in the stack, so a single bad rule can zero out the return on everything beneath it. This is why an access audit is the highest-leverage hour in a generative engine optimization (GEO) program.

The loss is invisible in standard tools. A crawler that is disallowed simply does not appear, and an answer that omits you leaves no impression or click in analytics. There is no error message that says "an AI engine tried to reach you and was turned away." Unless someone inspects the rules and the logs, an accidental block can persist indefinitely while traffic looks normal.
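
Inspecting the logs can start as simply as counting AI crawler requests by status code. The sketch below assumes the common "combined" access-log layout and matches user-agent tokens from the crawler matrix. The crawler_hits helper and the sample log lines are illustrative, and because user-agent strings can be spoofed, the counts are a lead to follow rather than proof.

```python
import re
from collections import Counter

# Tokens from the crawler matrix. Google-Extended is omitted because Google
# documents it as a robots.txt token, not a separate HTTP user agent.
AI_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
)

# Assumed layout: ... "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
LOG_TAIL = re.compile(r'"\s(?P<status>\d{3})\s\S+\s"[^"]*"\s"(?P<ua>[^"]*)"\s*$')


def crawler_hits(log_lines):
    """Count requests per (crawler token, HTTP status)."""
    hits = Counter()
    for line in log_lines:
        match = LOG_TAIL.search(line)
        if not match:
            continue
        user_agent = match.group("ua").lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[(token, int(match.group("status")))] += 1
    return hits


# Made-up log lines for illustration only.
sample = [
    '203.0.113.5 - - [10/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.6 - - [10/Jun/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [10/Jun/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]
hits = crawler_hits(sample)
for (token, status), count in sorted(hits.items()):
    print(token, status, count)
```

A search crawler that only ever receives 403 or 429 responses, or that never appears at all, is the pattern to investigate.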

Most damage is self-inflicted and historical. The common pattern is not a deliberate, considered policy. It is a defensive rule added during the scraping panic, a security plugin's aggressive default, or a CDN bot-management setting that nobody mapped to AI visibility. These are easy to find once you look and easy to fix once found, which is the good news inside the bad news.

The protection question is a real business decision, not only a visibility one. Some organizations genuinely want their content excluded from model training for rights or competitive reasons. That is legitimate, and the tooling now supports it. The risk is conflating "keep out of training" with "be invisible to AI answers," and accidentally choosing the second when you meant the first. Treating training permission, search visibility, and content protection as three separate decisions is what turns a vague anxiety into a clear, defensible policy.

The practical audit framework

This paper expands Layer 1, Access, of the AuditSpark.io six-layer framework. The other five layers are covered across the rest of the series. The goal of an access audit is to answer one question with evidence: can the crawlers you want to reach you actually reach the content you want them to see, and are you withholding anything by accident?

Work through access in five passes.

1. Reachability of the site itself. Confirm the server returns healthy status codes, the site is not gated behind something that requires a login or a script to show core content, and important text is present in the served HTML rather than only after client-side rendering. A page that needs JavaScript to display its main content can be invisible to crawlers that do not execute scripts.

2. robots.txt intent versus reality. Read the actual robots.txt in production, not the one you think is deployed. Map each rule to the crawler matrix above. For every disallowed user-agent, ask whether that block reflects a current, deliberate decision or a leftover. Pay special attention to blanket rules that apply to all user-agents, since those are where accidental visibility loss usually hides.

3. CDN and firewall bot policy. robots.txt is not the only gate. Content-delivery networks and web application firewalls often have their own bot-management rules that can block or challenge crawlers before robots.txt is ever read. A site can have a perfectly permissive robots.txt and still be blocking AI crawlers at the edge. Check both layers.

4. Snippet and preview directives. Inventory noindex, nosnippet, data-nosnippet, and max-snippet usage across templates. Confirm that pages you want surfaced are indexable and allowed to show a snippet, and that any restrictions are intentional rather than a global default someone set long ago.

5. Policy coherence. Step back and state, in plain language, the three decisions: what should be findable by AI search crawlers, what may be used for training, and what should be protected. Then confirm the configuration actually expresses that. This is where Google-Extended, individual search and training bots, and content signals get reconciled into one intentional policy instead of a pile of historical rules.

A practical default for a business that wants AI visibility and has not ruled out training: allow the search-oriented crawlers and Googlebot, decide training case by case using the training-specific tokens, and reserve hard blocks for genuinely sensitive areas, enforced at the network level rather than trusted to robots.txt alone.
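
The three decisions can be written down as data and rendered into a robots.txt draft, as in the sketch below. AccessPolicy, render_robots_txt, and the /account/ path are hypothetical names for illustration, and the single allow_training flag simplifies what the text treats as a case-by-case decision. Protected paths appear only as Disallow hints, since real protection belongs at the network level.

```python
from dataclasses import dataclass, field

# Token groupings taken from the crawler matrix in this paper.
SEARCH_TOKENS = ("Googlebot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")
TRAINING_TOKENS = ("Google-Extended", "GPTBot", "ClaudeBot")


@dataclass
class AccessPolicy:
    findable: bool = True          # may search-oriented crawlers reach the site?
    allow_training: bool = False   # may content be used for model training?
    protected_paths: list = field(default_factory=list)  # advisory hints only


def render_robots_txt(policy):
    """Render one group for search crawlers and one for training tokens."""
    def group(tokens, rules):
        return [f"User-agent: {token}" for token in tokens] + rules + [""]

    hints = [f"Disallow: {path}" for path in policy.protected_paths]
    open_rules = hints + ["Allow: /"]
    closed_rules = ["Disallow: /"]

    lines = group(SEARCH_TOKENS, open_rules if policy.findable else closed_rules)
    lines += group(TRAINING_TOKENS, open_rules if policy.allow_training else closed_rules)
    return "\n".join(lines)


policy = AccessPolicy(findable=True, allow_training=False, protected_paths=["/account/"])
draft = render_robots_txt(policy)
print(draft)
```

Keeping the policy object in version control next to the generated file gives the written stance and the configuration a single source.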

The agency and freelancer opportunity

Access is the easiest GEO service to sell, because it is concrete, it is fast, and almost no client has done it.

The pitch is simple and credible: most websites have never been checked to see whether AI engines are even allowed to reach them, and a single old rule can be quietly removing them from AI answers. You are not promising rankings or citations. You are offering to find and fix a specific, demonstrable technical problem, which is the kind of work clients trust and pay for.

A productizable offering, an "AI Crawler Policy Review," can stand on its own or open a larger engagement. Deliver three things. First, a current-state report that maps the client's robots.txt, CDN, and firewall rules against the crawler matrix and flags every block that is costing visibility. Second, a recommended policy that separates the three decisions (findable, trainable, protected) into an explicit, written stance the client can approve. Third, the implementation (the corrected robots.txt and edge rules), plus a short document the client can keep so the policy is not lost the next time someone touches the stack.

This work is well-suited to a fixed fee because the scope is bounded, and it naturally leads to the higher layers. Once you have proven you can find a hidden visibility problem, the conversation about understandability, trust, citation-worthiness, and ongoing AI visibility measurement becomes much easier. Access is the wedge. The fifth paper in this series details how to package and price the full ladder.

Common mistakes

Blocking all bots to stop scrapers. The instinct is understandable, but a blanket disallow can opt you out of the search crawlers that feed AI answers while doing little to stop non-compliant actors. Separate the decisions.
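
The sketch below shows how such a legacy file reads to the search crawlers. The file content is a made-up example of the pattern, and blocked_search_crawlers is a hypothetical helper built on Python's standard-library parser, which approximates but does not guarantee how each real crawler interprets the rules.

```python
from urllib.robotparser import RobotFileParser

SEARCH_CRAWLERS = ("Googlebot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")

# Illustrative legacy file: Googlebot was exempted, everything else was blocked.
LEGACY_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def blocked_search_crawlers(robots_txt, url="https://example.com/"):
    """List the search-oriented crawlers that the file disallows for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in SEARCH_CRAWLERS if not parser.can_fetch(agent, url)]


blocked = blocked_search_crawlers(LEGACY_ROBOTS_TXT)
print(blocked)  # ['OAI-SearchBot', 'Claude-SearchBot', 'PerplexityBot']
```

The file never names an AI crawler, yet the wildcard group catches all three, which is why blanket rules deserve the first look.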

Assuming blocking the training bot hides you from the assistant. Blocking GPTBot or ClaudeBot governs training, not the engine's search or user-directed retrieval. Those are different bots. If your goal was visibility control, you may have changed nothing that matters and missed what does.

Treating robots.txt as enforcement. It is a request honored by cooperative crawlers. User-triggered fetchers may ignore it, and real protection requires network-level enforcement. Do not rely on a text file to keep sensitive content truly private.

Forgetting the CDN and firewall. A clean robots.txt means nothing if the edge is challenging or blocking crawlers first. Audit both layers, every time.
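
A rough first probe of the edge is to request the same URL with different user-agent strings and compare status codes, as sketched below. The helper names are hypothetical, and the bare tokens stand in for the fuller strings real crawlers send. Many CDNs also verify crawlers by IP address, so a request that merely claims to be a crawler can be treated differently from the real one. Treat a mismatch as a prompt to read the bot-management rules, not as a verdict.

```python
import urllib.error
import urllib.request


def http_status(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent header and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def edge_differences(url, user_agents, fetch=http_status, baseline_agent="Mozilla/5.0"):
    """Return (baseline_status, {agent: status}) for agents that differ."""
    baseline = fetch(url, baseline_agent)
    differences = {}
    for agent in user_agents:
        status = fetch(url, agent)
        if status != baseline:
            differences[agent] = status
    return baseline, differences


# Offline demonstration with a stand-in fetcher, so no network call is made.
def fake_fetch(url, user_agent):
    return 403 if "OAI-SearchBot" in user_agent else 200


baseline, differences = edge_differences(
    "https://example.com/", ["GPTBot", "OAI-SearchBot"], fetch=fake_fetch
)
print(baseline, differences)
```

Calling edge_differences without the fetch argument performs real requests, so run it only against sites you are authorized to test.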

Over-restricting snippets. Aggressive nosnippet or max-snippet settings can reduce eligibility for the surfaces you want, since those surfaces depend on being allowed to show a snippet. Confirm restrictions are deliberate.

Confusing Google-Extended with Search. Blocking Google-Extended controls Gemini training and grounding and, per Google, does not affect Search ranking or inclusion. Teams sometimes avoid the control out of a misplaced fear of hurting rankings, or block it without realizing it is training-only. Know which lever you are pulling.

Setting it once and never revisiting. The crawler roster and provider policies change. An access policy that was correct a year ago may be silently wrong today. Re-audit on a cadence.

The 30-, 60-, 90-day action plan

Days 1 to 30, find and fix accidental blocks. Pull the live robots.txt and read every rule against the crawler matrix. Identify any blanket disallows and any blocks on search-oriented crawlers such as OAI-SearchBot, Claude-SearchBot, and PerplexityBot, and confirm whether each is intentional. Check the CDN and firewall for bot-management rules that block or challenge these crawlers at the edge. Remove or correct the accidental blocks. Confirm core content is present in served HTML and not dependent on client-side rendering. This pass alone resolves most access problems.
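
The served-HTML check in this pass can be approximated without a browser: take the raw HTML the server returns, discard script and style content, and confirm the phrases that matter are still there. The missing_from_served_html helper and the sample page are hypothetical, and the phrase matching is deliberately crude.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def missing_from_served_html(html, key_phrases):
    """Return the key phrases that do not appear in the served text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return [phrase for phrase in key_phrases if phrase.lower() not in text]


# Illustrative page: the heading is served, the pricing copy exists only in script.
served = (
    '<html><body><div id="app"></div>'
    '<script>render("Pricing plans")</script><h1>Acme</h1></body></html>'
)
missing = missing_from_served_html(served, ["Acme", "Pricing plans"])
print(missing)  # ['Pricing plans']
```

Any phrase reported as missing is content that a crawler which does not execute JavaScript is unlikely to see.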

Days 31 to 60, set deliberate policy. Write down the three decisions in plain language: what should be findable, what may be used for training, what must be protected. Reconcile the configuration to match, using the training-specific tokens such as Google-Extended and GPTBot for the training decision and the search crawlers for the visibility decision. Review snippet and preview directives so the pages you want surfaced are eligible. Where content genuinely must be protected, move that enforcement to the network or firewall level rather than relying on robots.txt.

Days 61 to 90, document and monitor. Produce a short written access policy so the decisions survive staff changes and site migrations. Set up a way to notice future drift, whether that is periodic re-auditing, crawler-access monitoring, or simply a recurring calendar review. Then connect access to the rest of the program by moving up the stack to understandability and trust, knowing the foundation is now sound.
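
Drift monitoring can be as small as comparing the decisions implied by two snapshots of robots.txt. In the sketch below, the watched list, the helper names, and both snapshots are illustrative assumptions. It covers only robots.txt, so edge rules need their own check.

```python
from urllib.robotparser import RobotFileParser

WATCHED = ("Googlebot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "GPTBot")


def decisions(robots_txt, url="https://example.com/"):
    """Return {agent: allowed} for the watched crawlers."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in WATCHED}


def drift(previous_txt, current_txt):
    """Return {agent: (was_allowed, is_allowed)} for decisions that changed."""
    before, after = decisions(previous_txt), decisions(current_txt)
    return {
        agent: (before[agent], after[agent])
        for agent in WATCHED
        if before[agent] != after[agent]
    }


# Illustrative snapshots: a later change quietly blocks one search crawler.
LAST_REVIEWED = "User-agent: *\nAllow: /\n"
LIVE_NOW = "User-agent: OAI-SearchBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

changes = drift(LAST_REVIEWED, LIVE_NOW)
print(changes)  # {'OAI-SearchBot': (True, False)}
```

Run on a schedule against the last approved file, an empty result means the written policy and the live file still agree.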

Checklist

Reachability

Server returns healthy status codes on key pages
Core content is in the served HTML, not only rendered by JavaScript
No login or interaction required to reach primary content
Sitemap present, current, and referenced in robots.txt

robots.txt

You have read the live production file, not an assumed version
Every disallowed user-agent maps to a deliberate decision
No blanket block that unintentionally catches search crawlers
Search crawlers you want (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot) are allowed

Edge and firewall

CDN bot-management rules reviewed for crawler blocking or challenges
Web application firewall not silently blocking AI crawlers
Rate-limiting not so aggressive it looks like a block to crawlers

Snippet and preview

No unintended noindex on pages you want surfaced
nosnippet and max-snippet used only where deliberate
data-nosnippet applied only to passages you truly want withheld

Policy coherence

The findable decision is explicit and implemented
The training decision is explicit, using training-specific tokens
The protect decision is enforced at the network level, not just robots.txt
The policy is written down and has a review cadence

See where your own site stands

Run a free AuditSpark.io AI and GEO Readiness audit: score, executive summary, and the fixes that matter, in minutes.