
What Does an AI Crawler Visit First?

Many well-behaved crawlers fetch /robots.txt early because it’s the standard place to learn where they’re allowed to crawl and (often) where to find a site’s sitemap. For example, Google states its crawlers download and parse robots.txt before crawling to determine what may be crawled.

That said, “AI crawler” isn’t one behavior: some operators run multiple bots for search, training, and user-requested retrieval, each with different rules and controls.

Try CustomGPT with the 7-day free trial to validate ingestion.

TL;DR

AI crawlers often request /robots.txt first to quickly determine crawl permissions (REP) and discover high-value entry points like sitemaps, before spending requests on content URLs.

Check robots.txt, sitemap, and rate limits to reduce unwanted crawling.

What “Visit First” Means

In this article, “visit first” means: the first HTTP request you observe for a crawler identity (user-agent and/or IP) in your logs, based on the earliest timestamp.

It does not mean “first indexed” or “first used in an AI answer.” Crawling and indexing are separate processes, and indexing isn’t guaranteed even if a URL is crawled.

Why /robots.txt Is A Common First Request

1) It Answers “Am I Allowed To Crawl Here?”

The Robots Exclusion Protocol (REP) defines how site owners publish crawler rules in a file named robots.txt (commonly accessed at /robots.txt) and makes clear that these rules are not access authorization; they are advisory guidance for crawlers.

2) It’s A Fast, Low-Cost “Rules Check” Before Content Fetches

Fetching a small text file first is cheaper than testing many pages and then discovering they were disallowed. For Google specifically, the robots.txt file is fetched with a non-conditional GET, and its rules apply only to the specific host, protocol, and port on which it is served.
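
As a minimal sketch of that rules-check pattern, the snippet below uses Python's standard-library robots parser; the domain, path, and user-agent are placeholders, and a real crawler would add error handling and caching.

```python
# Minimal sketch: fetch robots.txt and check permission before requesting
# content URLs. Domain, path, and user-agent are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Rules are scoped to this exact protocol/host/port; a different subdomain
# or scheme would need its own robots.txt fetch.
rp.set_url("https://example.com/robots.txt")
rp.read()  # a plain GET of the small text file

user_agent = "ExampleBot"  # placeholder crawler identity
target = "https://example.com/docs/getting-started"
print(rp.can_fetch(user_agent, target))  # True unless a rule disallows it
```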

The Other Common “First URLs”

A Sitemap

Many crawlers look for sitemaps because they list URLs efficiently and reduce guesswork. The sitemap protocol is defined at sitemaps.org, and Google’s robots spec supports declaring one or more sitemap URLs via Sitemap: lines in robots.txt.
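
As a rough illustration, a crawler-side discovery step might look like the sketch below: read the Sitemap: lines from robots.txt, then list the URLs each sitemap declares. The domain is a placeholder, and real code would also handle errors, gzipped sitemaps, and sitemap index files.

```python
# Sketch: discover sitemap URLs from robots.txt, then list the URLs each
# sitemap declares. Domain is a placeholder.
import urllib.request
import xml.etree.ElementTree as ET

robots = urllib.request.urlopen("https://example.com/robots.txt").read().decode("utf-8", "replace")
sitemap_urls = [
    line.split(":", 1)[1].strip()
    for line in robots.splitlines()
    if line.lower().startswith("sitemap:")
]

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
for sm in sitemap_urls:
    tree = ET.fromstring(urllib.request.urlopen(sm).read())
    for loc in tree.iter(f"{NS}loc"):
        print(loc.text)  # candidate URLs to crawl, no guesswork needed
```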

A Seed URL

If a crawler doesn’t find or use a sitemap, it may start from a known seed (commonly the homepage) and expand via internal links.

Important qualifier: If you see the homepage as the first hit, it can mean the bot already had robots rules cached, already knew a seed URL, or didn’t (or couldn’t) use a sitemap.
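
For intuition, a seed-URL crawl in its simplest form looks something like the sketch below: start at a known page and expand via same-host links, breadth-first. It is illustrative only, with a placeholder seed, a tiny page cap, and no robots.txt check or throttling.

```python
# Sketch: naive breadth-first crawl from a seed URL, following only links on
# the same host. Purely illustrative; a real crawler would respect robots.txt,
# throttle requests, and deduplicate more carefully.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.request

class LinkExtractor(HTMLParser):
    # Collects href values from <a> tags.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

seed = "https://example.com/"  # placeholder seed (often the homepage)
host = urlparse(seed).netloc
queue, seen = deque([seed]), {seed}

while queue and len(seen) < 10:  # tiny cap to keep the sketch bounded
    url = queue.popleft()
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    extractor = LinkExtractor()
    extractor.feed(html)
    for href in extractor.links:
        absolute = urljoin(url, href)
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)

print(sorted(seen))  # the pages this naive crawl would discover
```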

When /robots.txt Is Not The First Hit

Two common reasons:

  1. Caching: Some crawlers cache robots.txt and won’t refetch it on every crawl. Google notes robots.txt is generally cached (often up to 24 hours, sometimes longer depending on conditions).
  2. User-Triggered Fetchers: Some bots fetch pages because a user asked, not because the bot is doing an automatic crawl. OpenAI notes ChatGPT-User may visit pages for user actions and that “robots.txt rules may not apply” in that scenario; Google similarly distinguishes cases where REP isn’t applicable to certain user-controlled crawlers.

How To Verify What An AI Crawler Visited First

  1. Pull raw server/CDN/WAF logs for the time window you care about.
  2. Filter by user-agent (and IPs, if you have operator-published ranges).
  3. Sort by timestamp ascending.
  4. Record the first URL path and HTTP status.
  5. Trace the next 20–100 requests (a log-parsing sketch follows this list) to see whether it moved to:
    • /robots.txt
    • a sitemap (from Sitemap: in robots.txt or a known sitemap location)
    • the homepage
    • specific content paths
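
A minimal sketch of steps 2–5, assuming a combined-format access log and “GPTBot” as the user-agent token of interest; adjust the path, regex, and token to your own setup.

```python
# Sketch: find the earliest requests from one crawler identity in a
# combined-format access log. The log path and the "GPTBot" token are
# assumptions; adapt the regex if your log format differs.
import re
from datetime import datetime

LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

hits = []
with open("access.log") as log:  # placeholder path
    for line in log:
        m = LOG_LINE.match(line)
        if m and "GPTBot" in m.group("ua"):
            ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            hits.append((ts, m.group("path"), m.group("status"), m.group("ip")))

hits.sort(key=lambda hit: hit[0])  # earliest request first
for ts, path, status, ip in hits[:20]:  # the first URL plus what followed
    print(ts.isoformat(), path, status, ip)
```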

Tip: Don’t trust user-agent strings alone for high-stakes decisions. Prefer operator guidance for verification (for example, OpenAI publishes bot IP ranges for some user agents).
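
For example, a verification step might compare the requesting IP against operator-published ranges, as in the sketch below. The CIDR shown is a documentation placeholder, not a real bot range.

```python
# Sketch: trust the user-agent claim only if the source IP falls inside an
# operator-published range. The CIDR below is a placeholder; load the real
# ranges from the operator's published list.
import ipaddress

published_ranges = ["192.0.2.0/24"]  # placeholder, not a real bot range
request_ip = ipaddress.ip_address("192.0.2.15")

verified = any(request_ip in ipaddress.ip_network(cidr) for cidr in published_ranges)
print("verified operator IP" if verified else "unverified; treat the UA claim with caution")
```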

How To Influence What Crawlers Do First

Serve A Real robots.txt

  • Ensure /robots.txt exists and returns a clean response (avoid HTML error pages); a quick spot-check sketch follows this list.
  • Keep rules readable and correct; robots is guidance, not security.
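
A quick way to spot-check this is sketched below: confirm the file returns 200 with a text content type rather than an HTML error page. The domain and checker user-agent string are placeholders.

```python
# Sketch: check that /robots.txt is served cleanly (200, text, not an HTML
# error page). Domain is a placeholder; urlopen raises HTTPError on 4xx/5xx,
# which is itself worth logging.
import urllib.request

req = urllib.request.Request(
    "https://example.com/robots.txt",
    headers={"User-Agent": "robots-check/0.1"},  # placeholder identity
)
with urllib.request.urlopen(req) as resp:
    status = resp.status
    content_type = resp.headers.get("Content-Type", "")
    first_bytes = resp.read(2048).decode("utf-8", "replace")

print(status, content_type)
if status == 200 and "html" not in content_type.lower():
    print("\n".join(first_bytes.splitlines()[:5]))  # preview the rules
else:
    print("Unexpected response; crawlers may misread your crawl rules.")
```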

Publish A Sitemap And Reference It

  • Generate a sitemap that lists canonical, preferred URLs, per the sitemaps.org protocol (a minimal generation sketch follows this list).
  • Add one or more Sitemap: lines in robots.txt so crawlers can discover it quickly.
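
A minimal generation sketch, using placeholder URLs; a real sitemap would list every canonical page you want crawled and nothing else.

```python
# Sketch: write a minimal sitemap per the sitemaps.org protocol.
# The URLs are placeholders.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)
for page in ["https://example.com/", "https://example.com/docs/getting-started"]:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
# Then reference it from robots.txt, e.g.:
#   Sitemap: https://example.com/sitemap.xml
```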

Remove Crawl Traps Early

Common crawl traps that waste early requests:

  • infinite faceted navigation
  • calendars with unbounded pages
  • session IDs / parameter permutations
  • duplicate URL variants (http/https, www/non-www, trailing slash variants)
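
One way to tame the last two traps is to normalize URLs to a single canonical form before they enter your crawl surface (alongside redirects, canonical tags, and parameter rules). The sketch below shows the idea; the parameter names it strips are assumptions, not a definitive list.

```python
# Sketch: collapse common duplicate URL variants (scheme, www, trailing slash,
# session/tracking parameters) into one canonical form. Parameter names are
# assumptions; adjust to your own URLs. Requires Python 3.9+ (removeprefix).
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

STRIP_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/") or "/"
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() not in STRIP_PARAMS])
    return urlunsplit(("https", host, path, query, ""))

print(canonicalize("http://www.example.com/docs/?sessionid=abc123"))
# -> https://example.com/docs
```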

Rate Limit Carefully

If load is the issue, apply rate limits at the edge after deciding which bots you want (a conceptual token-bucket sketch follows below):

  • Blocking robots.txt can be counterproductive: some crawlers change behavior depending on robots status codes. Google documents different handling for 2xx/3xx/4xx/5xx responses.
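
Conceptually, per-bot rate limiting is a token bucket keyed by crawler identity, as sketched below. In practice this usually lives in CDN/WAF edge rules rather than application code, and the limits shown are arbitrary example values.

```python
# Conceptual sketch of a per-bot token bucket. Limits and the "GPTBot" key are
# example values; real deployments configure this at the CDN/WAF edge.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {"GPTBot": TokenBucket(rate_per_sec=1.0, burst=5)}  # example limit

def should_serve(user_agent: str) -> bool:
    for bot, bucket in buckets.items():
        if bot in user_agent:
            return bucket.allow()  # throttled requests get a 429, not a block
    return True  # traffic not listed above is left alone here
```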

Crawl-Delay Is Not Universal

  • Some operators support Crawl-delay (a non-standard directive). Anthropic explicitly states it supports Crawl-delay for its bots.
  • Google’s robots spec notes fields like Crawl-delay aren’t supported by Google’s crawlers.
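
If you do publish a Crawl-delay line, Python's standard robots parser can read it back, as in the sketch below; whether a given crawler honors it is up to that operator. The domain and agent name are placeholders.

```python
# Sketch: read a Crawl-delay value from robots.txt. Placeholders throughout;
# honoring the directive is up to each crawler operator.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

delay = rp.crawl_delay("ExampleBot")  # None if no Crawl-delay applies
print(f"Pause {delay or 0} seconds between requests to this host")
```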

How This Relates To CustomGPT.ai Website Ingestion

If your goal is to control what CustomGPT indexes when you build an agent from your website, you can provide a URL or sitemap and adjust where crawling begins.

  • To create an agent from a website URL or sitemap, follow: Create AI Agent From Website.
  • If no sitemap is found, CustomGPT defaults to recursive crawling from the homepage of the provided domain.
  • If you want it to start from the exact URL you entered, use the “Not what you expected?” option to select “Start crawling from the provided URL rather than the home page”.

Example: Explaining A “First Visit” Sequence In Logs

If your logs show:

  • 09:00:01 GET /robots.txt 200
  • 09:00:02 GET /sitemap.xml 200
  • 09:00:04 GET / 200
  • 09:00:06 GET /docs/ 200
  • 09:00:08 GET /docs/getting-started 200

A reasonable interpretation is:

  • The bot started with policy discovery (robots.txt) and then URL discovery (sitemap).
  • Then it fetched hub pages to expand discovery and prioritization.

If the first hit is the homepage instead (and you expected a deep page), common causes include: robots rules already cached, sitemap not found/used, or a seed-URL crawl strategy.

Conclusion

AI crawlers often hit /robots.txt first to learn crawl rules and find sitemaps before spending requests on pages. Use CustomGPT.ai to ingest from a URL or sitemap and validate coverage with the 7-day free trial.

FAQ

If A Bot Fetches /robots.txt First, Does That Mean It Will Crawl My Whole Site?

Not necessarily. robots.txt only tells a bot what it may crawl, and bots can still choose to crawl very little (or nothing) based on their own priorities and constraints. Also, robots rules are advisory and not access control. For indexing-related behavior, crawling does not guarantee inclusion. (See RFC 9309 and Google’s crawling/indexing FAQ.)

Why Do I See Repeated /robots.txt Requests Every Day?

Caching and refresh behavior is a common cause. For example, Google notes robots.txt is typically cached (often up to ~24 hours) and may be refetched sooner or later depending on errors and cache headers. If you change robots.txt frequently, you may also see more frequent refetches.

Do OpenAI Or Anthropic Bots Always Obey robots.txt?

For their automatic crawlers, both publish guidance that site owners can use to manage access via robots.txt. However, OpenAI also describes ChatGPT-User as a user-initiated agent where “robots.txt rules may not apply.” If you need strict control, combine robots directives with edge controls and verification.

When Building A CustomGPT Agent, Should I Enter A URL Or A Sitemap?

If you have a clean sitemap that represents exactly what you want indexed, it’s usually the most controllable input. CustomGPT supports entering either a URL or sitemap when creating an agent. If no sitemap is found, CustomGPT may crawl from the homepage unless you change the “start crawling from the provided URL” option.

My Site Has No Sitemap. How Do I Keep CustomGPT From Crawling Unrelated Sections?

Use the “Not what you expected?” option to start crawling from the exact URL you entered rather than the homepage, and keep that URL in a tightly scoped section of your site. If you can, create a curated sitemap that lists only the pages you want indexed, then use that sitemap as the ingestion input.
