Many well-behaved crawlers fetch /robots.txt early because it’s the standard place to learn where they’re allowed to crawl and (often) where to find a site’s sitemap. For example, Google states its crawlers download and parse robots.txt before crawling to determine what may be crawled.
That said, “AI crawler” isn’t one behavior: some operators run multiple bots for search, training, and user-requested retrieval, each with different rules and controls.
Try CustomGPT with the 7-day free trial to validate ingestion.
TL;DR
AI crawlers often request /robots.txt first to quickly determine crawl permissions (REP) and discover high-value entry points like sitemaps, before spending requests on content URLs. Check robots.txt, sitemap, and rate limits to reduce unwanted crawling.
What “Visit First” Means
In this article, “visit first” means: the first HTTP request you observe for a crawler identity (user-agent and/or IP) in your logs, based on the earliest timestamp. It does not mean “first indexed” or “first used in an AI answer.” Crawling and indexing are separate processes, and indexing isn’t guaranteed even if a URL is crawled.
Why /robots.txt Is A Common First Request
1) It Answers “Am I Allowed To Crawl Here?”
The Robots Exclusion Protocol (REP) defines how site owners publish crawler rules in a file named robots.txt (commonly accessed at /robots.txt) and makes clear that these rules are not access authorization; they’re advisory guidance for crawlers.
2) It’s A Fast, Low-Cost “Rules Check” Before Content Fetches
Fetching a small text file first is cheaper than testing many pages and then discovering they were disallowed. For Google specifically, the robots file is fetched via a non-conditional GET and the rules apply only to the specific host/protocol/port where it’s served.
The Other Common “First URLs”
A Sitemap
Many crawlers will look for sitemaps because they list URLs efficiently and reduce guesswork. The sitemap protocol is defined at sitemaps.org, and Google’s robots spec supports declaring one or more sitemap URLs via Sitemap: lines in robots.txt.
A Seed URL
If a crawler doesn’t find or use a sitemap, it may start from a known seed (commonly the homepage) and expand via internal links. Important qualifier: If you see the homepage as the first hit, it can mean the bot already had robots rules cached, already knew a seed URL, or didn’t (or couldn’t) use a sitemap.
When /robots.txt Is Not The First Hit
Two common reasons:
- Caching: Some crawlers cache robots.txt and won’t refetch it on every crawl. Google notes robots.txt is generally cached (often up to 24 hours, sometimes longer depending on conditions).
- User-Triggered Fetchers: Some bots fetch pages because a user asked, not because the bot is doing an automatic crawl. OpenAI notes ChatGPT-User may visit pages for user actions and that “robots.txt rules may not apply” in that scenario; Google similarly distinguishes cases where REP isn’t applicable to certain user-controlled crawlers.
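For intuition, the allow/disallow check that robots-respecting crawlers perform can be sketched with Python’s stdlib urllib.robotparser. The rules below and the ExampleBot/OtherBot user-agents are illustrative, not tied to any real operator:

```python
# Sketch: parsing robots.txt rules and answering "am I allowed to crawl here?"
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /drafts/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A specific user-agent group overrides the * group for that bot.
print(rp.can_fetch("ExampleBot", "https://example.com/docs/"))     # True
print(rp.can_fetch("ExampleBot", "https://example.com/drafts/x"))  # False
print(rp.can_fetch("OtherBot", "https://example.com/private/x"))   # False (falls to *)
```

Remember this is advisory: the parser tells a well-behaved bot what it *should* do, not what a server enforces.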
How To Verify What An AI Crawler Visited First
- Pull raw server/CDN/WAF logs for the time window you care about.
- Filter by user-agent (and IPs, if you have operator-published ranges).
- Sort by timestamp ascending.
- Record the first URL path and HTTP status.
- Trace the next 20–100 requests to see whether it moved to:
- /robots.txt
- a sitemap (from Sitemap: in robots.txt or a known sitemap location)
- the homepage
- specific content paths
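The verification steps above can be sketched in a few lines of Python. The log tuples and the ExampleBot user-agent string are hypothetical stand-ins; adapt the parsing to your actual server/CDN log schema:

```python
# Sketch: deriving a crawler's "first visit" from access-log records.
from datetime import datetime

LOGS = [
    # (timestamp, user_agent, path, status) -- hypothetical records
    ("2024-05-01T09:00:04", "ExampleBot/1.0", "/", 200),
    ("2024-05-01T09:00:01", "ExampleBot/1.0", "/robots.txt", 200),
    ("2024-05-01T09:00:02", "ExampleBot/1.0", "/sitemap.xml", 200),
    ("2024-05-01T09:00:03", "Mozilla/5.0", "/pricing", 200),
]

def first_visit(logs, ua_substring):
    """Earliest request for a crawler identity, matched by user-agent substring."""
    hits = [e for e in logs if ua_substring in e[1]]
    hits.sort(key=lambda e: datetime.fromisoformat(e[0]))  # timestamp ascending
    return hits[0] if hits else None

first = first_visit(LOGS, "ExampleBot")
print(first[2], first[3])  # -> /robots.txt 200
```

The same sorted list gives you the next 20–100 requests to trace the bot’s path from robots.txt into sitemaps and content URLs.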
How To Influence What Crawlers Do First
Serve A Real robots.txt
- Ensure /robots.txt exists and returns a clean response (avoid HTML error pages).
- Keep rules readable and correct; robots is guidance, not security.
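What counts as a “clean response” can be expressed as a pure function over the status, content type, and body your HTTP client observed. The specific checks below are an illustrative sketch, not an exhaustive validator:

```python
# Sketch: sanity checks a healthy /robots.txt response should pass.
def looks_like_valid_robots(status: int, content_type: str, body: str) -> bool:
    if status != 200:
        return False
    if "text/plain" not in content_type.lower():
        return False  # HTML error pages usually ship as text/html
    lowered = body.lstrip().lower()
    if lowered.startswith("<!doctype") or lowered.startswith("<html"):
        return False  # an HTML page served at /robots.txt is a misconfiguration
    return True

print(looks_like_valid_robots(200, "text/plain; charset=utf-8", "User-agent: *\nAllow: /"))  # True
print(looks_like_valid_robots(200, "text/html", "<html>Not Found</html>"))                   # False
```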
Publish A Sitemap And Reference It
- Generate a sitemap that lists canonical, preferred URLs (sitemaps.org protocol).
- Add one or more Sitemap: lines in robots.txt so crawlers can discover it quickly.
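A minimal robots.txt combining these two recommendations might look like the following (the domain and sitemap paths are placeholders):

```
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/news-sitemap.xml
```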
Remove Crawl Traps Early
Common crawl traps that waste early requests:
- infinite faceted navigation
- calendars with unbounded pages
- session IDs / parameter permutations
- duplicate URL variants (http/https, www/non-www, trailing slash variants)
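A common mitigation for the duplicate-variant trap is canonicalizing URLs before measuring (or serving) crawl budget. Here is a sketch in Python; the parameter blocklist is illustrative:

```python
# Sketch: collapsing common duplicate URL variants (scheme, www, trailing
# slash, session/tracking parameters) into one canonical form.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DROP_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign"}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/") or "/"
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() not in DROP_PARAMS])
    return urlunsplit(("https", host, path, query, ""))

variants = [
    "http://www.example.com/docs/",
    "https://example.com/docs",
    "https://example.com/docs?utm_source=bot",
]
print({canonicalize(u) for u in variants})  # all three collapse to one URL
```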
Rate Limit Carefully
If load is the issue, apply rate limits at the edge after deciding which bots you want:
- Blocking robots.txt can be counterproductive: some crawlers change behavior depending on robots status codes. Google documents different handling for 2xx/3xx/4xx/5xx responses.
Crawl-Delay Is Not Universal
- Some operators support Crawl-delay (non-standard). Anthropic explicitly states it supports crawl-delay for its bots.
- Google’s robots spec notes fields like crawl-delay aren’t supported by Google’s crawlers.
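Given the uneven support, a per-operator group is one way to express a delay only where it is honored. The ClaudeBot group below reflects Anthropic’s stated support; the delay value is illustrative:

```
# Crawl-delay is non-standard; honored by some operators (e.g. Anthropic), ignored by Google.
User-agent: ClaudeBot
Crawl-delay: 10
```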
How This Relates To CustomGPT.ai Website Ingestion
If your goal is to control what CustomGPT indexes when you build an agent from your website, you can provide a URL or sitemap and adjust where crawling begins.
- To create an agent from a website URL or sitemap, follow: Create AI Agent From Website.
- If no sitemap is found, CustomGPT defaults to recursive crawling from the homepage of the provided domain.
- If you want it to start from the exact URL you entered, use the “Not what you expected?” option to select “Start crawling from the provided URL rather than the home page”.
Example: Explaining A “First Visit” Sequence In Logs
If your logs show:
- 09:00:01 GET /robots.txt 200
- 09:00:02 GET /sitemap.xml 200
- 09:00:04 GET / 200
- 09:00:06 GET /docs/ 200
- 09:00:08 GET /docs/getting-started 200
- The bot started with policy discovery (robots.txt) and then URL discovery (sitemap).
- Then it fetched hub pages to expand discovery and prioritization.