Decent Crawling: A Proposal for Respectful Web Indexing

The current state of web crawling is wasteful. Search engines, AI training pipelines, and other automated agents repeatedly fetch entire websites on aggressive schedules — regardless of whether anything has changed. This puts unnecessary load on servers, wastes bandwidth, and disrespects the implicit contract between publishers and crawlers.

It doesn’t have to be this way. What follows is a simple, practical proposal for decent crawling: a set of principles that dramatically reduce unnecessary requests while keeping indexes fresh.

1. Respect robots.txt — actually

Every crawler claims to honor robots.txt, but decent crawling starts by treating it as more than just a list of disallowed paths. The file often contains Sitemap: directives pointing to one or more XML sitemaps. These are an explicit invitation from the site owner: here is what I publish, and here is how to find it. Use them.

If robots.txt does not declare a sitemap, a crawler can fall back to well-known conventions before resorting to blind crawling:

  • /sitemap.xml
  • /sitemap_index.xml
  • /sitemaps/sitemap.xml

Only if none of these exist should a crawler consider broader discovery strategies — and even then, it should do so conservatively.
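The discovery order above can be sketched in a few lines of Python. This is illustrative, not a reference implementation: the caller is assumed to supply the raw robots.txt text and some way of checking whether a candidate URL exists (the `url_exists` callback here is a placeholder, not a real library API).

```python
from urllib.parse import urljoin

# Well-known fallback locations, checked in order.
WELL_KNOWN = ["/sitemap.xml", "/sitemap_index.xml", "/sitemaps/sitemap.xml"]

def discover_sitemaps(base_url, robots_txt, url_exists):
    # 1. Prefer explicit Sitemap: directives in robots.txt.
    declared = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]
    if declared:
        return declared
    # 2. Fall back to well-known locations.
    for path in WELL_KNOWN:
        candidate = urljoin(base_url, path)
        if url_exists(candidate):
            return [candidate]
    # 3. Nothing found; only now should broader (and conservative)
    #    discovery strategies be considered.
    return []
```

Note that the Sitemap: directive is defined at the level of the whole robots.txt file, not per user-agent group, which is why the sketch scans every line rather than parsing groups.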

2. Poll the sitemap, not the site

Rather than re-crawling thousands of pages to check for changes, a decent crawler periodically fetches only the sitemap. This is a single, small request that provides a complete picture of a site’s published content. Sitemaps typically include a <lastmod> timestamp for each URL, making change detection trivial.

The polling interval for the sitemap itself can be generous — once a day is sufficient for most sites, and many would be well-served by even less frequent checks.
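A sitemap poll boils down to parsing one XML document into a {url: lastmod} snapshot. A minimal sketch using the standard library, assuming a plain urlset in the standard sitemap namespace (real deployments may instead serve a sitemap *index* pointing at further sitemaps, which this deliberately ignores):

```python
import xml.etree.ElementTree as ET

# The namespace defined by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Parse sitemap XML into {url: lastmod-or-None}."""
    root = ET.fromstring(xml_text)
    snapshot = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        # <lastmod> is optional in the protocol, so it may be absent.
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if loc:
            snapshot[loc.strip()] = lastmod
    return snapshot
```

Since <lastmod> is optional, a decent crawler should treat a missing timestamp conservatively (for example, by falling back to a conditional fetch of that one page) rather than assuming the page is unchanged.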

3. Crawl only what changed

With a sitemap in hand, the crawler compares the current version to the previous one and identifies:

  • New URLs — pages that have been added since the last check.
  • Modified URLs — pages where <lastmod> has been updated.
  • Removed URLs — pages that are no longer listed and can be de-indexed.

Only new and modified URLs need to be fetched. Everything else stays as-is.
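The three-way comparison above is a straightforward dictionary diff. A sketch, assuming snapshots shaped as {url: lastmod} from the sitemap parse:

```python
def diff_snapshots(old, new):
    """Diff two {url: lastmod} snapshots into (added, modified, removed)."""
    added = [u for u in new if u not in old]
    modified = [u for u in new if u in old and new[u] != old[u]]
    removed = [u for u in old if u not in new]
    return added, modified, removed
```

Only the URLs in `added` and `modified` are fetched; `removed` is handled purely index-side, with no request to the origin server at all.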

The impact

For a site with 10,000 pages where 50 change per day, this approach reduces daily crawl requests from 10,000+ to roughly 50 — plus one sitemap fetch. That’s a reduction of over 99%.
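The arithmetic behind that figure, using the example numbers above:

```python
naive = 10_000       # fetch every page, every day
decent = 50 + 1      # changed pages, plus one sitemap fetch
reduction = 1 - decent / naive
print(f"{reduction:.2%}")  # → 99.49%
```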

Multiply this across millions of sites and the savings in server load, energy, and bandwidth become enormous.

A call to action

This isn’t a new protocol. It doesn’t require new standards or infrastructure. Everything described here already exists — robots.txt, XML sitemaps, <lastmod> timestamps — and has for years. The proposal is simply that crawlers should use what’s already there, rather than brute-forcing their way through the web on repeat.

Decent crawling is a choice. It’s time more of us made it.