

Robots.txt & Sitemap

Crawl rules, disallowed paths, sitemap-index recursion, and the "interesting by omission" entropy view.

The robots.txt and sitemap tool fetches both files for a site and analyses them for the paths the operator has told crawlers about: the paths to include (the sitemap) and the paths to exclude (robots.txt Disallow rules).

For OSINT purposes, the disallow paths are often more interesting than the sitemap. They tell you which directories the operator considers sensitive enough to ask crawlers not to index.
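A minimal sketch of that fetch step, assuming the conventional /robots.txt and /sitemap.xml locations; the function and variable names below are illustrative, not the tool's actual API:

```python
# Minimal sketch: fetch robots.txt and discover the sitemap URLs it declares.
# Standard library only; names are illustrative, not the tool's API.
from urllib.parse import urljoin
from urllib.request import urlopen

def fetch_robots_and_sitemaps(base_url: str) -> tuple[str, list[str]]:
    """Return the raw robots.txt body and every sitemap URL it declares."""
    robots_url = urljoin(base_url, "/robots.txt")
    with urlopen(robots_url, timeout=10) as resp:
        robots_txt = resp.read().decode("utf-8", errors="replace")

    sitemaps = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]
    # Fall back to the conventional location if robots.txt names no sitemap.
    if not sitemaps:
        sitemaps = [urljoin(base_url, "/sitemap.xml")]
    return robots_txt, sitemaps
```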

What you get

Section | What it surfaces
robots.txt rules | Per-user-agent rules: Allow, Disallow, Crawl-delay
Sitemap URLs | Sitemaps linked from robots.txt, plus standard locations
Sitemap entries | Every URL listed in the sitemap (or sitemaps, when recursing)
Sitemap-index recursion | Sitemap-index files unwrap into their child sitemaps automatically
lastmod histogram | When the sitemap entries were last updated, grouped by month
Diff against archived | Snapshot diff of robots.txt across Wayback captures

Disallow paths — interesting by omission

A path that appears in Disallow: is a path the operator does not want indexed. Common reasons:

  • Admin panels (/admin, /wp-admin).
  • Private API endpoints (/api/internal, /private).
  • Account areas (/account, /dashboard, /profile).
  • Search results pages (/search?).
  • Staging / preview (/staging, /preview).

For reconnaissance, these paths are known to exist (otherwise the operator would not have written a rule for them) and known to be sensitive (otherwise the operator would not have asked for them to stay out of search engines).

The tool sorts disallow paths by uniqueness across the site — paths that appear in the disallow list but not in the sitemap are highlighted as "interesting by omission."
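One way to read that highlight is as a set comparison: a Disallow prefix that no sitemap entry falls under gets flagged. A sketch under that assumption (the tool's real scoring may weigh other signals):

```python
# Sketch of the "interesting by omission" idea as a set comparison:
# a Disallow prefix that no sitemap entry falls under is flagged.
from urllib.parse import urlparse

def interesting_by_omission(disallow_paths: list[str], sitemap_urls: list[str]) -> list[str]:
    sitemap_paths = [urlparse(u).path for u in sitemap_urls]
    flagged = []
    for rule in disallow_paths:
        prefix = rule.rstrip("*")  # crude handling of trailing wildcards
        if not any(path.startswith(prefix) for path in sitemap_paths):
            flagged.append(rule)
    return flagged

# Example: /admin is disallowed but never advertised in the sitemap.
print(interesting_by_omission(
    ["/admin", "/blog/"],
    ["https://example.com/blog/post-1", "https://example.com/about"],
))  # -> ['/admin']
```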

"Interesting by omission" is the OSINT investigator's friend. The page that the site does not advertise is often the page that matters most.

Sitemap recursion

A sitemap-index file lists other sitemap files. The tool detects index files and recurses into the children, returning the merged URL list with a source-sitemap column so you can see which sitemap each URL came from.

For very large sites (e-commerce catalogs, news archives, large CMSes), recursion can return tens of thousands of URLs. The tool paginates the result; CSV export gives you the full list.
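A sketch of the recursion step, assuming standard sitemap XML in the sitemaps.org namespace; the function name, depth limit, and return shape are illustrative:

```python
# Sketch of sitemap-index recursion: if the root element is <sitemapindex>,
# descend into each child sitemap; otherwise collect the <loc> entries.
# The real tool also paginates results and exports the full list to CSV.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def collect_sitemap_urls(sitemap_url: str, depth: int = 0, max_depth: int = 3) -> list[tuple[str, str]]:
    """Return (url, source_sitemap) pairs, recursing through sitemap-index files."""
    if depth > max_depth:
        return []
    with urlopen(sitemap_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())

    if root.tag.endswith("sitemapindex"):
        results = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            results.extend(collect_sitemap_urls(loc.text.strip(), depth + 1, max_depth))
        return results

    # Plain sitemap: every entry is attributed to the sitemap it came from.
    return [(loc.text.strip(), sitemap_url) for loc in root.findall("sm:url/sm:loc", NS)]
```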

lastmod histogram

The <lastmod> tag on each sitemap entry tells search engines when the page was last updated. Aggregating lastmod into a histogram tells you:

  • Bursts of activity — periods when the site published heavily.
  • Quiet periods — periods when the site was inactive.
  • Recent changes — what the operator updated yesterday.

For investigative reporting, the recent changes column is the immediate value: "what has this site changed in the last week?"
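The monthly bucketing itself is a small aggregation; a sketch assuming ISO-8601 lastmod values, as the sitemap spec requires:

```python
# Sketch of the lastmod histogram: bucket <lastmod> values by month.
# The YYYY-MM prefix of an ISO-8601 date is the month bucket.
from collections import Counter

def lastmod_histogram(lastmod_values: list[str]) -> Counter:
    """Count sitemap entries per YYYY-MM month bucket."""
    return Counter(value[:7] for value in lastmod_values if value)

print(lastmod_histogram(["2024-03-01", "2024-03-15T09:30:00+00:00", "2023-11-02"]))
# Counter({'2024-03': 2, '2023-11': 1})
```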

Diff against archived robots.txt

A toggle pulls the most recent archived robots.txt from the Wayback Machine and diffs it against the live one. Useful for spotting:

  • Newly added disallow paths (new sensitive areas).
  • Removed disallow paths (sensitive areas opened up to indexing).
  • Crawl-delay changes (a signal that crawler pressure has shifted).
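A sketch of how such a diff could be produced with the public Wayback availability API and a standard unified diff; snapshot selection and error handling are simplified here:

```python
# Sketch of the archived diff: pull the closest snapshot via the Wayback
# "availability" API and diff it against the live robots.txt with difflib.
import difflib
import json
from urllib.parse import quote
from urllib.request import urlopen

def wayback_robots_diff(site: str, live_robots: str) -> str:
    api = "https://archive.org/wayback/available?url=" + quote(f"{site}/robots.txt", safe=":/")
    with urlopen(api, timeout=10) as resp:
        closest = json.load(resp).get("archived_snapshots", {}).get("closest")
    if not closest:
        return "(no archived snapshot found)"

    with urlopen(closest["url"], timeout=10) as resp:
        archived = resp.read().decode("utf-8", errors="replace")

    return "\n".join(difflib.unified_diff(
        archived.splitlines(), live_robots.splitlines(),
        fromfile="archived robots.txt", tofile="live robots.txt", lineterm="",
    ))
```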

Pivots

Click on... | Pivot to
Disallow path | Web crawler (target the disallow path), site analysis
Sitemap URL | URL parser
Sitemap entry URL | Site analysis, redirect analyzer, Wayback
Crawl-delay value | (no pivot — informational)

Pre-fetch quick view

For very large sitemaps, a "first 50 entries" quick view loads instantly while the full recursion runs in the background. You see something useful immediately without waiting for the full crawl.
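A sketch of one way to produce that quick view: stream-parse the sitemap and stop after the first 50 <loc> entries instead of waiting for the full recursion to finish:

```python
# Sketch of the quick view: parse the sitemap incrementally and stop
# as soon as the first `limit` <loc> entries have been collected.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def first_entries(sitemap_url: str, limit: int = 50) -> list[str]:
    urls = []
    with urlopen(sitemap_url, timeout=10) as resp:
        for _, elem in ET.iterparse(resp, events=("end",)):
            if elem.tag.endswith("loc"):
                urls.append(elem.text.strip())
                if len(urls) >= limit:
                    break
    return urls
```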

Sources

  • The site's own robots.txt and sitemap.xml (and any other sitemap URLs they list).
  • The Wayback Machine for the archived-diff feature.

Every source is named on the result.
