ZeroTrace OSINT
Robots.txt & Sitemap
Crawl rules, disallowed paths, sitemap-index recursion, and the "interesting by omission" view.
The robots.txt & sitemap tool fetches both files for a site and analyses the paths the operator has told crawlers about: the paths to include (the sitemap) and the paths to exclude (robots.txt Disallow rules).
For OSINT purposes, the disallow paths are often more interesting than the sitemap. They tell you which directories the operator considers sensitive enough to ask crawlers not to index.
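A minimal sketch of that first step, using only the Python standard library. The parsing and grouping behaviour here are illustrative, not the tool's actual implementation, and `https://example.com` is a placeholder target:

```python
import urllib.request
from collections import defaultdict

def fetch_robots_rules(base_url: str) -> dict:
    """Fetch robots.txt and group directives by user agent."""
    with urllib.request.urlopen(f"{base_url}/robots.txt", timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")

    rules = defaultdict(list)
    agents = ["*"]      # directives before any User-agent line apply to all
    new_group = True    # consecutive User-agent lines share one rule group
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents = [value] if new_group else agents + [value]
            new_group = False
        else:
            new_group = True
            if field in ("allow", "disallow", "crawl-delay"):
                for agent in agents:
                    rules[agent].append((field, value))
            elif field == "sitemap":
                rules["(sitemaps)"].append((field, value))  # Sitemap: is global
    return dict(rules)

rules = fetch_robots_rules("https://example.com")  # placeholder target
for directive, value in rules.get("*", []):
    print(directive, value)
```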
What you get
| Section | What it surfaces |
|---|---|
| robots.txt rules | Per-user-agent rules: Allow, Disallow, Crawl-delay |
| Sitemap URLs | Sitemaps linked from robots.txt, plus standard locations |
| Sitemap entries | Every URL listed in the sitemap (or sitemaps, when recursing) |
| Sitemap-index recursion | Sitemap-index files unwrap into their child sitemaps automatically |
| lastmod histogram | When the sitemap entries were last updated, grouped by month |
| Diff against archived | Snapshot diff of robots.txt across Wayback captures |
Disallow paths — interesting by omission
A path that appears in Disallow: is a path the operator does not want indexed. Common reasons:
- Admin panels (`/admin`, `/wp-admin`).
- Private API endpoints (`/api/internal`, `/private`).
- Account areas (`/account`, `/dashboard`, `/profile`).
- Search results pages (`/search?`).
- Staging / preview (`/staging`, `/preview`).
For reconnaissance, these paths are known to exist (otherwise the operator would not have written a rule for them) and known to be sensitive (otherwise the operator would not have asked for them to stay out of search engines).
The tool sorts disallow paths by uniqueness across the site — paths that appear in the disallow list but not in the sitemap are highlighted as "interesting by omission."
"Interesting by omission" is the OSINT investigator's friend. The page that the site does not advertise is often the page that matters most.
Sitemap recursion
A sitemap-index file lists other sitemap files. The tool detects index files and recurses into the children, returning the merged URL list with a source-sitemap column so you can see which sitemap each URL came from.
For very large sites (e-commerce catalogs, news archives, large CMSes), recursion can return tens of thousands of URLs. The tool paginates the result; CSV export gives you the full list.
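A minimal recursive walker, assuming plain (non-gzipped), well-formed XML under the standard sitemaps.org namespace; `walk_sitemap` is a sketch, not the tool's implementation:

```python
from urllib.request import urlopen
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def walk_sitemap(url: str) -> list:
    """Return (source_sitemap, page_url) pairs, recursing into index files."""
    with urlopen(url, timeout=10) as resp:
        root = ET.fromstring(resp.read())

    if root.tag == f"{NS}sitemapindex":
        # Index file: recurse into each child sitemap it lists.
        entries = []
        for loc in root.iter(f"{NS}loc"):
            entries.extend(walk_sitemap(loc.text.strip()))
        return entries

    # Ordinary urlset: every <loc> is a page URL sourced from this sitemap.
    return [(url, loc.text.strip()) for loc in root.iter(f"{NS}loc")]
```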
lastmod histogram
The <lastmod> tag on each sitemap entry tells search engines when the page was last updated. Aggregating lastmod into a histogram tells you:
- Bursts of activity — periods when the site published heavily.
- Quiet periods — periods when the site was inactive.
- Recent changes — what the operator updated yesterday.
For investigative reporting, the recent changes column is the immediate value: "what has this site changed in the last week?"
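The aggregation itself is simple; a sketch, assuming each value starts with an ISO `YYYY-MM-DD` date (true for both the date-only and full W3C datetime forms of `<lastmod>`), with made-up example dates:

```python
from collections import Counter

def lastmod_histogram(lastmod_values):
    """Count sitemap entries per YYYY-MM month bucket."""
    return Counter(v[:7] for v in lastmod_values if len(v) >= 7)

hist = lastmod_histogram(["2024-05-01", "2024-05-19T08:30:00Z", "2024-06-02"])
print(hist.most_common())  # [('2024-05', 2), ('2024-06', 1)]
```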
Diff against archived robots.txt
A toggle pulls the most recent archived robots.txt from the Wayback Machine and diffs it against the live one. Useful for spotting:
- Newly added disallow paths (new sensitive areas).
- Removed disallow paths (sensitive areas opened up to indexing).
- Crawl-delay changes (a signal that crawler pressure on the site has changed).
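The same diff can be reproduced by hand with the Wayback Machine's public availability API and the standard-library differ; `diff_against_archived` is a sketch with error handling omitted, not the tool's code:

```python
import difflib
import json
from urllib.request import urlopen

def diff_against_archived(base_url: str, live_robots: str) -> str:
    """Unified diff of the newest archived robots.txt against the live one."""
    api = f"https://archive.org/wayback/available?url={base_url}/robots.txt"
    with urlopen(api, timeout=10) as resp:
        snapshot = json.load(resp)["archived_snapshots"].get("closest")
    if not snapshot:
        return "(no archived capture found)"

    with urlopen(snapshot["url"], timeout=10) as resp:
        archived = resp.read().decode("utf-8", errors="replace")

    return "\n".join(difflib.unified_diff(
        archived.splitlines(), live_robots.splitlines(),
        fromfile=f"archived ({snapshot['timestamp']})", tofile="live",
        lineterm="",
    ))
```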
Pivots
| Click on... | Pivot to |
|---|---|
| Disallow path | Web crawler (target the disallow path), site analysis |
| Sitemap URL | URL parser |
| Sitemap entry URL | Site analysis, redirect analyzer, Wayback |
| Crawl-delay value | (no pivot — informational) |
Pre-fetch quick view
For very large sitemaps, a "first 50 entries" quick view loads instantly while the full recursion runs in the background. You see something useful immediately without waiting for the full crawl.
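One way to get that behaviour is a generator variant of the recursive walker sketched earlier: entries stream out as each child sitemap is parsed, so the first 50 can be rendered before the recursion finishes. A sketch under the same assumptions (placeholder URL, plain well-formed XML):

```python
from itertools import islice
from urllib.request import urlopen
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def iter_sitemap(url: str):
    """Lazily yield (source_sitemap, page_url) pairs, recursing into indexes."""
    with urlopen(url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    if root.tag == f"{NS}sitemapindex":
        for loc in root.iter(f"{NS}loc"):
            yield from iter_sitemap(loc.text.strip())
    else:
        for loc in root.iter(f"{NS}loc"):
            yield (url, loc.text.strip())

# The quick view materializes only the first 50 entries; the generator
# fetches further child sitemaps only as more entries are consumed.
first_50 = list(islice(iter_sitemap("https://example.com/sitemap.xml"), 50))
```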
Sources
- The site's own `robots.txt` and `sitemap.xml` (and any other sitemap URLs they list).
- The Wayback Machine for the archived-diff feature.
Every source is named on the result.