ZeroTrace OSINT
Web Crawler
Multi-page crawl with email, phone, external-domain extraction and per-page tech-stack hints.
The web crawler walks a site, page by page, and extracts the contact-and-context data scattered across its public pages. It is the broad-strokes reconnaissance tool — give it an apex URL, get back a structured digest of everything publicly visible.
What you get
For a configurable depth and page count, the crawler returns:
| Section | What it surfaces |
|---|---|
| Pages crawled | URL, status code, title, content length per page |
| Emails | Every email address discovered across the crawl, deduped |
| Phone numbers | Every phone number discovered, with E.164 normalisation |
| External domains | Every external host the site links to, with link counts |
| Form actions | Every form's action URL — endpoints that accept user input |
| Per-page tech-stack hints | A lightweight version of site analysis per page, surfacing CMS / framework changes across the site |
| Sitemap auto-seed | The crawl seeds itself with the URLs found in the site's sitemap (auto-composed via the robots/sitemap tool) |
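The sitemap auto-seed is conceptually simple. A minimal sketch, assuming the sitemap lives at the conventional `/sitemap.xml` and uses the standard `<urlset>` schema (the real crawler derives the location from `robots.txt` via the robots/sitemap tool):

```python
import xml.etree.ElementTree as ET
import urllib.request

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def seed_from_sitemap(base_url: str) -> list[str]:
    """Fetch /sitemap.xml and return the <loc> URLs it lists.

    Hypothetical helper: assumes the conventional sitemap path rather than
    reading the Sitemap directive from robots.txt.
    """
    sitemap_url = base_url.rstrip("/") + "/sitemap.xml"
    try:
        with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
            root = ET.fromstring(resp.read())
    except Exception:
        return []  # no sitemap is not an error; the crawl simply starts unseeded
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```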
Configuration
Inputs:
- Start URL — the page the crawl begins from.
- Max pages — hard cap on pages visited. Default 100.
- Max depth — hard cap on link-distance from the start URL. Default 3.
- Same-host only — restrict to the start URL's host (recommended). Default on.
- Wordlist seed — optional list of additional paths to try (`/admin`, `/api/v1`, etc.).
- Crawl delay — politeness delay between requests. Defaults to a courteous value.
The crawler respects robots.txt by default. A toggle disables that behaviour; do not use it without a clear, authorised reason.
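To make the interplay of these caps concrete, here is a hypothetical sketch of the crawl loop in Python. Function and parameter names are illustrative, not the tool's actual implementation, and `requests` and `BeautifulSoup` are assumed dependencies:

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                 # assumed dependency
from bs4 import BeautifulSoup   # assumed dependency

def crawl(start_url, max_pages=100, max_depth=3,
          same_host_only=True, delay=1.0, respect_robots=True):
    """Breadth-first crawl bounded by page count and link depth (illustrative)."""
    start_host = urlparse(start_url).netloc

    robots = urllib.robotparser.RobotFileParser()
    if respect_robots:
        robots.set_url(f"{urlparse(start_url).scheme}://{start_host}/robots.txt")
        robots.read()

    queue = deque([(start_url, 0)])      # (url, link distance from start)
    seen, pages = {start_url}, []

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if respect_robots and not robots.can_fetch("*", url):
            continue

        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        pages.append({
            "url": url,
            "status": resp.status_code,
            "title": soup.title.string.strip() if soup.title and soup.title.string else "",
            "length": len(resp.content),
        })
        time.sleep(delay)                # politeness delay between requests

        if depth >= max_depth:
            continue                     # record the page but do not expand its links
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if same_host_only and urlparse(link).netloc != start_host:
                continue                 # external hosts are recorded, not crawled
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```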
The web crawler generates HTTP traffic against the target. Use it only on sites where you have authorisation, or sites that are openly published for public consumption. Aggressive crawling can be misread as an attack.
Email and phone extraction
Every page is scanned for:
- Email addresses matching standard patterns, plus common obfuscations (`name [at] example [dot] com`).
- Phone numbers matching international and national patterns, normalised to E.164 where possible.
Deduplication is automatic. The result table shows the count of pages each email / phone appeared on, so the most-mentioned contact rises to the top.
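As a rough illustration of the extraction and page-count bookkeeping (the tool's actual regexes and obfuscation handling are not published; the `phonenumbers` library used here is an assumption):

```python
import re
from collections import Counter

import phonenumbers  # assumed dependency for E.164 normalisation

# De-obfuscate "name [at] example [dot] com" style addresses before matching.
OBFUSCATIONS = [(r"\s*\[\s*at\s*\]\s*", "@"), (r"\s*\[\s*dot\s*\]\s*", ".")]
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_contacts(pages, default_region="US"):
    """pages maps URL -> extracted page text; returns per-contact page counts."""
    emails, phones = Counter(), Counter()
    for url, text in pages.items():
        for pattern, repl in OBFUSCATIONS:
            text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
        # Sets dedupe within a page, so each contact counts once per page.
        emails.update({e.lower() for e in EMAIL_RE.findall(text)})
        phones.update({
            phonenumbers.format_number(m.number, phonenumbers.PhoneNumberFormat.E164)
            for m in phonenumbers.PhoneNumberMatcher(text, default_region)
        })
    # Counters now hold "pages the contact appeared on", ready to sort descending.
    return emails.most_common(), phones.most_common()
```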
Form actions
Every `<form>` element on every page contributes its `action` attribute to the form-actions list. This tells you:
- Endpoints that accept user input — login forms, search forms, contact forms, upload forms, comment forms.
- Cross-origin form submissions — forms that POST to a different origin (often legitimate third-party services, sometimes a misconfiguration).
For authorised security testing, the form-actions list is the input to web-application-test planning.
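A sketch of how per-page form actions might be collected and flagged as cross-origin, assuming an HTML parser such as BeautifulSoup:

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup  # assumed HTML parser

def extract_form_actions(page_url: str, html: str) -> list[dict]:
    """Return each form's resolved action URL and whether it crosses origins."""
    page_host = urlparse(page_url).netloc
    actions = []
    for form in BeautifulSoup(html, "html.parser").find_all("form"):
        # A missing or empty action submits back to the current page.
        action = urljoin(page_url, form.get("action") or page_url)
        actions.append({
            "page": page_url,
            "action": action,
            "method": (form.get("method") or "GET").upper(),
            "cross_origin": urlparse(action).netloc != page_host,
        })
    return actions
```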
Secrets in JS
The crawler optionally scans inline and linked JavaScript for common secret patterns:
- AWS access keys (`AKIA...`).
- Google Cloud API keys.
- Slack tokens (`xoxb-`, `xoxp-`).
- Stripe keys.
- Generic API-key-shaped strings.
A match is a finding to investigate, not a confirmed leak — many matches are placeholders or test keys. But every real secret leak in a public JS bundle started as a pattern hit somewhere.
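The patterns themselves are ordinary regular expressions. An illustrative subset, not the tool's bundled catalog:

```python
import re

# Illustrative secret-shaped patterns; real catalogs are larger and tuned
# to cut false positives.
SECRET_PATTERNS = {
    "aws_access_key":  re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "google_api_key":  re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b"),
    "slack_token":     re.compile(r"\bxox[bp]-[0-9A-Za-z-]{10,}\b"),
    "stripe_key":      re.compile(r"\b[sr]k_(?:live|test)_[0-9a-zA-Z]{10,}\b"),
    "generic_api_key": re.compile(
        r"(?i)\bapi[_-]?key\b['\"]?\s*[:=]\s*['\"][0-9A-Za-z_\-]{16,}['\"]"),
}

def scan_js(source: str) -> list[tuple[str, str]]:
    """Return (pattern name, matched string) pairs: findings to verify, not leaks."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, match) for match in set(pattern.findall(source)))
    return hits
```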
Per-page tech-stack hints
Each page contributes its own lightweight tech-stack fingerprint to the crawl. Useful for spotting:
- Stack discontinuities — a primary site running on WordPress while one section is served by a Django app on a subdomain.
- Multiple CMSes on the same domain — often migration artefacts.
- Embedded third-party tools — admin areas, support widgets, embedded apps.
For full per-page detail, pivot to site analysis.
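A common source of these hints is the `<meta name="generator">` tag plus a couple of response headers; a hypothetical sketch of a per-page fingerprint:

```python
from bs4 import BeautifulSoup  # assumed HTML parser

def tech_hints(html: str, headers: dict[str, str]) -> dict[str, str]:
    """Very rough per-page fingerprint from cheap signals only (illustrative)."""
    hints = {}
    meta = BeautifulSoup(html, "html.parser").find("meta", attrs={"name": "generator"})
    if meta and meta.get("content"):
        hints["generator"] = meta["content"]          # e.g. "WordPress 6.4"
    for header in ("Server", "X-Powered-By"):
        if header in headers:
            hints[header.lower()] = headers[header]
    if "wp-content" in html:
        hints.setdefault("cms", "WordPress (asset-path heuristic)")
    return hints
```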
Pivots
| Click on... | Pivot to |
|---|---|
| Page URL | Site analysis, redirect analyzer, Wayback |
| Email | Email analyzer, password breach lookup, person investigation |
| Phone | Phone lookup |
| External domain | DNS, WHOIS, certificate transparency, site analysis |
| Form action URL | URL parser, redirect analyzer |
| JS secret pattern | (no pivot — copy and verify externally) |
Sources
- Direct HTTP requests to each crawled URL (rate-limited, robots.txt-honouring by default).
- The site's own `sitemap.xml` for auto-seed.
- A bundled secret-pattern catalog for the JS scan.
The crawler does not call any external API — it only fetches the target site itself.