Daylytix crawls your website the same way a search engine bot does — systematically following links from page to page and inspecting every resource it finds. This article explains exactly how the crawl engine works, what data it collects on each page, and how you can configure it to suit your site's size and structure.

Breadth-First Search algorithm

The crawler uses a Breadth-First Search (BFS) algorithm. Starting from the root URL you provide, it discovers and queues all internal links found on that page, then processes each of those pages in turn — layer by layer — before going deeper. This ensures that the most important, highest-authority pages (those closest to the homepage) are always crawled first, and that shallow pages are prioritised over deep ones when the crawl page limit is reached.

The BFS queue works as follows:

  1. Root URL is added to the queue at depth 0.
  2. All internal <a href> links on the root page are extracted and added to the queue at depth 1.
  3. Each page at depth 1 is fetched, and its links are added to the queue at depth 2.
  4. This continues until the configured max depth or max pages limit is hit.
Tip: For most sites, the default depth of 3 is sufficient. Large e-commerce stores or sites with deep category trees may benefit from increasing depth to 4 or 5.
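The queue behaviour above can be sketched in a few lines of Python. This is an illustrative sketch, not Daylytix's actual implementation; fetch_links stands in for the page-fetching and link-extraction step described in the next section.

```python
from collections import deque

def bfs_crawl(root_url, fetch_links, max_depth=3, max_pages=500):
    """Crawl breadth-first from root_url. fetch_links(url) is an
    assumed helper returning the internal links found on that page."""
    queue = deque([(root_url, 0)])   # (url, depth); root enters at depth 0
    seen = {root_url}
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue                 # don't enqueue links beyond max depth
        for link in fetch_links(url):
            if link not in seen:     # each URL is queued only once
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Because the queue is first-in first-out, every depth-1 page is processed before any depth-2 page, which is exactly why shallow pages win when the max-pages cap is hit.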

On each crawled page, the engine parses the full HTML document and extracts all <a href="..."> tags. Links are then categorised:

  • Internal links — same domain or subdomain as the root URL. These are added to the crawl queue.
  • External links — pointing to a different domain. These are recorded and audited for broken status but are never followed.
  • Fragment links (#section) and javascript: hrefs are ignored.
  • Duplicate URLs are deduplicated automatically — each URL is crawled only once.
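These categorisation rules can be sketched with the standard library. The helper name is illustrative, and subdomain handling is simplified to the default "Follow subdomains: off" behaviour, where only an exact hostname match counts as internal.

```python
from urllib.parse import urlparse, urljoin

def categorise_link(href, page_url, root_host):
    """Classify a raw href as 'internal', 'external', or 'ignore'.
    root_host is the hostname of the crawl's root URL."""
    if not href or href.startswith("#") or href.startswith("javascript:"):
        return "ignore"              # fragment and javascript: links are skipped
    absolute = urljoin(page_url, href)   # resolve relative hrefs against the page
    host = urlparse(absolute).hostname
    if host == root_host:
        return "internal"            # queued for crawling
    return "external"                # recorded and status-checked, never followed
```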

URL normalisation is applied before deduplication: trailing slashes are handled according to your trailing-slash consistency preference, and query strings are preserved unless they are common tracking parameters (utm_source, utm_medium, etc.), which are stripped to avoid crawling the same page multiple times.
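A minimal sketch of that normalisation step, assuming a small illustrative set of tracking parameters (the real list is longer):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content"}   # illustrative subset

def normalise(url, trailing_slash=True):
    """Normalise a URL before deduplication: strip tracking parameters
    and apply the configured trailing-slash preference."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    path = parts.path or "/"
    if trailing_slash and not path.endswith("/"):
        path += "/"
    elif not trailing_slash and len(path) > 1:
        path = path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))
```

With this in place, "/page?utm_source=news&id=7" and "/page/?id=7" collapse to the same key, so the page is fetched only once.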

robots.txt compliance

Before adding any URL to the crawl queue, the crawler checks the site's robots.txt file. Any URL that matches a Disallow rule for the * or Daylytix user-agent is skipped and flagged in the audit results as "blocked by robots.txt".

The robots.txt file is fetched once at the start of the crawl and cached for the entire session. If the file is unreachable (non-200 response), the crawler proceeds as if no disallow rules exist and flags a warning in the technical issues section.

Note: Daylytix respects your robots.txt rules — URLs blocked by robots.txt will not appear in your crawl results. If you want to audit a staging site where robots.txt blocks all bots, temporarily allow the Daylytix user-agent or use the "Ignore robots.txt" override in advanced crawl settings.
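The check itself maps cleanly onto Python's stdlib robots.txt parser. This sketch assumes the robots.txt body has already been fetched once at crawl start; passing None models the unreachable-file case, which allows everything, mirroring the behaviour described above. The helper name is illustrative.

```python
from urllib import robotparser

def build_robots_check(robots_txt, user_agent="Daylytix/1.0"):
    """Return a can_fetch(url) predicate built from the robots.txt
    body fetched at crawl start (cached for the session)."""
    if robots_txt is None:
        return lambda url: True      # unreachable file: proceed with no rules
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return lambda url: rp.can_fetch(user_agent, url)
```

Every candidate URL is passed through this predicate before it enters the queue; disallowed URLs are flagged rather than fetched.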

Configurable crawl settings

You can customise the crawl behaviour from the audit start screen or within a monitoring schedule. The following settings are available:

  • Start URL (no default) — The root page the crawler begins from. Usually your homepage.
  • Max depth (default: 3) — Maximum number of hops from the root URL. A depth of 3 means the crawler will follow links 3 levels deep from the homepage.
  • Max pages (default: 500) — Hard cap on total pages crawled. The crawl stops once this limit is reached, prioritising shallower pages first.
  • Crawl rate (default: 2–5 req/s) — Polite delay between requests to avoid overloading the server. Jitter is added to mimic natural browsing patterns.
  • Ignore robots.txt (default: off) — When enabled, disallow rules in robots.txt are bypassed. Use only on sites you own.
  • Follow subdomains (default: off) — When enabled, links to subdomains (e.g. blog.example.com) are crawled as internal pages.
  • Strip query strings (default: tracking params only) — Controls how URL parameters are handled during deduplication.
  • User-Agent (default: Daylytix/1.0) — The user-agent string sent with each request. Can be customised on Agency plans.

Data collected per page

For each URL successfully crawled, Daylytix stores a rich snapshot of on-page data used throughout the audit checks:

On-page content

  • Title tag — text content, character length, pixel-estimated width
  • Meta description — text content and length
  • H1–H6 headings — full text of each heading, including hierarchy order
  • Word count — total visible text words, excluding nav/footer boilerplate where detectable
  • Canonical tag — the rel="canonical" href if present
  • Open Graph & Twitter Card tags — og:title, og:description, og:image, twitter:card, and all variants
  • Schema / JSON-LD — all structured data blocks extracted and typed (Article, Product, FAQPage, etc.)
  • Hreflang tags — all language alternates declared on the page
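This kind of extraction can be done with the stdlib HTML parser alone. The sketch below captures just the title, meta description, canonical href, and headings; the real engine collects far more, so treat this purely as an illustration of the approach.

```python
from html.parser import HTMLParser

class PageSnapshot(HTMLParser):
    """Minimal on-page extraction sketch (stdlib only)."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = None
        self.canonical = None
        self.headings = []            # [tag, text] pairs in document order
        self._current = None          # tag whose text we are capturing

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            self._current = tag
            if tag != "title":
                self.headings.append([tag, ""])
        elif tag == "meta" and attrs.get("name") == "description":
            self.meta_description = attrs.get("content")
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current:
            self.headings[-1][1] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None
```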

Technical signals

  • HTTP status code — 200, 301, 302, 404, 500, etc.
  • Response time (ms) — time from request sent to first byte received
  • Content-Type header
  • Images — src, alt text (or absence), width/height attributes, loading attribute
  • Internal links on page — anchor text, href, follow/nofollow
  • External links on page — href, follow/nofollow, status code (checked asynchronously)
  • Page depth — number of hops from the root URL

Soft-404 detection

A soft-404 is a page that returns an HTTP 200 OK status but actually shows a "not found" or error message to the user. These are problematic for SEO because search engines may index empty or unhelpful pages.

Daylytix detects soft-404s using two signals:

  • Thin content — pages with fewer than 50 words of visible body text. These are flagged as potential soft-404s unless they are known template types (e.g. contact pages with forms).
  • Error phrase detection — if the page body contains phrases such as "page not found", "404", "this page does not exist", "sorry, we couldn't find", etc., it is flagged as a confirmed soft-404 regardless of the HTTP status code.

Soft-404s are reported in the Technical SEO section of your audit with the affected URLs listed.
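The two signals combine into a simple classifier. In this sketch the 50-word threshold comes from the article, while the function name and the phrase list (an illustrative subset) are assumptions.

```python
ERROR_PHRASES = ("page not found", "this page does not exist",
                 "sorry, we couldn't find")   # illustrative subset

def classify_soft_404(status, word_count, body_text, is_template_page=False):
    """Return 'confirmed', 'potential', or None for a crawled page."""
    if status != 200:
        return None                   # real error codes are handled elsewhere
    text = body_text.lower()
    if any(phrase in text for phrase in ERROR_PHRASES):
        return "confirmed"            # error phrase overrides the 200 status
    if word_count < 50 and not is_template_page:
        return "potential"            # thin content, unless a known template
    return None
```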

Real-time crawl progress

Daylytix uses Server-Sent Events (SSE) to stream crawl progress to your browser in real time — no page refresh needed. The progress bar and status messages update live as the crawl runs.

Five types of SSE events are emitted during an audit:

  • crawl (payload: crawled, total) — Emitted after each batch of pages is fetched. Updates the crawl progress bar with pages crawled vs. total discovered.
  • check (payload: key, issues, passes, crawled) — Emitted when each audit check completes (e.g. "security", "on_page", "accessibility"). Shows how many issues and passes that check found.
  • building (no payload) — Emitted when the crawl is complete and post-crawl analysis is running (accessibility scoring, schema recommendations, topical map, etc.).
  • done (payload: full audit data JSON) — Emitted when the entire audit is finished. The dashboard switches from progress view to results view automatically.
  • error (payload: msg) — Emitted if an unrecoverable error occurs. The error message is shown in the dashboard.
Note: If you close the browser tab during a crawl, the audit continues running on the server. You can return to the dashboard later and load the completed audit from your audit history.
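On the wire, each SSE event is an "event:" line, optional "data:" lines, and a blank terminator. In the browser this is handled by the built-in EventSource API; the Python sketch below only illustrates the wire format, with field names following the table above.

```python
import json

def parse_sse(stream_lines):
    """Yield (event, payload) pairs from raw SSE stream lines."""
    event, data = None, []
    for line in stream_lines:
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":              # blank line terminates one event
            if event:
                payload = json.loads("\n".join(data)) if data else None
                yield event, payload
            event, data = None, []
```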

Post-crawl analysis

Once all pages have been fetched, Daylytix runs a series of post-crawl checks that analyse the full dataset rather than individual pages. These run during the "building" phase and typically take 5–15 seconds depending on site size:

  • Accessibility scoring — analyses alt text coverage, anchor quality, and heading hierarchy violations across all pages and produces a 0–100 score.
  • Schema recommendations — compares each page's detected type against its existing structured data and recommends missing schema types (BlogPosting, Product, FAQPage, etc.).
  • Internal linking suggestions — identifies underlinked pages (receiving fewer internal links than the site average) and suggests specific source pages and anchor text to add.
  • Topical authority map — clusters pages into content topics using keyword overlap and URL structure, then scores each cluster for depth, breadth, and authority.
  • Content decay — cross-references crawl data with historical GSC click data to flag pages that had strong traffic but are declining.
  • Duplicate content detection — compares titles and content fingerprints across all crawled pages to surface near-duplicate or identical pages.
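The duplicate-content step can be illustrated with a content fingerprint: hash each page's normalised body text so identical pages collide. This is a deliberately simplified sketch; production fingerprinting typically uses shingling or simhash to catch near-duplicates, not just exact matches after normalisation.

```python
import hashlib
import re

def content_fingerprint(text):
    """Hash body text with whitespace and case collapsed."""
    normalised = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalised.encode()).hexdigest()

def find_duplicates(pages):
    """Group URLs by fingerprint; pages maps url -> body text."""
    groups = {}
    for url, text in pages.items():
        groups.setdefault(content_fingerprint(text), []).append(url)
    return [urls for urls in groups.values() if len(urls) > 1]
```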

Crawl speed and server courtesy

Daylytix is designed to be a polite crawler. Requests are sent at 2–5 per second with randomised delays between batches. This rate is intentionally conservative to avoid triggering rate-limiting or impacting your server's performance during a live audit.
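The 2–5 req/s figure comes from the table above; the randomised delay between requests can be sketched like this (the helper name is illustrative):

```python
import random
import time

def polite_delay(rate_low=2.0, rate_high=5.0):
    """Sleep long enough to keep the effective rate between
    rate_low and rate_high requests per second, with jitter."""
    delay = 1.0 / random.uniform(rate_low, rate_high)
    time.sleep(delay)
    return delay
```

Picking a fresh random rate per request keeps the intervals irregular, which reads more like natural browsing than a fixed tick.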

For a site with 500 pages, a typical full crawl completes in 2–4 minutes, and the post-crawl analysis phase adds another 5–15 seconds. Total audit time including all API-based checks (PageSpeed Insights, GSC, GA4) is typically under 5 minutes for most sites.