When running a report, the Insites spider picks the pages it downloads using our proprietary prioritisation algorithm. The exact way this works is complicated and commercially sensitive, but we can share some high-level details to help you understand why it picks the pages it does.
Sources
The spider is “seeded” with the homepage provided to trigger the audit, as well as any pages in the XML sitemap (if there is one).
Each time a page is downloaded, the URLs found on that page (for example in links) are added to the list of potential pages to check.
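As a rough illustration of the two sources above, the sketch below keeps a list of candidate URLs that starts with the homepage and any sitemap entries, then grows as links are found on downloaded pages. It is not the Insites implementation: the function names, the "seed"/"sitemap"/"link" labels and the data structures are all assumptions made for the example.

```python
from collections import Counter


def seed_frontier(homepage, sitemap_urls):
    """Start the candidate list with the homepage plus any sitemap URLs.

    Hypothetical helper: maps each URL to how it was first discovered.
    """
    frontier = {homepage: "seed"}
    for url in sitemap_urls:
        # Keep the first discovery source if a URL appears more than once.
        frontier.setdefault(url, "sitemap")
    return frontier


def record_links(frontier, seen_counts, links):
    """Add the URLs found on a downloaded page to the candidate list."""
    for url in links:
        seen_counts[url] += 1             # how often the URL has been seen
        frontier.setdefault(url, "link")  # new URLs become candidates


frontier = seed_frontier("https://example.com/", ["https://example.com/about"])
seen_counts = Counter()
record_links(
    frontier,
    seen_counts,
    [
        "https://example.com/contact",
        "https://example.com/about",
        "https://example.com/contact",
    ],
)
```

After this runs, the candidate list holds the homepage, the sitemap page and the newly linked contact page, and the counter records how many times each linked URL was seen.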
Factors used in prioritisation
When Insites picks the next page to check, here are some of the factors it considers:
How did we find the URL? (sitemap, link from another page, etc.)
If we found the URL on another page, how far up the page did we find it?
How often have we seen that URL on the pages we already downloaded?
Is the page a “priority page”? (see below)
Is the page on a subdomain, e.g. blog.mysite.com? (Only applies when the setting to crawl subdomains is enabled.)
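To make the idea of weighing several factors concrete, here is a minimal sketch that combines the factors listed above into a single score and picks the highest-scoring candidate. The real algorithm is not public, so everything here is an assumption: the Candidate fields, the weights, and even the direction of each factor (for example, that links found higher up a page score better) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    source: str                 # "sitemap" or "link" (hypothetical labels)
    position: float = 1.0       # 0.0 = top of the page it was found on, 1.0 = bottom
    times_seen: int = 1         # how often the URL appeared on downloaded pages
    is_priority: bool = False   # looks like a "priority page"
    on_subdomain: bool = False  # only relevant if subdomain crawling is enabled


def score(c: Candidate) -> float:
    """Combine the factors into one number. All weights are made up."""
    s = 2.0 if c.source == "sitemap" else 1.0
    s += 1.0 - c.position           # assumed: higher up the page scores better
    s += 0.5 * min(c.times_seen, 5)  # assumed: frequently seen URLs score better
    if c.is_priority:
        s += 3.0
    if c.on_subdomain:
        s += 0.5
    return s


def pick_next(candidates):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


footer_link = Candidate("/careers", "link", position=0.9, times_seen=1)
nav_link = Candidate("/services", "link", position=0.1, times_seen=3)
chosen = pick_next([footer_link, nav_link])
```

With these made-up weights, the navigation link is picked ahead of the footer link because it was found higher up the page and seen more often.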
Priority pages
To produce a “fair” report, the spider will try to check at least one page that appears to be each of the following:
A blog post
A contact page
A product page
A “service” page
A page from each subdomain that is discovered (when the setting to crawl subdomains is enabled)
Conversely, Insites will avoid checking legal pages (e.g. terms and conditions or a privacy policy). Note that these are only biases: if there are only 5 pages on the site and one of those is a terms page, then the terms page will still be checked as part of the audit.
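The sketch below shows one way such biases could behave, under stated assumptions: each candidate already carries a hypothetical "kind" label (how pages are actually classified is not covered here), one page of each priority kind is taken first, other pages fill the remaining slots, and terms pages are used only if slots are still free. The real spider applies softer biases than this strict ordering.

```python
PRIORITY_KINDS = ("blog", "contact", "product", "service")  # assumed labels


def choose_pages(candidates, budget):
    """Pick up to `budget` URLs from (url, kind) pairs given in crawl order."""
    chosen = []

    def take(predicate, one_per_kind=False):
        taken_kinds = set()
        for url, kind in candidates:
            if len(chosen) >= budget:
                return
            if url in chosen or not predicate(kind):
                continue
            if one_per_kind and kind in taken_kinds:
                continue
            chosen.append(url)
            taken_kinds.add(kind)

    take(lambda k: k in PRIORITY_KINDS, one_per_kind=True)  # one of each kind
    take(lambda k: k != "terms")                            # then everything else
    take(lambda k: k == "terms")                            # terms pages last
    return chosen


site = [
    ("/", "other"),
    ("/terms", "terms"),
    ("/blog/a", "blog"),
    ("/blog/b", "blog"),
    ("/contact", "contact"),
]
small_audit = choose_pages(site, budget=3)
full_audit = choose_pages(site, budget=5)
```

With a budget of 3 the terms page is skipped, but with a budget of 5 on this five-page site it is still included, which mirrors the behaviour described above.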
Ignoring and including specific pages
When viewing a report, click the cog in the top right to open the report settings. In the website section of the report settings, you can add specific paths that you want the platform to either avoid or include in the scan.