Crawler Basics

How it Works

The crawler automatically crawls your website, following any discovered links that match a certain pattern. By default, a pattern is created for you based on the starting link you provide. For example, if you provided:

https://help.example.com/faqs

The crawler will only follow links that start with https://help.example.com/faqs (a short sketch of this rule follows the lists below), such as:

  • https://help.example.com/faqs/login
  • https://help.example.com/faqs/password
  • https://help.example.com/faqs/input/index.html

But not:

  • https://help.example.com
  • https://www.example.com
  • https://search-engine.com
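
In code, this default rule amounts to a simple prefix check on each discovered URL. The sketch below is illustrative only, assuming plain string-prefix matching on the starting URL; the function name should_follow is hypothetical and not part of the product.

  # Illustrative sketch only: the default rule as a string-prefix check.
  START_URL = "https://help.example.com/faqs"

  def should_follow(link: str, start_url: str = START_URL) -> bool:
      """Return True if the default pattern would follow this link."""
      return link.startswith(start_url)

  # Followed:
  assert should_follow("https://help.example.com/faqs/login")
  assert should_follow("https://help.example.com/faqs/input/index.html")
  # Not followed:
  assert not should_follow("https://help.example.com")
  assert not should_follow("https://www.example.com")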

Picking a Good Starting Page

Given how the crawler works, you want a starting URL whose page links out to the other subdirectories and pages you want crawled. For example:

  • https://support.example.com: a subdomain of the main website that contains all of the desired content.
  • https://example.com/help: a path on the main website that contains all of the desired content.

Support for Schema.org FAQPage

If the crawler finds embedded schema.org FAQPage markup, it will parse it and automatically add the FAQs to your Answers content.
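
As an illustration, FAQPage markup is commonly embedded as JSON-LD, and extracting question/answer pairs from it could look like the sketch below. This is an assumption-laden sketch, not the crawler's actual parser; JsonLdCollector and extract_faqs are hypothetical names.

  # Hypothetical sketch: pull (question, answer) pairs out of FAQPage
  # JSON-LD embedded in a page. Not the crawler's actual implementation.
  import json
  from html.parser import HTMLParser

  class JsonLdCollector(HTMLParser):
      """Collects the text of <script type="application/ld+json"> tags."""
      def __init__(self):
          super().__init__()
          self.blocks = []
          self._in_jsonld = False

      def handle_starttag(self, tag, attrs):
          if tag == "script" and ("type", "application/ld+json") in attrs:
              self._in_jsonld = True

      def handle_endtag(self, tag):
          if tag == "script":
              self._in_jsonld = False

      def handle_data(self, data):
          if self._in_jsonld:
              self.blocks.append(data)

  def extract_faqs(page_html: str) -> list[tuple[str, str]]:
      """Return (question, answer) pairs from any FAQPage JSON-LD blocks."""
      collector = JsonLdCollector()
      collector.feed(page_html)
      faqs = []
      for block in collector.blocks:
          try:
              data = json.loads(block)
          except ValueError:
              continue  # skip malformed JSON-LD
          if isinstance(data, dict) and data.get("@type") == "FAQPage":
              for item in data.get("mainEntity", []):
                  question = item.get("name", "")
                  answer = item.get("acceptedAnswer", {}).get("text", "")
                  faqs.append((question, answer))
      return faqs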

Whitelist Patterns

By default, Studio will set a pattern that keeps the crawl within the same domain. For example, the pattern generated for the starting page:

https://www.example.com

is

https://[(www.)?]example.com[.+]

The patterns use a special notation (see the [] above) to define which pages should be visited. This notation is also used by other web crawlers such as Apify, which documents it under the name pseudo-URL.

The notation allows you to put regular expressions inside the []. Looking back at the example above, the (www.)? within the brackets means the www. prefix is optional, and the .+ means any URL that continues past example.com is acceptable.
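
To make the notation concrete, the sketch below shows one way such a pattern could be compiled into a regular expression: bracketed segments are kept as raw regex, and everything outside them is matched literally. This is an illustrative assumption about the mechanics, not Studio's or Apify's actual implementation; pseudo_url_to_regex is a hypothetical helper.

  # Illustrative sketch: compile a bracket-notation pattern into a regex.
  import re

  def pseudo_url_to_regex(pattern: str) -> re.Pattern:
      """Treat [...] segments as raw regex and everything else as literal."""
      # re.split with one capture group alternates literal text (even
      # indices) with bracket contents (odd indices).
      parts = re.split(r"\[([^\]]*)\]", pattern)
      body = "".join(
          re.escape(part) if i % 2 == 0 else part
          for i, part in enumerate(parts)
      )
      return re.compile(f"^{body}$")

  whitelist = pseudo_url_to_regex("https://[(www.)?]example.com[.+]")
  assert whitelist.match("https://www.example.com/help/login")
  assert whitelist.match("https://example.com/faqs")
  assert not whitelist.match("https://search-engine.com/page")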