Skip to main content

Web Crawler CLI

There are some instances where you may need to crawl content from your computer. The same crawler we use in Studio is also available as a Node.js based CLI.

We recommend installing it globally:

npm install -g @xapp/arachne

Example Usage

To crawl a site and save the pages to a local ./temp directory

arachne crawl https://documentation.xapp.ai/ -d ./temp

To also save markdown and schema.org FAQs

arachne  crawl https://documentation.xapp.ai/ -a -t markdown -d ./temp

With a whitelisted patterns file

arachne  crawl http://www.thecoffeefaq.com/ -a -t markdown -d ./temp -w ./temp/whitelist.md

Common Issues

Running on WSL2

If attempting to run on the Windows Subsystem for Linus (WSL 2) and either the crawler fails to start or provides the following error message:

Unable to start the crawler.
TimeoutError: Timed out after 5000 ms while trying to connect to the browser! Only Chrome at revision r982053 is guaranteed to work.
at Timeout.onTimeout (/home/mycul/.nvm/versions/node/v12.18.4/lib/node_modules/@xapp/arachne-cli/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:252:20)
at listOnTimeout (internal/timers.js:549:17)
at processTimers (internal/timers.js:492:7)

you may need to take a few extra steps to allow the crawler to find Chrome. Please see the instructions as outlined here. When you start VcXsrv, select "Multiple Windows", "Start no client", and "Disable access control".