Limiting crawls
You can limit the number of visits and depth of crawls
The crawl endpoint can be limited in two ways: you can limit the number of pages visited, and you can limit the depth (or distance) of the crawl from the starting URLs.
Limiting a crawl controls costs. Each visit uses bandwidth, which incurs a cost depending on the proxy you use, and there is also a small per-visit charge from FetchFox. A cap on the number of visits therefore also caps your cost.
Limiting crawls also affects runtime. Although FetchFox visits pages concurrently during a crawl, there is a cap on that concurrency. A crawl with thousands of visits will take longer to execute than one with hundreds or fewer.
Depending on the number of URLs you want to find, you may not need to visit very many pages in a crawl.
Limit the number of visits
The `max_visits` parameter puts a limit on the number of pages visited in the crawl. The crawl will never visit more than that many URLs.
Below is an example of a call to the crawl endpoint with a limit.
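The snippet below is a minimal sketch in Python using the `requests` library. The endpoint URL, the authentication header, and the `pattern` field are assumptions for illustration; `max_visits` is the parameter described above.

```python
import requests

# Assumed endpoint URL and auth scheme; adjust to match your FetchFox account setup.
API_URL = "https://api.fetchfox.ai/api/v2/crawl"
API_KEY = "YOUR_API_KEY"

payload = {
    # Assumed field describing which URLs the crawl should find.
    "pattern": "https://example.com/category/*",
    # Never visit more than 100 pages during this crawl.
    "max_visits": 100,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
print(response.json())
```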
This crawl will never visit more than the specified max pages.
Limit the depth
The `max_depth` parameter limits the depth of the crawl.
Depth means how far FetchFox will go from the starting URLs. The starting URLs are determined in one of two ways.
- If you pass in an array of URLs in `start_urls`, those are considered the starting URLs.
- If you do not pass in `start_urls`, then FetchFox will automatically set the starting URLs.
With that in mind, suppose a crawl starts at `https://example.com/`, and it finds a link to `https://example.com/category`. Following that link brings us to a depth of 1. If there is a link from there to `https://example.com/category/product`, that would be a depth of 2, and so on.
If there are multiple links to a page, the lowest depth value is used. All the starting URLs are considered to have a depth of 0.
Below is an example of a crawl that limits the maximum depth.
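As before, this is a sketch rather than a definitive call: the endpoint URL, auth header, and `pattern` field are assumptions, while `start_urls` and `max_depth` are the parameters described above.

```python
import requests

# Assumed endpoint URL and auth scheme; adjust to match your FetchFox account setup.
API_URL = "https://api.fetchfox.ai/api/v2/crawl"
API_KEY = "YOUR_API_KEY"

payload = {
    # Explicit starting URLs; these are considered depth 0.
    "start_urls": ["https://example.com/"],
    # Assumed field describing which URLs the crawl should find.
    "pattern": "https://example.com/category/*",
    # Follow links at most 2 hops away from the starting URLs.
    "max_depth": 2,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
print(response.json())
```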
This crawl will not travel beyond the specified depth from the starting URLs.