Crawl using URL patterns

FetchFox’s crawl endpoint takes a URL pattern as input and returns a list of URLs that match it. It works by repeatedly visiting pages on the target domain, discovering new URLs to visit, and returning the ones that match the pattern. Several parameters control how the crawl is executed.

Using URL patterns

A basic crawl needs only one input: the pattern parameter, which holds a URL pattern.

A URL pattern is a string that may contain the * and :* operators. Both are wildcards, but they behave slightly differently.

* matches any sequence of characters, including /
:* matches any sequence of characters except /

URL patterns must be valid URLs with a domain. The domain may not contain wildcards.

Below are a few examples of URL patterns and what they match.

URL Pattern	Matches	Does not match
https://example.com/*	https://example.com/c/page-1	https://othersite.com/page
https://example.com/a/*	https://example.com/a/b/page	https://example.com/c/page-1
https://example.com/a/:*	https://example.com/a/page	https://example.com/a/b/page
https://example.com/a/page-1	https://example.com/a/page-1	https://example.com/a/page-2
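
To make the two wildcards concrete, here is a minimal Python sketch that models them locally by translating a URL pattern into a regular expression. This is not FetchFox's implementation. The function names pattern_to_regex and matches are hypothetical, and the sketch assumes each wildcard can match zero or more characters, which the table above does not settle.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL pattern into a compiled regular expression.

    Assumption: both wildcards may match zero or more characters.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith(":*", i):
            parts.append("[^/]*")  # ":*" stays inside one path segment
            i += 2
        elif pattern[i] == "*":
            parts.append(".*")  # "*" may cross "/" boundaries
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # everything else is literal
            i += 1
    return re.compile("".join(parts))


def matches(pattern: str, url: str) -> bool:
    """Return True when the whole URL matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(url) is not None


# Rows from the table above: (pattern, matching URL, non-matching URL)
rows = [
    ("https://example.com/*", "https://example.com/c/page-1", "https://othersite.com/page"),
    ("https://example.com/a/*", "https://example.com/a/b/page", "https://example.com/c/page-1"),
    ("https://example.com/a/:*", "https://example.com/a/page", "https://example.com/a/b/page"),
    ("https://example.com/a/page-1", "https://example.com/a/page-1", "https://example.com/a/page-2"),
]
for pattern, hit, miss in rows:
    print(pattern, matches(pattern, hit), matches(pattern, miss))
```

Under these assumptions, each row prints True for the "Matches" column and False for the "Does not match" column.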

To run a simple crawl, pass the URL pattern as the pattern parameter to the crawl endpoint, as in the example call below.

curl -X POST https://api.fetchfox.ai/api/crawl \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
    "pattern":"https://pokemondb.net/pokedex/*",
    "max_visits": 50
}'

The response for this call will look something like this:

{
  "job_id": "5ooygvit1y",
  "results": {
    "hits": [
      "https://pokemondb.net/pokedex/all",
      "https://pokemondb.net/pokedex/archaludon",
      "https://pokemondb.net/pokedex/charizard",
      "https://pokemondb.net/pokedex/corviknight",
      "https://pokemondb.net/pokedex/dipplin",
      "https://pokemondb.net/pokedex/dragapult",
      "https://pokemondb.net/pokedex/dragonite",
      "https://pokemondb.net/pokedex/eevee",
      "https://pokemondb.net/pokedex/game/legends-arceus",
      "https://pokemondb.net/pokedex/game/scarlet-violet",
      ...more results...
    ]
  },
  "metrics": { ...cost and usage metrics ... }
}

The results.hits array lists the matching URLs.
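
If you would rather make the same call from a script, the following is a rough Python equivalent of the curl example, using only the standard library. The endpoint URL, headers, and body fields come from the example above. The helper names build_crawl_request and crawl are hypothetical, and the sketch assumes the response always has the results.hits shape shown above, with no error handling.

```python
import json
import urllib.request

CRAWL_URL = "https://api.fetchfox.ai/api/crawl"


def build_crawl_request(pattern: str, max_visits: int, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"pattern": pattern, "max_visits": max_visits}).encode("utf-8")
    return urllib.request.Request(
        CRAWL_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def crawl(pattern: str, max_visits: int, api_key: str) -> list:
    """Send the request and return results.hits (assumed response shape)."""
    req = build_crawl_request(pattern, max_visits, api_key)
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return payload["results"]["hits"]


if __name__ == "__main__":
    # Replace YOUR_API_KEY with a real key before running.
    for url in crawl("https://pokemondb.net/pokedex/*", 50, "YOUR_API_KEY"):
        print(url)
```

Keeping the request construction separate from the network call makes it easy to inspect the body and headers before anything is sent.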

In the example above, you’ll notice some unwanted URLs, like https://pokemondb.net/pokedex/game/legends-arceus. We want only URLs for individual Pokemon, not games. This is a good time to use the :* operator, which does not match slashes. The updated call looks like this:

curl -X POST https://api.fetchfox.ai/api/crawl \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
    "pattern":"https://pokemondb.net/pokedex/:*",
    "max_visits": 50
}'

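This excludes the unwanted URLs: because :* does not match /, paths with an extra segment, such as /pokedex/game/legends-arceus, no longer match the pattern.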
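
To see why the game URLs drop out, the sketch below applies the documented :* rule to the sample hits from the earlier response. It is only a local illustration, since the real matching happens inside the crawl endpoint. It models :* as the regular expression [^/]*, which assumes the wildcard may match zero or more characters.

```python
import re

# Sample hits copied from the example response earlier in this section.
hits = [
    "https://pokemondb.net/pokedex/all",
    "https://pokemondb.net/pokedex/archaludon",
    "https://pokemondb.net/pokedex/charizard",
    "https://pokemondb.net/pokedex/corviknight",
    "https://pokemondb.net/pokedex/dipplin",
    "https://pokemondb.net/pokedex/dragapult",
    "https://pokemondb.net/pokedex/dragonite",
    "https://pokemondb.net/pokedex/eevee",
    "https://pokemondb.net/pokedex/game/legends-arceus",
    "https://pokemondb.net/pokedex/game/scarlet-violet",
]

# Local model of "https://pokemondb.net/pokedex/:*": one path segment, no "/".
single_segment = re.compile(r"https://pokemondb\.net/pokedex/[^/]*")

kept = [url for url in hits if single_segment.fullmatch(url)]
dropped = [url for url in hits if not single_segment.fullmatch(url)]

print("kept:", len(kept))
print("dropped:", dropped)
```

Only the two /pokedex/game/... URLs end up in dropped, because each contains a slash after /pokedex/.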
