Crawl using URL patterns
The crawl endpoint finds URLs matching a pattern
FetchFox’s crawl endpoint takes a URL pattern as input, and it returns a list of URLs matching that pattern. It does this by repeatedly visiting pages on the target domain, finding new URLs to visit, and returning the ones that match. The endpoint has a variety of parameters to control how the crawl is executed.
Using URL patterns
A basic crawl takes just the pattern
parameter as input, which is a URL pattern.
A URL pattern is a string that may contain some *
and :*
operators. Both of these are wildcards, but they operate in slightly different ways.
*
matches any character:*
matches any character except/
URL patterns must be valid URLs with a domain. The domain may not contain wildcards.
Below are a few examples of URL patterns and what they match.
To execute a simple crawl with a URL pattern, pass it in as a parameter to the crawl endpoint. This is shown in the example call below.
The response for this call will look something like this:
The results.hits
section contains all the matching URLs.
In the example above, you’ll notice some unwanted URLs, like https://pokemondb.net/pokedex/game/legends-arceus
. We were looking only for URLs matching Pokemon characters, not games. This is a good time to use the :*
operator, which does not match slashes. Our updated call looks like this:
This will exclude the unwanted URLs.