Scraping with FetchFox has two phases: crawl and extract
There are two phases to scraping with FetchFox. First, you crawl for URLs that match a URL pattern. Then, you extract data from the URLs that matched your pattern.
Let’s walk through an example, step-by-step.
You find URLs that match a pattern using FetchFox's crawl endpoint. This endpoint has one required parameter: pattern. This is a URL that may contain * characters, which act as wildcards. The crawl endpoint will find and return URLs that match that pattern.
Some examples of URL patterns:
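- https://pokemondb.net/pokedex/* matches every URL that starts with https://pokemondb.net/pokedex/
- https://pokemondb.net/pokedex/:* matches the same URLs, except that :* stops at a forward slash (more on this operator below)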
The pattern field is the only required parameter for a crawl.
Below is an example of how to call the crawl endpoint.
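Here is a minimal sketch in Python using the requests library. The API base URL, endpoint path, and authentication header below are assumptions for illustration; check the FetchFox API reference for the exact values.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

# Assumed endpoint path and auth scheme; adjust to match the real API.
resp = requests.post(
    "https://api.fetchfox.ai/crawl",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "pattern": "https://pokemondb.net/pokedex/*",
        "max_visits": 100,  # illustrative cap on pages visited
    },
)
resp.raise_for_status()
data = resp.json()
```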
This call will take about a minute, and it finds all URLs starting with https://pokemondb.net/pokedex/. The goal is to get the URLs of all the Pokemon. The max_visits parameter puts a cap on how many pages FetchFox visits before it stops crawling.
The response will look something like this:
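The exact envelope may differ, but given the .results.hits field described next, the shape is roughly:

```json
{
  "results": {
    "hits": [
      "https://pokemondb.net/pokedex/bulbasaur",
      "https://pokemondb.net/pokedex/ivysaur",
      "https://pokemondb.net/pokedex/venusaur",
      "https://pokemondb.net/pokedex/game/legends-arceus"
    ]
  }
}
```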
The .results.hits section contains all the URLs matching the URL pattern https://pokemondb.net/pokedex/*.
You might have noticed a few unwanted URLs in the hits, like https://pokemondb.net/pokedex/game/legends-arceus. We can update our pattern to exclude these using the :* operator, which matches all characters except forward slash.
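With that change, the pattern becomes https://pokemondb.net/pokedex/:*. It still matches pages like https://pokemondb.net/pokedex/bulbasaur, but no longer matches https://pokemondb.net/pokedex/game/legends-arceus, since the remainder game/legends-arceus contains a forward slash.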
The extract endpoint takes URLs and converts them into structured data. It has two required parameters: a list of urls specifying the pages to visit, and an item template which describes the output format.
The template parameter can be either a string or a dictionary. If it is a string, FetchFox will use that string and a sampling of the URLs you pass in to determine the item schema. If it is a dictionary, the output items will always have keys matching the dictionary you provided.
In either case, FetchFox will create a JSON schema specification from the template parameter. All items will follow the same JSON schema, and the response from the endpoint will include the schema.
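As a sketch, a string template can be a one-line description, while a dictionary template spells out each field. The field names and descriptions below are our own, not prescribed by FetchFox.

```python
# String template: FetchFox infers the item schema from this description
# plus a sampling of the URLs you pass in.
template = "The Pokemon's name, number, and types"

# Dictionary template: output items will always have exactly these keys.
template = {
    "name": "The name of the Pokemon",
    "number": "The National Pokedex number",
    "types": "The Pokemon's type(s), comma separated",
}
```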
Below is an example call to extract data about some Pokemon.
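A sketch in Python, reusing the assumed base URL and auth header from the crawl example, with the illustrative dictionary template from above:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

# Assumed endpoint path and auth scheme, as in the crawl example.
resp = requests.post(
    "https://api.fetchfox.ai/extract",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "urls": [
            "https://pokemondb.net/pokedex/bulbasaur",
            "https://pokemondb.net/pokedex/charmander",
        ],
        "template": {
            "name": "The name of the Pokemon",
            "number": "The National Pokedex number",
            "types": "The Pokemon's type(s), comma separated",
        },
    },
)
resp.raise_for_status()
print(resp.json())
```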
This should take under a minute to run, and the result will look something like this:
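The envelope below is illustrative; apart from the template fields and the _url and _htmlUrl fields discussed next, the layout is an assumption.

```json
{
  "results": {
    "items": [
      {
        "name": "Bulbasaur",
        "number": "1",
        "types": "Grass, Poison",
        "_url": "https://pokemondb.net/pokedex/bulbasaur",
        "_htmlUrl": "https://fetchfox-snapshots.example/bulbasaur.html"
      }
    ]
  }
}
```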
The output item matches our template and reflects the data from the source page. FetchFox also added two fields, _url and _htmlUrl. These give you the URL of the source data and a saved snapshot of the HTML that FetchFox saw, which are useful for debugging.
By default, the extract endpoint finds exactly one item per page. Sometimes you want to extract multiple items per page. To do this, set the per_page parameter to many.
Here is an example showing how to extract all the Pokemon from https://pokemondb.net/pokedex/national:
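A sketch under the same assumptions as before, now with per_page set to many:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

# Assumed endpoint path and auth scheme. per_page="many" asks FetchFox
# to extract multiple items from each page.
resp = requests.post(
    "https://api.fetchfox.ai/extract",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "urls": ["https://pokemondb.net/pokedex/national"],
        "template": {
            "name": "The name of the Pokemon",
            "number": "The National Pokedex number",
        },
        "per_page": "many",
    },
)
resp.raise_for_status()
print(resp.json())
```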
The response will look something like this:
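Illustrative shape only; the selector shown is a guess at what FetchFox might pick for this page, and the envelope apart from artifacts is an assumption.

```json
{
  "results": {
    "items": [
      { "name": "Bulbasaur", "number": "1", "_url": "https://pokemondb.net/pokedex/national" },
      { "name": "Ivysaur", "number": "2", "_url": "https://pokemondb.net/pokedex/national" }
    ]
  },
  "artifacts": {
    "selector": ".infocard"
  }
}
```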
The response artifacts section will include the CSS selector used to divide the single page into pieces. Each piece is used to generate one output item.
It’s a common pattern to do a crawl and feed the results into extract. To make this simpler, FetchFox provides a single endpoint to do both at once: the scrape endpoint.
The scrape endpoint has two required parameters: pattern, which is used in the crawl phase to find URLs, and template, which is used in the extract phase to output items.
Below is an example call to the scrape endpoint.
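A sketch under the same assumptions as the earlier calls:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

# Assumed endpoint path and auth scheme. The scrape endpoint runs the
# crawl phase (pattern) and the extract phase (template) in one request.
resp = requests.post(
    "https://api.fetchfox.ai/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "pattern": "https://pokemondb.net/pokedex/:*",  # :* does not match "/"
        "template": {
            "name": "The name of the Pokemon",
            "number": "The National Pokedex number",
        },
        "max_extracts": 10,  # cap extractions to control cost while testing
    },
)
resp.raise_for_status()
print(resp.json())
```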
Note a couple of points in the call:

- The pattern uses the :* wildcard, which does not match forward slash. If you do want to match forward slash, use * instead.
- max_extracts is set to 10. This limits the number of extractions, which is important to control costs during tests.

This call should take a minute or two to run, and the output will be similar to the extract call from before.