Crawler documentation

The purpose of this document is to give you deeper insight into how the SEO4Ajax service works. It details the behavior of the system and will notably help you better understand how the API and advanced options work.

Overview

The crawler is the main component of the service and can be compared to a conductor: it takes care of capturing all pages of a site and keeping these captures up to date. The page capturing process is delegated to Chrome. To let the crawler handle captures in parallel, multiple headless browsers run in a pool that can be scaled up or down on demand. The pages captured by the headless browsers are saved in a cache, which is used to send captures to bots as quickly as possible. Capture contents are also parsed by the crawler to find inner links. This analysis helps it discover and capture all the pages of the site.
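
To make this orchestration more concrete, here is a minimal sketch of the capture pipeline, assuming hypothetical helpers (browserPool, cache, extractInnerLinks); none of these names are part of the SEO4Ajax API.

    // Minimal sketch of the capture pipeline; all names are hypothetical.
    async function captureAndIndex(path, browserPool, cache) {
      const browser = await browserPool.acquire();   // one of the pooled headless Chrome instances
      try {
        const capture = await browser.render(path);  // delegate the rendering to Chrome
        await cache.save(path, capture);             // bots are served from this cache
        return extractInnerLinks(capture);           // parsed to discover the other pages of the site
      } finally {
        browserPool.release(browser);                // free the slot for the next capture
      }
    }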

How does the crawling process work?

Each time a path is processed, the crawler updates the state of the path in the database, ensuring the crawling process goes smoothly. Here are the possible states of a path (a sketch summarizing the lifecycle follows the list):

pending
When a page capture is requested through the contextual menu of the console, the public API, the authenticated API, or the crawler itself, the crawler updates the path state to "pending" and sets the capture priority (see below). Note that the state is updated only if the path is new or expired, and not ignored. The expiration period is defined via the "page expiration period" option, and the recapturing of expired pages can be disabled via the "capture expired pages" option in the site settings.
processing
Each time a slot becomes available in the headless browser pool, the crawler selects one of the paths in the "pending" state. The path state is set to "processing" and the available headless browser starts capturing the page. Pending and processing paths are displayed in the pendings view.
processed
When a page is correctly captured and saved in the cache, the crawler updates its state to "processed". It also saves the capture date in the database. Then, it searches the captured page for inner links to feed the crawling process. The crawling of inner links can be disabled via the "crawl links" option in the site settings. The paths of captured pages are displayed in the captures view.
error
If an error occurs while capturing a page, the crawler sets the path state to "error". The page will be recaptured once it expires, or when requested again through the console or the API. Paths with errors are displayed in the errors view.
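
The lifecycle described above can be summarized as a small state machine. This is only an illustration of the states and transitions listed in this section, not an actual SEO4Ajax data structure.

    // Illustrative summary of the path lifecycle; not an actual SEO4Ajax data structure.
    const TRANSITIONS = {
      pending:    ["processing"],          // picked up when a headless browser slot is free
      processing: ["processed", "error"],  // capture succeeded or failed
      processed:  ["pending"],             // recaptured once expired or when requested again
      error:      ["pending"],             // recaptured once expired or when requested again
    };

    function canTransition(from, to) {
      return (TRANSITIONS[from] || []).includes(to);
    }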

How is a capture analyzed?

First of all, the crawler searches for the <meta name="robots" content="nofollow"> tag. If it is found, none of the page's inner links will be captured. Otherwise, it looks for the following tags and considers the values of their href attributes as inner links (a sketch of this analysis follows the note below):

  • <a href="..."> (regular links),
  • <link rel="canonical" href="..."/> (canonical links),
  • <link rel="prev" href="..." > and <link rel="next" href="..." > (navigation links).
Caution

Note that the SEO4Ajax crawler does not click on any HTML element. It is therefore not able to find URLs in buttons, or in links that are only activated by JavaScript and have no href attribute.
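
As a rough illustration of this analysis, the sketch below extracts inner links from a captured document using standard DOM queries. The extractInnerLinks helper is hypothetical; the real crawler parses the saved capture itself.

    // Rough illustration of the link analysis; the helper name is hypothetical.
    function extractInnerLinks(document) {
      // <meta name="robots" content="nofollow"> disables link discovery for the whole page
      const robots = document.querySelector('meta[name="robots"]');
      if (robots && /nofollow/i.test(robots.getAttribute("content") || "")) {
        return [];
      }
      const selectors = [
        "a[href]",                      // regular links
        'link[rel="canonical"][href]',  // canonical links
        'link[rel="prev"][href]',       // navigation links
        'link[rel="next"][href]',
      ];
      return Array.from(document.querySelectorAll(selectors.join(",")))
        .map((element) => element.getAttribute("href"));
    }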

Additional rules are applied to determine whether inner links must be captured. A link path is ignored if it:

  • is already in the database and is not expired,
  • is external to the site URL,
  • matches one of the ignored URL fragments defined in the site settings,
  • is detected as invalid (i.e. 400 <= HTTP code <= 599) after applying rewrite rules defined in the site settings,
  • is detected as a redirect (i.e. 300 <= HTTP code <= 399) after applying rewrite rules defined in the site settings. In this case, the redirected path is analyzed through the same algorithm in order to determine if it must be captured or not.

When a path is considered valid, its state is set to "pending" in the database, as illustrated in the sketch below.
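
The decision logic can be sketched as follows, assuming hypothetical helpers (isExpired, isSameOrigin, applyRewriteRules) and a simplified site settings object; this is not the actual implementation.

    // Sketch of the link filtering rules; all helpers and the settings object are hypothetical.
    async function shouldCapture(path, site, db) {
      const known = await db.find(path);
      if (known && !isExpired(known, site.pageExpirationPeriod)) return false;      // already captured and fresh
      if (!isSameOrigin(path, site.url)) return false;                              // external to the site URL
      if (site.ignoredUrlFragments.some((fragment) => path.includes(fragment))) return false;
      const { status, location } = await applyRewriteRules(path, site.rewriteRules);
      if (status >= 400 && status <= 599) return false;                             // invalid page
      if (status >= 300 && status <= 399) return shouldCapture(location, site, db); // follow the redirect
      return true;  // the path state will be set to "pending"
    }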

How does the crawler select paths to capture?

When the crawler selects the next path to capture, it searches in the database for a path that:

  • is in "pending" state and,
  • has the highest priority and,
  • is the oldest one.

The priority of a path is determined by the following rules:

  • priority 3: the path is new and the capture request comes from the public API
  • priority 2: the capture request comes from the console
  • priority 1: the path is new and has been found in an inner link during the crawling process
  • priority 0: the page has already been captured and is expired

The priority of a path can also be explicitly set through the authenticated API.
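
In other words, the crawler picks the oldest request among the highest-priority pending paths. Below is a minimal sketch of this selection, using a hypothetical in-memory list of paths.

    // Hypothetical illustration of path selection: highest priority first, then oldest request.
    function selectNextPath(paths) {
      return paths
        .filter((path) => path.state === "pending")
        .sort((a, b) =>
          b.priority !== a.priority
            ? b.priority - a.priority          // 3: new path via public API ... 0: expired page
            : a.requestedAt - b.requestedAt    // oldest request wins on equal priority
        )[0];
    }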

What are rewrite rules?

Rewrite rules allow you to rewrite a path just before its state is set to "pending" in the database (i.e. before capturing new or expired paths) or before replying to bots. They serve multiple purposes, such as removing query parameters, applying HTTP redirections, or returning 404s for invalid pages. The syntax of rewrite rules is based on regular expressions and options, similar to the format of Apache configuration files. More information about the rewrite rules syntax can be found here.
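
As a conceptual illustration of what rewrite rules can express (the actual SEO4Ajax syntax is Apache-like and documented separately), the sketch below mimics the three typical purposes with plain JavaScript regular expressions; the rule format shown here is hypothetical.

    // Conceptual illustration only; the real rules use an Apache-like configuration syntax.
    const rules = [
      { pattern: /\?utm_[^#]*/, replace: "" },                                        // remove query parameters
      { pattern: /^\/old-section\/(.*)$/, redirect: "/new-section/$1", status: 301 }, // HTTP redirection
      { pattern: /^\/tmp\//, status: 404 },                                           // invalid pages
    ];

    function applyRewriteRules(path) {
      for (const rule of rules) {
        if (!rule.pattern.test(path)) continue;
        if (rule.redirect) {
          return { status: rule.status, location: path.replace(rule.pattern, rule.redirect) };
        }
        if (rule.status) {
          return { status: rule.status };
        }
        path = path.replace(rule.pattern, rule.replace);
      }
      return { status: 200, path };
    }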

How does the crawler know when a page is ready to be captured?

To capture a page successfully, the crawler needs to know when the rendering of the page content has finished. To do so, it monitors network requests and responses to detect when the page is ready to be captured. When there is no more network activity (i.e. all responses have been received), it gives JavaScript 3 seconds (by default) to render the page. This period can be changed via the "JavaScript timeout" option in the site settings. If a request is sent during this period, the timer is cancelled and the procedure is repeated until the 3 seconds elapse without interruption. When the timer finally expires, the page is captured. The capture process can be controlled more precisely by calling the window.onCaptureReady callback in the web application. This technique is the most efficient one if the application knows exactly when the page is rendered. More information on how to use this feature can be found here.
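
For example, an application that knows exactly when its rendering is finished could signal readiness as follows; fetchData and renderPage are placeholders for the application's own logic.

    // Placeholders for the application's own data loading and rendering.
    const fetchData = () => Promise.resolve({});
    const renderPage = (data) => { /* render the application views */ };

    // Signal readiness explicitly instead of relying on the network-idle timeout.
    fetchData()
      .then((data) => renderPage(data))
      .then(() => {
        if (typeof window.onCaptureReady === "function") {
          window.onCaptureReady();  // the callback may not exist outside the capture environment
        }
      });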

How to configure the crawler behavior?

The crawler continuously looks for site updates. Every time the public API receives a GET request from a bot, it sends back the capture and also verifies that the capture is not expired. If it is expired, the crawler automatically sets its state to "pending" in order to recapture it. Then, if the capture contains new or expired inner links, they are captured as well, and so on. For more control over this default behavior, three options can be independently disabled in the site settings in order to prevent the crawler from:

  • capture new pages,
  • capture expired pages,
  • crawl inner links.
More information about these options can be found here.
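
These options can be pictured as simple switches consulted by the crawler; the settings object and helpers below are hypothetical and only mirror the three options named above.

    // Hypothetical view of the three options; real values are toggled in the site settings.
    const siteSettings = {
      captureNewPages: true,      // "capture new pages"
      captureExpiredPages: true,  // "capture expired pages"
      crawlLinks: true,           // "crawl links"
    };

    // Hypothetical expiration check based on the capture date and the "page expiration period".
    const isExpired = (knownPath) => Date.now() - knownPath.capturedAt > knownPath.expirationPeriod;

    function mayRecapture(knownPath, settings) {
      if (!knownPath) return settings.captureNewPages;               // path never seen before
      if (isExpired(knownPath)) return settings.captureExpiredPages; // capture has expired
      return false;                                                  // still fresh: served from the cache
    }

    function mayFollowInnerLinks(settings) {
      return settings.crawlLinks;
    }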

How to add or remove paths in the database?

There are multiple options to add new paths (the first two are sketched after this list):

  • send a GET request to the public API; the page will not be recaptured if it already exists and is not expired
  • send a POST request to the authenticated API
  • manually add paths through the contextual menu of the console
  • set the URL(s) of sitemap(s) in the site settings
  • drag and drop a sitemap file in the status view of the console (max. sitemap size is 500 URLs)
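
As an illustration of the first two options, the requests could look like the sketch below. The URLs, token, header, and body format are placeholders rather than the documented endpoints; refer to the API documentation for the exact forms.

    // Placeholder endpoints, token, and payload; see the API documentation for the real formats.
    const SITE_TOKEN = "YOUR_SITE_TOKEN";
    const API_KEY = "YOUR_API_KEY";

    // Public API: a GET request returns the capture and enqueues the path if it is new or expired.
    fetch(`https://public-api.example.com/${SITE_TOKEN}/some/page`);

    // Authenticated API: a POST request explicitly adds a path and can set its capture priority.
    fetch("https://authenticated-api.example.com/paths", {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: `Bearer ${API_KEY}` },
      body: JSON.stringify({ path: "/some/page", priority: 3 }),
    });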

To remove paths you can either:
