Pages not being crawled (problem solving)

There are pages/directories that are NOT being crawled.

  1. I checked the crawl settings, and the pages SHOULD be included via the pattern.
  2. It is NOT an http vs. https issue.
  3. The pages/directories not being crawled ARE linked from other pages that are IN the index (as confirmed via the stored.log file).
  4. There are no robots.txt files in the directory (or parent directories) directing search engines to not crawl.
  5. There are no meta tags or directives in the files themselves with “no index” commands.

Is there a Funnelback log file I can check to see WHY a particular page is NOT in the index? Or, to see what links are followed and NOT followed on a page that is in the index that I KNOW links to the page not being picked up?

Hi Jim,

For #1, please confirm by trying the page(s) that are not being crawled in the following APIs. The API UI can be accessed via the “View API UI” option in the “System” menu of the Funnelback Administration dashboard.

GET /collection-info/v1/collections/{collection}/url

This API will test the include/exclude patterns and check whether the URL is in the index, including redirects.

GET /crawler/v1/debug/collections/{collection}/http-request

The debug API will check if the crawler has any issues reaching the provided URL, including redirects.
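Both checks can be scripted from the command line. The sketch below is a dry run that only prints the requests; the server host, collection ID, page URL, and the `url` query-parameter name are all placeholders/assumptions, and real calls will also need authentication (e.g. an admin API token):

```shell
#!/bin/sh
# Placeholders -- substitute your own server, collection, and page URL.
FB_SERVER="https://funnelback.example.com"        # assumed admin host
COLLECTION="my-collection"                        # assumed collection id
PAGE="https://www.example.com/missing-page.html"  # the page not being crawled

# Minimal URL-encoding of the page address for use as a query parameter
# (only handles ':' and '/'; a real script should encode fully).
ENCODED=$(printf '%s' "$PAGE" | sed -e 's|:|%3A|g' -e 's|/|%2F|g')

# Dry run: print the two requests. Remove the leading 'echo' (and add
# authentication headers) to actually send them. The 'url' parameter
# name is an assumption -- check the API UI for the exact signature.
echo curl "$FB_SERVER/collection-info/v1/collections/$COLLECTION/url?url=$ENCODED"
echo curl "$FB_SERVER/crawler/v1/debug/collections/$COLLECTION/http-request?url=$ENCODED"
```

Running the real requests from the API UI is equivalent; the script form is just easier to repeat across a list of missing pages.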

For #3, if you visit the linking pages in your browser with Javascript disabled, are the links to the uncrawled pages still present (i.e. are the links created by Javascript)?

Do the links to the pages that you’d like to be crawled have the rel="nofollow" attribute?
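A quick way to check both points is to look at the raw HTML, since that is all the crawler sees (it does not execute Javascript). A minimal sketch, with an inline sample standing in for the real page source (in practice you would use `PAGE_HTML=$(curl -s "$PAGE")`):

```shell
#!/bin/sh
# Stand-in for the fetched page source; the hrefs are made up.
PAGE_HTML='<a href="/docs/hidden.html" rel="nofollow">Hidden</a>
<a href="/docs/visible.html">Visible</a>'

# List links the crawler will be told not to follow.
printf '%s\n' "$PAGE_HTML" | grep 'rel="nofollow"'
```

If the link to the missing page does not appear in the raw HTML at all, it is being generated client-side; if it appears with `rel="nofollow"`, the crawler is being told to skip it.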

For your last question, the logs that may have the answer are crawl.log.*.gz, where the * is a number. There may be multiple of these logs if multiple servers are being crawled. If the Funnelback crawler encountered the link, there will be an entry with some sort of message – perhaps a non-200 HTTP response code was returned, or the URL didn’t match the include/exclude patterns. If there’s no mention of the URL at all, then the link was never made available to the crawler (it may have been generated by Javascript).
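Since the logs are gzip-compressed, `zgrep` lets you search them without unpacking. The fixture below stands in for a real crawl.log.0.gz, and the log-line wording is illustrative only; the point is simply whether the URL appears at all:

```shell
#!/bin/sh
# Build a small fixture standing in for a real crawl.log.0.gz
# (the real ones live under the collection's log directory,
# and the message format here is made up for illustration).
printf '%s\n' \
  'Rejected: https://www.example.com/private/page.html [does not match include patterns]' \
  'Stored: https://www.example.com/docs/index.html [200 OK]' \
  | gzip > crawl.log.0.gz

# Search every crawl log for the missing page; no output at all means
# the crawler never saw a link to it (possibly Javascript-generated).
zgrep 'private/page.html' crawl.log.*.gz
```

A hit with a rejection message points at the include/exclude patterns or an HTTP error; silence points back at the link itself.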

Another, less likely possibility is that the crawl timed out (due to the configured time limit, default 24 hours) and those page(s) were still in the frontier when the crawl finished. Those would show in the collection’s frontier.log.
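Checking the frontier is the same kind of search: if the URL appears in frontier.log, the crawler knew about the page but ran out of time before fetching it. Again, the file content below is a made-up stand-in for the real log:

```shell
#!/bin/sh
# Made-up stand-in for the collection's frontier.log -- URLs that were
# queued but not yet fetched when the crawl ended.
printf '%s\n' \
  'https://www.example.com/private/page.html' \
  'https://www.example.com/docs/deep/leaf.html' \
  > frontier.log

# A hit here means the page was queued but never crawled before the
# time limit expired -- raising the limit (or the crawl rate) may help.
grep 'private/page.html' frontier.log
```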