Hi Jim,
For #1, please confirm by trying the page(s) that are not being crawled with the following APIs. The API UI can be accessed via the "View API UI" option in the "System" menu of the Funnelback Administration dashboard.
GET /collection-info/v1/collections/{collection}/url
This API will test the include/exclude patterns and check whether the URL is in the index, including redirects.
GET /crawler/v1/debug/collections/{collection}/http-request
The debug API will check if the crawler has any issues reaching the provided URL, including redirects.
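If it's easier to script than to use the API UI, here is a minimal sketch of how those two requests could be built. The base URL, the `url` query-parameter name, and the collection name are assumptions for illustration -- check the API UI for the exact parameters your Funnelback version expects.

```python
from urllib.parse import quote, urlencode

# Hypothetical server address -- replace with your Funnelback admin host.
BASE = "https://funnelback.example.com"

def url_check_request(collection: str, target_url: str) -> str:
    """Build the collection-info request that tests include/exclude
    patterns and whether the URL is in the index."""
    path = f"/collection-info/v1/collections/{quote(collection)}/url"
    return f"{BASE}{path}?{urlencode({'url': target_url})}"

def http_request_debug(collection: str, target_url: str) -> str:
    """Build the crawler debug request that checks whether the crawler
    can reach the URL, including redirects."""
    path = f"/crawler/v1/debug/collections/{quote(collection)}/http-request"
    return f"{BASE}{path}?{urlencode({'url': target_url})}"

print(url_check_request("example-web", "https://www.example.com/missing-page"))
print(http_request_debug("example-web", "https://www.example.com/missing-page"))
```

You could then issue the built URLs with curl or a browser while logged in to the admin dashboard.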
For #3, are the pages that are not being crawled linked from pages that are being crawled when you visit those pages in your browser with JavaScript disabled (i.e. are the links created by JavaScript)?
Do the links to the pages that you'd like to be crawled have the rel="nofollow" attribute?
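One quick way to check for nofollow links, if viewing the page source by hand is tedious, is a small parser like the sketch below (the sample HTML is made up for illustration):

```python
from html.parser import HTMLParser

class NofollowLinkFinder(HTMLParser):
    """Collect hrefs of <a> tags that carry rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel can hold multiple space-separated tokens, e.g. "noopener nofollow".
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens and "href" in attrs:
            self.nofollow_links.append(attrs["href"])

# Example: a crawler honouring nofollow would skip the second link.
html = '<a href="/ok">ok</a> <a rel="nofollow" href="/hidden">hidden</a>'
finder = NofollowLinkFinder()
finder.feed(html)
print(finder.nofollow_links)  # ['/hidden']
```

Run it against the raw page source (fetched without JavaScript) so you see the same links the crawler sees.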
For the last question, the logs that may have the answer are crawl.log.*.gz, where the * is a number. There may be multiple of these logs if multiple servers are being crawled. If the Funnelback crawler encountered the link, there will be an entry with some sort of message -- perhaps a non-200 HTTP response code was returned, or the URL didn't match the include/exclude patterns. If there's no mention of the URL at all, then the link was not made available to the crawler (it may have been generated by JavaScript).
Another possibility, though unlikely, is that the crawl timed out (due to the configured time limit, 24 hours by default) and those page(s) were still in the frontier when the crawl finished. Those would show in the collection's frontier.log.