Noindex option in Funnelback

Hi,

I have some questions about using the noindex option in Funnelback (PaDRE (Parallel Document Retrieval Engine) - Funnelback Documentation - Version 15.16.0)

I am looking at ways to speed up the crawling and re-indexing of our site, because currently it is quite slow and an incremental update can take several days.

  1. If I were to use the noindex option to exclude the primary navigation, header and footer content, would that help in speeding up the crawling process as there is “technically” less for it to crawl?

  2. For people who do exclude those primary menu links, how do you go about indexing the pages linked from the primary navigation? Do you create a “dummy” orphan page with links to those pages and get Funnelback to crawl that, or do you add those individual URLs to the list of “included” things to crawl?

Thanks.

Funnelback noindex comment tags won’t make any difference to the time taken to crawl - they just hide areas of the page from what the indexer considers to be page content. They also don’t stop the crawler from following links contained within the noindex area.
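For reference, here is a sketch of how the noindex comment tags are typically wrapped around navigation markup (the nav links are illustrative, not from your site):

```html
<!-- Content between the Funnelback noindex comment tags is excluded
     from what the indexer treats as page content, but the crawler
     will still follow the links inside. -->
<!--noindex-->
<nav>
  <a href="/products">Products</a>
  <a href="/support">Support</a>
</nav>
<!--endnoindex-->
```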

Robots.txt can make a difference as this can prevent the crawler from accessing parts of the site that shouldn’t be included in the index (but please note that this only affects include/exclude rules - the crawler doesn’t pay any attention to other directives about visit frequency).
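A minimal robots.txt sketch along those lines (the disallowed paths are hypothetical examples):

```
# Funnelback honours the include/exclude rules (Disallow) below,
# but ignores directives about visit frequency such as Crawl-delay.
User-agent: *
Disallow: /search/
Disallow: /print-versions/
```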

Similarly, use of page-level robots directives (noindex/nofollow etc.) can help too.
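These are the standard robots meta directives, placed in the page head:

```html
<!-- noindex asks that this page not be added to the index;
     nofollow asks that links on this page not be followed. -->
<meta name="robots" content="noindex, nofollow">
```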

You can also adjust some other crawler settings to increase crawl speed or change the frequency of page visits.

  1. Use include/exclude rules to ensure you only crawl the parts of the site you are interested in
  2. Revisit policies can be applied globally or per site to change how often Funnelback will visit certain sites
    Site profiles - Funnelback Documentation - Version 15.16.0
    crawler.classes.RevisitPolicy (collection.cfg setting) - Funnelback Documentation - Version 15.16.0
  3. Site profiles can also be used to add additional crawler threads (the maximum parallel requests setting). This can dramatically speed things up if your web server can handle the additional load, and the Funnelback server is appropriately resourced.
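Pulling the three points above together, a hedged collection.cfg sketch (option names should be checked against the 15.16 documentation for your install; the patterns, class name and thread count are purely illustrative):

```
# 1. Restrict the crawl to the parts of the site you care about
include_patterns=example.com/
exclude_patterns=/print/,/calendar/

# 2. Revisit policy class controlling how often pages are re-fetched
#    (illustrative value - see the crawler.classes.RevisitPolicy docs)
crawler.classes.RevisitPolicy=com.funnelback.crawler.revisit.AlwaysRevisitPolicy

# 3. Parallel crawler threads - raise this only if the web server
#    and the Funnelback server can handle the extra load
crawler.num_crawlers=10
```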