I am encountering a collection where the crawler is stopping at exactly 10,000 pages.
What kinds of settings would prevent the crawler from continuing? I believe the collection is licensed for 25k.
The only setting I can think of is:
crawler.overall_crawl_timeout
This is currently set to 240 minutes, but I would not have thought it would consistently cut the crawl phase off at a clean 10k.
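For reference, the relevant line in our collection.cfg looks roughly like this (I'm assuming the value is interpreted in minutes):

```
# Overall crawl timeout, currently 240 (minutes)
crawler.overall_crawl_timeout=240
```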
It could be a few things:

- Check the server licence and confirm the size (you mentioned you think it’s 25k).
- Check collection.cfg for the following settings:
  - crawler.max_files_per_server
  - crawler.max_files_stored
- A -maxdocs option set as an indexer_option.
- You could be hitting the crawler trap limiter and may need to increase crawler.max_files_per_area, which defaults to 10,000 documents.
- A max files limit set per site in site_profiles.cfg.

There's a rough sketch of how these look in collection.cfg below.
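The key names above are the real settings; the values (and the exact -maxdocs syntax) are placeholders you would adjust to your licence, so check them against your version's documentation:

```
# Hypothetical collection.cfg excerpt -- values are examples, not recommendations
# Per-server cap on gathered files
crawler.max_files_per_server=25000
# Overall cap on files stored by the crawl
crawler.max_files_stored=25000
# Crawler trap limiter (defaults to 10000)
crawler.max_files_per_area=25000
# If a -maxdocs indexer option is in play it would appear on the indexer_options
# line, e.g. indexer_options=-maxdocs25000 (check the exact syntax for your version)
```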
My gut feeling is that this is the reason. This collection is crawled from a seed URL containing a generated list of assets, and that list is over 10k entries. I will up this limit.
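For anyone hitting the same thing, the change I'm planning in collection.cfg is roughly this (25000 is just a value comfortably above our seed list size; adjust to your own licence):

```
# Raise the per-area limit above the size of the generated seed list
crawler.max_files_per_area=25000
```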
Cheers Peter 