How long does it take to crawl your web collections?

I’m seeing Funnelback take up to 48 hours to index a straight-ahead HTML collection of about 15,000 pages. This seems like a LONG time. What times are other folks seeing?

And is there a single log file to check to see exactly how long it took from kick-off to completion?

Hi Jim,

Standard disclaimer: every collection is different, with different HTML page sizes and server response times.

That being said, 48 hours for 15k documents is too long in my opinion.

To identify the bottleneck, I would recommend checking out the Content Auditor, which is located in the Funnelback Marketing dashboard. If the web collection is part of a meta collection, view the documents through the meta collection; if the web collection is stand-alone, view the documents directly on that collection.

In the Content Auditor, you could look for two factors that can have an impact on crawl time:

  • Response time: This shows how long the Funnelback web crawler had to wait for the server to send each document (the curl check after this list is a quick way to measure this yourself)
  • Document size: If there are a lot of large documents (say PDFs), then a long crawl time would be expected
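
As a quick cross-check on the response time factor, you can time a few representative pages with curl, ideally run from the Funnelback server itself. This is a rough sketch; the URL below is a placeholder, so substitute real pages from your collection:

```
# Time a sample page (placeholder URL; use real pages from the collection).
# time_starttransfer is roughly the server's time-to-first-byte,
# time_total is the full download time, size_download is the bytes fetched.
curl -s -o /dev/null \
  -w 'first byte: %{time_starttransfer}s  total: %{time_total}s  size: %{size_download} bytes\n' \
  https://www.example.com/some-page.html
```

If the first-byte times are consistently high, the bottleneck is likely the web server being crawled rather than anything on the Funnelback side.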

There are some configuration options that could be used to speed things up, but the right choices depend on the environment Funnelback is running in, and it would be advisable to identify the source of the slowness before fiddling with settings.

As for a log file to check how long the update took, crawl.log is probably the best: it has the ‘Started At’ time near the top and the ‘Finished At’ time near the bottom. There is a log reference article here that you may find useful.
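
If you would rather pull those timestamps out on the command line, a grep along these lines does the job. This is a sketch that assumes a typical install where the most recent crawl log sits under the collection’s offline view; adjust $SEARCH_HOME and the collection name for your environment:

```
# Assumed log location; adjust $SEARCH_HOME and the collection name
# for your install.
grep -E 'Started At|Finished At' \
  "$SEARCH_HOME/data/your-collection/offline/log/crawl.log"
```
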
Rather than the log, you may prefer a graphical representation of the update times. This can be found in the Administration dashboard for the particular collection → “Analyse” tab → “View Collection Update History” button.

It is also worth having a look at the monitor log graphs (with the collection’s live and offline logs): these will give you more of an idea of where the time is being spent during the update.

There are a number of crawl optimisation options (e.g. crawling with concurrent threads, or reducing the sleep between requests; see the sketch below), but you want to work out where the bottleneck is first.
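
For reference, these are the kinds of collection.cfg settings involved. Treat this as a sketch only: option names and defaults can vary between Funnelback versions, so verify against the documentation for your release before changing anything.

```
# collection.cfg (sketch; verify option names/defaults for your version)

# Number of concurrent crawler threads. More threads speed up the
# crawl but put more load on the target web server.
crawler.num_crawlers=10

# Sleep in milliseconds between successive requests. Lowering it
# reduces the per-page wait, at the cost of hitting the server harder.
crawler.request_delay=100
```

Whatever you change, re-run the update and compare the Started At / Finished At times in crawl.log to confirm it actually helped.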

Thanks, all. Very helpful advice that I actually needed to use to problem-solve today.