Cannot Limit Twitter Collection Crawler

marknemm · November 5, 2019, 6:16pm

I’m currently using Funnelback v15.20.0

I’m trying to limit the number of documents (tweets, in this case) that the crawler retrieves for a Twitter collection. I’ve looked through the documentation for crawler configuration options that can be set in collection.cfg under my Twitter collection directory, and I’ve found 3 options that don’t seem to have any effect when crawling Twitter:

crawler.max_files_stored
crawler.max_files_per_area
crawler.overall_crawl_timeout / crawler.overall_crawl_units

I’ve tried each of these individually and together in different permutations. The Twitter collection does not seem to honor any of these config options.

As a dirty workaround, I’m using a Custom Filter to keep track of the number of documents stored per account (the account is extracted from the document URI in the filter). This works, but it is extra code that we have to maintain, and it is wasteful because the crawler still gathers all of the (1000+) documents from an account. We simply discard the majority of them in the filter so that they are not indexed.
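
For illustration, here is a minimal sketch of the per-account counting logic described above. It is plain Python and does not use Funnelback's filter API. The URI shape (`https://twitter.com/<account>/status/<id>`), the `PerAccountLimiter` class, and its method names are assumptions made for the example, not the actual filter code.

```python
from collections import defaultdict
from urllib.parse import urlparse


class PerAccountLimiter:
    """Hypothetical helper: keeps at most `limit` documents per account."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)  # account name -> documents kept so far

    def account_of(self, uri):
        # Assumed URI shape: https://twitter.com/<account>/status/<id>
        path = urlparse(uri).path.strip("/")
        return path.split("/")[0].lower() if path else ""

    def should_keep(self, uri):
        # Keep the document only while the account is under its limit.
        account = self.account_of(uri)
        if self.counts[account] >= self.limit:
            return False
        self.counts[account] += 1
        return True
```

A filter built on this idea would call `should_keep(uri)` once per gathered document and drop the document when it returns `False`. The documents are still gathered, so this only reduces what is indexed.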

Is there any built-in way to limit the number of documents crawled per Twitter account (as crawler.max_files_per_area should do)?

plevan · November 6, 2019, 8:58pm

The crawler.* settings you mention only apply to web collections.

Twitter collections currently don’t provide any options for limiting the gather process. I will raise a product improvement ticket.

dmikulis · November 6, 2019, 9:20pm

A different workaround is to limit the results to a timeframe (e.g. tweets from the last year or the last 18 months).

A filter for that is included in the public Stencils repository here:

This has the same flaw: those tweets are still gathered and are only dropped before being indexed.
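
As a rough sketch of the timeframe check (this is not the Stencils filter itself, and it does not use any Funnelback API), the decision comes down to comparing a tweet's creation time against a cutoff. The function name `is_within_timeframe` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_within_timeframe(created_at, now, max_age_days=365):
    """Hypothetical check: True if the tweet is at most `max_age_days` old."""
    return now - created_at <= timedelta(days=max_age_days)


# Example: with a 365-day window, a tweet from 2018 is dropped in November 2019.
now = datetime(2019, 11, 6, tzinfo=timezone.utc)
old_tweet = datetime(2018, 1, 1, tzinfo=timezone.utc)
recent_tweet = datetime(2019, 6, 1, tzinfo=timezone.utc)
keep_old = is_within_timeframe(old_tweet, now)
keep_recent = is_within_timeframe(recent_tweet, now)
```

Passing `max_age_days=548` would approximate the 18-month window mentioned above.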

marknemm · November 7, 2019, 3:46pm

Thanks for the response. We look forward to seeing the functionality in the future!