I'm currently using Funnelback v15.20.0.
I'm trying to limit the number of documents (tweets, in this case) that are retrieved by the crawler in a Twitter collection. I've looked through the documentation for crawler configuration options that can be specified in collection.cfg under my Twitter collection directory, and I've found three options that don't seem to have any effect when crawling Twitter:
- crawler.max_files_stored
- crawler.max_files_per_area
- crawler.overall_crawl_timeout / crawler.overall_crawl_units
I've tried each of these individually and together in different permutations. The Twitter collection does not seem to honor any of these config options.
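For reference, the entries in collection.cfg looked roughly like the fragment below. The option names are the ones listed above; the values are illustrative placeholders rather than the exact ones I used, and the "min" unit is my assumption about what crawler.overall_crawl_units accepts.

```
# collection.cfg for the Twitter collection (values are placeholders)
crawler.max_files_stored=500
crawler.max_files_per_area=100
crawler.overall_crawl_timeout=10
crawler.overall_crawl_units=min
```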
As a workaround, I'm using a Custom Filter to keep track of the number of documents stored per account (the account is extracted from the document URI in the filter). This works, but it is extra code that we have to maintain, and it is wasteful because the crawler still fetches all of the (1000+) documents from an account. We simply discard the majority of them in the filter so that they are not indexed.
Is there any built-in way to limit the number of documents crawled per Twitter account (as crawler.max_files_per_area should do)?
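To make the workaround concrete, the counting logic in the filter is roughly equivalent to the sketch below. It is written in Python purely for illustration and does not use the Funnelback filter API. The class name, the limit, and the assumed URI shape (https://twitter.com/<account>/status/<id>) are all placeholders of mine.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical cap on documents kept per account.
MAX_PER_ACCOUNT = 100


class PerAccountLimiter:
    """Counts documents per account and rejects any beyond the limit."""

    def __init__(self, limit=MAX_PER_ACCOUNT):
        self.limit = limit
        self.counts = defaultdict(int)

    @staticmethod
    def account_of(uri):
        # Assumes the first path segment of the URI is the account name.
        parts = [p for p in urlparse(uri).path.split("/") if p]
        return parts[0].lower() if parts else ""

    def should_keep(self, uri):
        # Keep the document only while the account is under its limit.
        account = self.account_of(uri)
        if self.counts[account] >= self.limit:
            return False
        self.counts[account] += 1
        return True
```

Every document still has to be fetched before should_keep can reject it, which is exactly the waste described above.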