Filecopier collection file errors

I have a file copy collection where a number of files cannot be parsed properly and show up as Java errors in filecopier.log.

How do I find out which file is causing this error when it’s not listed in the file copier log?

Example:

2017-11-14 03:05:20,806 [WEBMPRP42.dmz.aqa.org.uk-Fetcher-0 (Thread 13)] INFO bytes.FlatFileStore - Stored content for key: file:///opt/funnelback/custom_data/mount/store/resources/materials-technology/AQA-3740-SAMS.PDF at /opt/funnelback/bin/../data/aqa-pdf-props-fc/offline/checkpoint/temp-store/a/7/8/1/a781f9e27946c51109b92aea320f6b3f.PDF Stored metadata for key: file:///opt/funnelback/custom_data/mount/store/resources/materials-technology/AQA-3740-SAMS.PDF at /opt/funnelback/bin/../data/aqa-pdf-props-fc/offline/checkpoint/temp-store/a/7/8/1/a781f9e27946c51109b92aea320f6b3f.PDF.fun.txt
2017-11-14 03:05:20,819 [WEBMPRP42.dmz.aqa.org.uk-Walker-2 (Thread 11)] ERROR filter.DocumentFixerFilterProvider - Aborted Document Fixing for ‘null’
java.lang.RuntimeException: StoppableCharSequence took too long - aborted
at com.funnelback.common.text.StoppableCharSequence.charAt(StoppableCharSequence.java:56)
at java.lang.Character.codePointAt(Character.java:4668)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3693)
at java.util.regex.Pattern$Curly.match1(Pattern.java:4191)
at java.util.regex.Pattern$Curly.match(Pattern.java:4134)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$Start.match(Pattern.java:3408)
at java.util.regex.Matcher.search(Matcher.java:1199)
at java.util.regex.Matcher.find(Matcher.java:592)
at java.util.regex.Matcher.replaceAll(Matcher.java:907)
at com.funnelback.common.filter.DocumentFixerFilterProvider.findTitle(DocumentFixerFilterProvider.java:481)
at com.funnelback.common.filter.DocumentFixerFilterProvider.getGoodTitle(DocumentFixerFilterProvider.java:383)
at com.funnelback.common.filter.DocumentFixerFilterProvider.fixBadTitle(DocumentFixerFilterProvider.java:283)
at com.funnelback.common.filter.DocumentFixerFilterProvider.filterStream_internal(DocumentFixerFilterProvider.java:118)
at com.funnelback.common.filter.DocumentFixerFilterProvider.filterStream(DocumentFixerFilterProvider.java:95)
at com.funnelback.common.filter.DocumentFixerFilterProvider.filterStream(DocumentFixerFilterProvider.java:143)
at com.funnelback.common.filter.ChainFilterProvider.filterStream(ChainFilterProvider.java:219)
at com.funnelback.common.filter.Filter.filterToHtml(Filter.java:83)
at com.funnelback.common.filter.Filter.filterToHtml(Filter.java:119)
at com.funnelback.filecopier.task.FilterFileTask.process(FilterFileTask.java:66)
at com.funnelback.common.workqueue.Worker.run(Worker.java:169)
2017-11-14 03:05:20,887 [WEBMPRP42.dmz.aqa.org.uk-Fetcher-1 (Thread 14)] INFO bytes.FlatFileStore - Stored content for key: file:///opt/funnelback/custom_data/mount/store/resources/materials-technology/AQA-3740-PS.PDF at /opt/funnelback/bin/../data/aqa-pdf-props-fc/offline/checkpoint/temp-store/0/2/4/c/024c8294fb00539e29db446c3486e1e5.PDF Stored metadata for key: file:///opt/funnelback/custom_data/mount/store/resources/materials-technology/AQA-3740-PS.PDF at /opt/funnelback/bin/../data/aqa-pdf-props-fc/offline/checkpoint/temp-store/0/2/4/c/024c8294fb00539e29db446c3486e1e5.PDF.fun.txt

Hi @aleks -

Looking at the first couple of lines in that log excerpt, it looks like the file that is tripping up the filter here is:

file:///opt/funnelback/custom_data/mount/store/resources/materials-technology/AQA-3740-SAMS.PDF
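
If there are a lot of these errors, reading back from each ERROR line to the nearest "Stored content for key" lines can be scripted. The following is only a rough sketch: it assumes the log line layout seen in the excerpt above, and `candidates_for_errors` and its `window` parameter are names made up for this example, not part of Funnelback. Because the Fetcher and Walker threads run concurrently, the keys it reports are candidates to check, not a guaranteed match.

```python
import re
from collections import deque

# Start of a log entry, as seen in the excerpt, e.g.
# "2017-11-14 03:05:20,819 [host-Walker-2 (Thread 11)] ERROR filter.X - message"
ENTRY = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]*)\] (?P<level>[A-Z]+) (?P<message>.*)$"
)
# The key (file URL) in a "Stored content for key: <url> at <path>" message.
STORED = re.compile(r"Stored content for key: (?P<key>.+?) at /")


def candidates_for_errors(lines, window=3):
    """Pair each ERROR entry with the last `window` stored keys seen before it.

    Hypothetical helper: returns a list of (timestamp, message, keys) tuples,
    with keys ordered oldest to newest.
    """
    recent = deque(maxlen=window)
    results = []
    for line in lines:
        entry = ENTRY.match(line)
        if entry is None:
            continue  # stack-trace lines and other continuation lines
        stored = STORED.search(entry["message"])
        if stored:
            recent.append(stored["key"])
        elif entry["level"] == "ERROR":
            results.append((entry["time"], entry["message"], list(recent)))
    return results
```

To use it, open filecopier.log and pass the file object straight in, e.g. `candidates_for_errors(open("filecopier.log", encoding="utf-8", errors="replace"))`, then check each candidate file by hand, starting with the most recent one.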

You could either exclude this file from the collection via exclude patterns, or consider removing DocumentFixerFilterProvider from the collection's filter classes.


‘null’? That is an odd value to be getting; Tika should be converting the PDF to text/html.

Try crawling only that single PDF file, with filter.classes=TikaFilterProvider set in collection.cfg.

Let’s find out what happened. Try running a search like !padrenull; if you get a single result, then it looks like filtering might have worked. Click on the cached copy: does it look correct? If you don’t see the cached copy, maybe the PDF cannot be converted.