.docx Filtering Issue

wparkinson 2017-02-21 01:01:44 UTC #1

Hello

A client of ours is having an issue crawling any .docx file. A sample of the error from crawl.log is as follows:

Crawler 4: Error: Crawling URL: https://xxxxxxxx/__data/assets/word_doc/0018/1764/policy_plain_language.docx java.lang.RuntimeException: Error extracting text for URL: https://xxxxxxxx/__data/assets/word_doc/0018/1764/policy_plain_language.docx
     at com.funnelback.crawler.utils.CrawlerUtils.filter(CrawlerUtils.java:157)
     at com.funnelback.crawler.WebCrawler.processGET(WebCrawler.java:940)
     at com.funnelback.crawler.NetCrawler.processURL(NetCrawler.java:157)
     at com.funnelback.crawler.WebCrawler.crawl(WebCrawler.java:425)
     at com.funnelback.crawler.Crawler.run(Crawler.java:407)
Caused by: java.lang.Exception: Couldn't filter stream : application-octet-stream.bin
     at com.funnelback.common.filter.Filter.filterToHtmlBytes(Filter.java:119)
     at com.funnelback.common.filter.Filter.filterToHtmlBytes(Filter.java:100)
     at com.funnelback.crawler.utils.CrawlerUtils.filter(CrawlerUtils.java:153)
     ... 4 more

Is there an inline filter change or some other fix we can apply to solve this problem? There are no custom Groovy filters, and no custom value is set for filter.tika.types.

Cheers

gtran 2017-02-21 01:15:34 UTC #2

Hmm, do you have the details of their server? i.e. Funnelback version, Windows or Linux, Squiz cloud or self-hosted?

wparkinson 2017-05-31 01:59:54 UTC #3

Sorry for the late reply on this. The server is RHEL 6.8, 2 CPU, 4GB RAM, and it is self-hosted.

plevan 2017-05-31 04:33:55 UTC #4

If it's coming back with an application/octet-stream MIME type, that is probably the cause (not the fact that it's docx). I'd check some of those links, confirm the MIME type returned is correct, and fix this before looking at Tika bugs.

It looks like the web server should be sending the following content type response header:

application/vnd.openxmlformats-officedocument.wordprocessingml.document
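
One rough way to check what the web server actually returns is a small script like the one below. It is only a sketch using the Python standard library, not anything Funnelback ships: the function names (normalise_content_type, is_docx_content_type, fetch_content_type) are made up for illustration, and it assumes the server answers HEAD requests and that the documents are reachable without authentication.

```python
import sys
import urllib.request

# Expected media type for .docx files, as quoted in the post above.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def normalise_content_type(header_value):
    """Drop any parameters (e.g. '; charset=...') and lowercase the media type."""
    if not header_value:
        return ""
    return header_value.split(";", 1)[0].strip().lower()


def is_docx_content_type(header_value):
    """Return True if the Content-Type header names the .docx media type."""
    return normalise_content_type(header_value) == DOCX_MIME


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


if __name__ == "__main__":
    # Pass one or more document URLs on the command line.
    for url in sys.argv[1:]:
        content_type = fetch_content_type(url)
        verdict = "OK" if is_docx_content_type(content_type) else "MISMATCH"
        print(f"{verdict}\t{content_type}\t{url}")
```

Run it against a few of the failing .docx URLs from crawl.log. Any line reported as MISMATCH with application/octet-stream points at the web server configuration rather than the filter.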

wparkinson 2017-05-31 04:56:14 UTC #5

Cheers Pete, that seems to be the problem. I'll try changing the content type on the Matrix end.