Hello,
A client of ours is unable to crawl any .docx files. A sample of the error from crawl.log follows:
Crawler 4: Error: Crawling URL: https://xxxxxxxx/__data/assets/word_doc/0018/1764/policy_plain_language.docx
java.lang.RuntimeException: Error extracting text for URL: https://xxxxxxxx/__data/assets/word_doc/0018/1764/policy_plain_language.docx
    at com.funnelback.crawler.utils.CrawlerUtils.filter(CrawlerUtils.java:157)
    at com.funnelback.crawler.WebCrawler.processGET(WebCrawler.java:940)
    at com.funnelback.crawler.NetCrawler.processURL(NetCrawler.java:157)
    at com.funnelback.crawler.WebCrawler.crawl(WebCrawler.java:425)
    at com.funnelback.crawler.Crawler.run(Crawler.java:407)
Caused by: java.lang.Exception: Couldn't filter stream : application-octet-stream.bin
    at com.funnelback.common.filter.Filter.filterToHtmlBytes(Filter.java:119)
    at com.funnelback.common.filter.Filter.filterToHtmlBytes(Filter.java:100)
    at com.funnelback.crawler.utils.CrawlerUtils.filter(CrawlerUtils.java:153)
    ... 4 more
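The "application-octet-stream.bin" in the cause makes me suspect, though I have not confirmed it, that the document is being handled as generic binary rather than as a Word file. In case it helps with diagnosis, here is a minimal sketch for checking which Content-Type the web server sends for the file. The helper names are my own and the URL would need to be replaced with the real one; this is not part of Funnelback.

```python
import urllib.request

# Standard MIME type registered for .docx files.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def normalise_content_type(header_value):
    """Return the bare MIME type, lower-cased, without parameters such as charset."""
    return (header_value or "").split(";", 1)[0].strip().lower()


def is_generic_binary(header_value):
    """True when the header is missing or only says application/octet-stream."""
    return normalise_content_type(header_value) in ("", "application/octet-stream")


def fetch_content_type(url):
    """Send a HEAD request and return the raw Content-Type header (or None)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type")
```

Calling `is_generic_binary(fetch_content_type(url))` against one of the failing .docx URLs would show whether the server is sending a generic type instead of `DOCX_MIME`.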
Is there an inline filter change or some other fix we can apply to solve this problem? The collection has no custom Groovy filters, and no custom value is set for filter.tika.types.
Cheers