Crawl failure - Unable to generate extracted text signature

Hi,

Does anyone know what the following message means in the log files?

“Unable to generate extracted text signature”

When I perform a crawl of my site it fails at the crawl stage.

I can access the site via a browser on the server, so accessing the site for crawling shouldn't be a problem, should it?

It's probably a good idea to check the crawl.log file (and the zipped log files as well) to see what the server is returning.
You could also try curl on a problematic URL.
Is the server returning encoded content when it hasn't been specifically requested?
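For example, a quick way to see exactly what the server sends back is something like the following (the URL is a placeholder - substitute a page from the failing site):

```shell
# Placeholder URL - replace with a URL from the failing collection.
URL="https://example.com/some-page"

# -s silences progress output, -D - dumps the response headers to stdout,
# -o /dev/null discards the body; this shows the status code and
# Content-Type / Content-Encoding headers the crawler would see.
curl -s -D - -o /dev/null "$URL"

# Fetch the first bytes of the body without sending Accept-Encoding,
# to see whether the server returns compressed content anyway.
curl -s "$URL" | head -c 200
```

Compare the Content-Type you get back against the "Mime-types parsed" list in the crawl log; a type outside that list, or a compressed body sent unsolicited, would explain empty extracted text.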

I have checked the crawl.log and nothing in there stands out to me, which leads me to wonder what I should be looking at next. If I open up a browser on the server and enter the site's address, the site shows up.

Below is what is in the crawl log:


FunnelBack: Version: 15.10.0.0
JVM: Java HotSpot™ 64-Bit Server VM 25.25-b02 (Oracle Corporation)
Operating System: Windows Server 2008 R2 6.1 (amd64)
Encoding: UTF-8
FunnelBack: Started at: Tue Mar 13 04:00:03 EST 2018
FunnelBack: License verified.
Detailed log: E:\funnelback\data\Test-site\offline\log\crawler.inline_filter.log
FunnelBack: Overall Crawl Timeout: 82800000 (ms)
Funnelback: Using pre-crawl authentication.
FunnelBack: File Store Limit: 500000
MultipleRequestsFrontier: Using specified internal frontier type for deferred request queue: com.funnelback.common.frontier.DiskFIFOFrontier
FunnelBack: Loaded: com.funnelback.crawler.NetCrawler
FunnelBack: Loaded: com.funnelback.common.frontier.MultipleRequestsFrontier:com.funnelback.common.frontier.DiskFIFOFrontier:1000
FunnelBack: Loaded: com.funnelback.crawler.scanner.RegExpHTMLScanner
FunnelBack: Loaded: com.funnelback.common.store.WarcStore
FunnelBack: Loaded: com.funnelback.crawler.StandardPolicy
FunnelBack: Loaded: com.funnelback.common.revisit.AlwaysRevisitPolicy
Cache: Table Initial Capacity: 10000
Cache: LRUCache Max Size: 500000
INFO: No portfolio information file: E:\funnelback\conf\Test-site\sites-by-portfolio.csv
INFO: No seed servers information file
CrawlStatistics: Loaded statistics classes.
FunnelBack: Loaded caches.
FunnelBack: Mime-types parsed [text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml]
FunnelBack: Protocols accepted [http,https]
FunnelBack: Robot agent matching [Funnelback]
FunnelBack: Max Size In-Memory URL Buffer Cache: 5000
FunnelBack: Storing header information
Funnelback: Added 1 URLs to frontier.
FunnelBack: Control passed to coordinator.
Coordinator: Added 1 URLs to URL Cache from start_urls_file
Coordinator: Started 10 crawler thread(s) …
Coordinator: Using overall timeout.
HTTPClientTimedRequest: Trust Everyone
HTTPClientTimedRequest: Accept/send all cookies
Monitor: Interval (secs): 30 Checkpoint Interval (secs): 1800
Monitor: Checking Config File: E:\funnelback\conf\Test-site\collection.cfg
Monitor: Printing statistics to monitor.log and crawl.log.1
Coordinator: Crawler 0 signalled completion.
Coordinator: Printing out final values to servers.log and domains.log
Coordinator: Final Checkpoint and Totals …
DNSCache: Maximum cache size: 200000
Coordinator: Finished final checkpoint.
Date: Tue Mar 13 04:00:06 EST 2018
URLs Processed: 1
Duplicates: 0
HTTP Redirects: 0
HTTP Bad Responses: 0
Network (I/O) Errors: 0
Robot NoFollow URLs: 0
Threads Active: 1
Frontiers Active: 0
Bytes In (MB): 0
Bytes Out (MB): 0
Used Memory (MB): 21
Total Memory (MB): 120
Cache Size: 1
Frontier Size: 0
Total Data Stored (MB): 0
Total Web Servers: 1
Total URLs Downloaded: 0
Total URLs Stored: 0
Coordinator: Printing out crawl statistics to .stat files in log directory.
Coordinator: Attempting to deactivate crawler threads …
Coordinator sleeping for 5 seconds before final shutdown …
Coordinator: Closing URLStore.
Coordinator: Dumping frontier to log for analysis …
Coordinator: Finished dumping frontier.
Coordinator: Finished at: Tue Mar 13 04:00:11 EST 2018
Coordinator: Finished crawl. Deactivating threads and exiting …
Coordinator: WARNING - URLStore reported no documents stored!
Command finished with exit code: 1

The short answer is that when the HTML file was parsed it didn’t extract any useful text from the document.

This is usually caused by some error (such as an authentication failure) resulting in no content being returned to Funnelback when it attempts to crawl the page.
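Since the log above shows "Using pre-crawl authentication", one way to test the authentication theory is to compare the status code with and without credentials. A sketch, assuming HTTP basic auth; the URL, user, and password are placeholders:

```shell
URL="https://example.com/start-page"   # placeholder - use your start URL

# Status code without credentials (roughly what an unauthenticated
# request from the crawler would see):
curl -s -o /dev/null -w "%{http_code}\n" "$URL"

# Status code with credentials (placeholder user/password):
curl -s -o /dev/null -w "%{http_code}\n" -u "crawluser:secret" "$URL"
```

A 401 or 403 without credentials but 200 with them would point at the pre-crawl authentication settings in the collection configuration.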

It can also happen with binary documents when they fail to filter (i.e. convert to text) - this is normally caused by the document being protected (e.g. by a password or encryption).

The crawl.log.X.gz log files might also have more details about the error, as might crawler.inline_filter.log if the cause is a filter being unable to convert the document.
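If a Unix-style shell is available on the server (e.g. Git Bash or Cygwin on this Windows box), the rotated gzipped logs can be searched without unpacking them. The log directory below is taken from the "Detailed log" line in the output above and may differ on your install:

```shell
LOGDIR="E:/funnelback/data/Test-site/offline/log"   # path from the log above

# zgrep searches gzipped (and plain) files; -i ignores case,
# -H prints the matching filename, -e supplies each pattern.
zgrep -iH -e "unable to generate" -e "exception" -e "error" "$LOGDIR"/crawl.log*

# The inline filter log may show why a binary document failed to convert:
grep -iH -e "exception" -e "error" "$LOGDIR/crawler.inline_filter.log"
```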