Is there a way to tell the crawler and indexing process to ignore superscript strings?
e.g., if the term ‘Housing’ has a superscript 1, which just indicates that there is a disclaimer attached to it, a search for the term ‘housing1’ returns that document in the results.
Is there a way to stop this during indexing?
Unfortunately, I can’t point to a live example as it is IP protected, but I wanted to ask anyway. Hope you will understand.
The best approach here is to look at the text of the cached version of the PDF document.
When a PDF is indexed, Funnelback uses Tika to extract the text from it, and whatever text Tika returns is what gets indexed.
I suspect that Tika is simply stripping out any formatting (or it’s possible that your PDF doesn’t really capture the formatting, because there is often a disconnect between what you see and what’s actually stored in the PDF itself). For example, old PDFs have no concept of paragraphs, and even things like bolding can cause duplicated letters, depending on how the PDF was generated.
After looking at the extracted text you may be able to write a filter to modify it.
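Funnelback filters are normally written in Java or Groovy against its filter framework, so the sketch below is only an illustration, in Python, of the kind of text transformation such a filter might apply: stripping a trailing footnote digit that Tika has fused onto a word. The function name and regex are assumptions, not part of any Funnelback API.

```python
import re

# Hypothetical clean-up step: when PDF text extraction fuses a superscript
# footnote marker onto a word (e.g. "Housing1"), strip the trailing one- or
# two-digit number so the index only sees the bare term.
FOOTNOTE_MARKER = re.compile(r"\b([A-Za-z]{2,})[0-9]{1,2}\b")

def strip_footnote_markers(text: str) -> str:
    """Replace words carrying a fused trailing footnote number with the bare word."""
    return FOOTNOTE_MARKER.sub(r"\1", text)

print(strip_footnote_markers("Housing1 is subject to a disclaimer."))
# -> Housing is subject to a disclaimer.
```

Note the trade-off: a blanket rule like this would also rewrite legitimate alphanumeric tokens (e.g. product codes ending in digits), so in practice you would want to tune the pattern, or restrict it to a known list of terms, after inspecting what Tika actually extracted from your documents.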
You could possibly also add the words that you’d like ignored to the spelling blacklist, and to the list of stopwords.
How does one look at the text of the cached version of the PDF document?
Oh, I see here: https://training-search.clients.funnelback.com/training/FUNL101.html
2.10. Cached results
Every page that a search engine indexes is stored locally by Funnelback. This cached version of the page can be viewed from the search results, with the search query terms highlighted within the content.
For example, searching for the term quinoa and selecting the cached link from the downward caret icon beside the URL of a result opens a cached version of the page with the queried keywords highlighted within the content. The cached version is the content as it was when Funnelback crawled the URL.