External Metadata unable to distinguish documents based on file extension

Not so much a question as a bug I've found - possibly already resolved in later versions of Funnelback.

 

We are using asset listings to generate external metadata for injecting data for documents during indexing.  It appears that Funnelback has a limitation (possibly by design?) where it is unable to distinguish between documents (or web pages, I would presume) where the only difference is in the file extension.

 

E.g. I am indexing 2 documents and trying to inject external metadata of the following format:

http://my/imaginary/path/xyz.doc a:"data_1"
http://my/imaginary/path/xyz.pdf a:"data_2"

I would expect that the first document gets set to "data_1" and second document gets set to "data_2" for metaclass a.

 

What actually happens:

  • First document gets "data_1".
  • Second document gets nothing.

If you switch the order of the lines in external_metadata.cfg then:

  • First document gets "data_2"
  • Second document still gets nothing.

If you change any part of the filename or path for one of the documents (anything except the file extension) so that it differs from the other one, the result is:

  • Both documents get the relevant data.

If you switch the order the documents are crawled (in start_urls) then the result is (assuming my original external_metadata.cfg):

  • Second document gets "data_1"
  • First document gets nothing.

 

What seems to be happening:

Funnelback indexer is not able to distinguish that documents with identical paths and filenames but different file extensions are actually different.  This is in keeping with the manuals that say if a duplicate entry is found only the first is used.  It seems Funnelback is seeing them as duplicates and just using the first.
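To make the hypothesis concrete, here is a toy Python model that reproduces the four observations listed above. It is not Funnelback's code and says nothing about how the indexer is actually implemented; the function names and the "strip the extension, first entry wins, entry is used once" logic are assumptions made purely to match the reported symptoms.

```python
from os.path import splitext


def strip_extension(url):
    """Return the URL without its trailing file extension (hypothetical key)."""
    root, _ext = splitext(url)
    return root


def apply_metadata(cfg_lines, crawl_order):
    """Toy model: metadata keyed by extension-less URL, first entry wins,
    and each entry is handed to the first crawled document that matches."""
    table = {}
    for line in cfg_lines:
        url, _, meta = line.partition(" ")
        # setdefault keeps the first entry and ignores later "duplicates"
        table.setdefault(strip_extension(url), meta)

    result = {}
    for url in crawl_order:
        # pop() means a second document with the same key gets nothing
        result[url] = table.pop(strip_extension(url), None)
    return result


doc = "http://my/imaginary/path/xyz.doc"
pdf = "http://my/imaginary/path/xyz.pdf"
cfg = [doc + ' a:"data_1"', pdf + ' a:"data_2"']

print(apply_metadata(cfg, [doc, pdf]))        # original order
print(apply_metadata(cfg[::-1], [doc, pdf]))  # cfg lines swapped
print(apply_metadata(cfg, [pdf, doc]))        # crawl order swapped
```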

 

This *may* be intentional, for example, to cater for situations where you may have both index.html and index.htm, you may want Funnelback to treat them as the same resource.

 

It also may be a bug that is fixed in a later version.

 

When is this an issue:

We have a lot of cases of this issue since we are providing both formats for most (all?) documents.  I.e. all documents will have at least a .doc version, and most have a .pdf as well.  They are essentially the same document so most often have the same name.

 

Why this is frequently not an issue in Matrix:

Once a file asset is published (assuming public read) the document is moved to the public area for Apache, which among other things adds the asset id to the path.  This results in a different (and unique) path so there's no double-up.

 

This *will* still be an issue if you are not setting documents to Public Read, e.g. trying to set up collections and indexing prior to a site going live in Matrix, documents served from a secure site like an extranet where they are not set to Public Read, etc.

 

Mitigation:

The only thing I can think of is ensuring that affected documents do not have the same filename.  A brute-force way may be to list all documents sorted by path and name and manually check for "duplicates" (or some fancy javascript would also do the trick: checking the next row, comparing filenames, etc.).
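As an alternative to checking by hand, a short script can flag the clashes directly from the external metadata lines. The sketch below is in Python rather than JavaScript and assumes each non-blank line starts with a URL followed by whitespace, as in the example earlier in this post; the function name is made up for illustration.

```python
from collections import defaultdict
from os.path import splitext


def find_extension_collisions(cfg_lines):
    """Group URLs by their extension-less form and return only the groups
    that contain more than one distinct URL."""
    groups = defaultdict(list)
    for line in cfg_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        url = line.split(None, 1)[0]  # assumption: URL is the first field
        stem, _ext = splitext(url)
        if url not in groups[stem]:
            groups[stem].append(url)
    return {stem: urls for stem, urls in groups.items() if len(urls) > 1}


sample = [
    'http://my/imaginary/path/xyz.doc a:"data_1"',
    'http://my/imaginary/path/xyz.pdf a:"data_2"',
    'http://my/imaginary/path/other.pdf a:"data_3"',
]

for stem, urls in find_extension_collisions(sample).items():
    print(stem, "->", ", ".join(urls))
```

To run it against a real collection, pass the lines of external_metadata.cfg (for example from `open(path).readlines()`) instead of the sample list.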

 

I've tried using search operators to find where metadata classes are empty.  For example, I map every asset's id to the metadata class 'I' - makes testing a breeze.  So every record should have a value for 'I' no matter what.  Trying to find an operator that shows all results where 'I' is null, e.g. I:null, but no luck so far (perhaps better minds than mine have the know-how).

 

Currently this is an issue for me only because the site is not live yet.  As explained above, once live the problem 'magically' goes away.

 

Funnelback Enterprise v12.2.1.  Not sure if it's just our version, or if there's an indexing option I need to tweak.  Mostly providing this info so if anyone has the same issue they don't spend the time I did clawing out my eyes.

Hi Tim - 

 

Thanks for your very detailed post.  There may be some additional things to investigate which could reveal the issue:

 

1. Investigate index logs

Earlier versions of Funnelback (<14.2) expected empty line breaks at the end of external_metadata.cfg - any issues encountered when processing external metadata should be output in $SEARCH_HOME/data/COLLECTION/VIEW/log/index.log

 

2. Use of URL prefixes

 

I'll see if the Funnelback R&D can confirm this, but URL prefixes are assumed to be the input format.  If multiple lines in the metadata file start with an identical prefix, only the first will be effective. (http://docs.funnelback.com/12.2/external_metadata.html)

 

3. Test queries for empty metadata fields

 

This is a worthwhile debugging method:  the syntax you'll need looks like -

-a:$++ $++

This syntax is searching for any result that does NOT have a value for metadata field 'a'.  The '$++' symbols are token delimiters.
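If you are testing several metadata classes, a small helper can build the query for each one. This is only a sketch: the function names are hypothetical, and how the query is submitted to Funnelback (endpoint, parameter names) is not covered here. The URL-encoded form is included because a literal '+' in a query string is normally decoded as a space.

```python
from urllib.parse import quote_plus


def missing_metadata_query(metadata_class):
    """Build the 'no value for this metadata class' query described above."""
    return f"-{metadata_class}:$++ $++"


def encoded_missing_metadata_query(metadata_class):
    """Same query, percent-encoded for use inside a URL query string."""
    return quote_plus(missing_metadata_query(metadata_class))


for cls in ("a", "I"):
    print(missing_metadata_query(cls), "|", encoded_missing_metadata_query(cls))
```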

Hi Tim -

 

Looks like this was a known bug in that version of Funnelback, with a patch available for versions back to 12.2.x.

 

Your best course of action is to contact support@funnelback.com requesting the patch (reference 'SUPPORT-1518').  Once applied, it should take effect after reindexing the collection in question.

Thanks Gordon

 

We'll probably hold off patching at this stage - we're going through a bit of upheaval here at the moment.  Like I said, it's not a show-stopper for us, but thanks for the info identifying the bug.  Will be very helpful if it becomes a problem for us in the future.

 

Also, really appreciate the tip using -a:$++ $++ syntax.  Works a treat and has made some previously very difficult things much, MUCH easier for me.

 

Cheers

No worries.  Best of luck!