Not so much a question as a bug I've found - possibly already resolved in later versions of Funnelback.
We are using asset listings to generate external metadata for injecting data for documents during indexing. It appears that Funnelback has a limitation (possibly by design?) where it is unable to distinguish between documents (or web pages, I would presume) where the only difference is in the file extension.
E.g. I am indexing 2 documents and trying to inject external metadata of the following format:
http://my/imaginary/path/xyz.doc a:"data_1"
http://my/imaginary/path/xyz.pdf a:"data_2"
I would expect that the first document gets set to "data_1" and second document gets set to "data_2" for metaclass a.
What actually happens:
- First document gets "data_1".
- Second document gets nothing.
If you switch the order of the lines in external_metadata.cfg then:
- First document gets "data_2"
- Second document still gets nothing.
If you change any part of the filename or path for one of the documents (anything except the file extension) so that it differs from the other, the result is:
- Both documents get the relevant data.
If you switch the order in which the documents are crawled (in start_urls), the result is (assuming my original external_metadata.cfg):
- Second document gets "data_1"
- First document gets nothing.
What seems to be happening:
The Funnelback indexer does not seem able to tell that documents with identical paths and filenames but different file extensions are actually different documents. This is in keeping with the manuals, which say that if a duplicate entry is found only the first is used. It seems Funnelback sees the two entries as duplicates and just uses the first.
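To make that guess concrete, here is a toy model in Python. It is purely my assumption about the behaviour, not Funnelback's actual code, and the function names (stem, apply_metadata) are made up for illustration. It keys each external metadata entry on the URL with the extension stripped, keeps only the first entry per key, and hands that entry to whichever matching document is crawled first.

```python
import posixpath
from urllib.parse import urlsplit


def stem(url):
    """Return (scheme, host, path without file extension) for a URL."""
    parts = urlsplit(url)
    root, _ext = posixpath.splitext(parts.path)
    return (parts.scheme, parts.netloc, root)


def apply_metadata(cfg_lines, crawl_order):
    """Toy model only: first cfg entry per stem wins, first crawled doc takes it."""
    pending = {}
    for line in cfg_lines:
        url, value = line.split(None, 1)
        # Assumption: a later entry with the same stem is treated as a duplicate.
        pending.setdefault(stem(url), value)

    result = {}
    for url in crawl_order:
        # Assumption: once an entry is used, later documents with the same stem get nothing.
        result[url] = pending.pop(stem(url), None)
    return result


doc = "http://my/imaginary/path/xyz.doc"
pdf = "http://my/imaginary/path/xyz.pdf"
cfg = [doc + ' a:"data_1"', pdf + ' a:"data_2"']

# Original order, swapped cfg order, swapped crawl order.
for lines, order in [(cfg, [doc, pdf]), (cfg[::-1], [doc, pdf]), (cfg, [pdf, doc])]:
    print(apply_metadata(lines, order))
```

This model reproduces the three outcomes listed above, which is the only reason I think the guess is plausible. It says nothing about how the indexer is really implemented.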
This *may* be intentional, for example, to cater for situations where you may have both index.html and index.htm, you may want Funnelback to treat them as the same resource.
It also may be a bug that is fixed in a later version.
When is this an issue:
We have a lot of cases of this issue since we are providing both formats for most (all?) documents. I.e. all documents will have at least a .doc version, and most have a .pdf as well. They are essentially the same document so most often have the same name.
Why this is frequently not an issue in Matrix:
Once a file asset is published (assuming public read) the document is moved to the public area for Apache, which among other things adds the asset id to the path. This results in a different (and unique) path so there's no double-up.
This *will* still be an issue if you are not setting documents to Public Read, e.g. trying to set up collections and indexing prior to a site going live in Matrix, documents served from a secure site like an extranet where they are not set to Public Read, etc.
Mitigation:
The only thing I can think of is ensuring that affected documents do not have the same filename. The brute-force way may be to list all documents sorted by path and name and manually check for "duplicates" (or some fancy JavaScript would also do the trick: checking the next row, comparing filenames, etc.).
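If you would rather not eyeball the list, the same check can be run against the generated external metadata itself. Below is a minimal sketch in Python, assuming each line starts with the URL followed by whitespace, as in my example lines above. The function name extension_only_collisions is made up for illustration, and URLs containing spaces would need extra handling.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


def extension_only_collisions(cfg_lines):
    """Group URLs that are identical apart from the file extension."""
    groups = defaultdict(list)
    for line in cfg_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the URL is the first whitespace-delimited token.
        url = line.split(None, 1)[0]
        parts = urlsplit(url)
        root, _ext = posixpath.splitext(parts.path)
        groups[(parts.scheme, parts.netloc, root)].append(url)
    # Only groups with more than one URL are potential clashes.
    return [urls for urls in groups.values() if len(urls) > 1]


sample = [
    'http://my/imaginary/path/xyz.doc a:"data_1"',
    'http://my/imaginary/path/xyz.pdf a:"data_2"',
    'http://my/imaginary/path/other.doc a:"data_3"',
]
clashes = extension_only_collisions(sample)
for urls in clashes:
    print("Differ only by extension:", ", ".join(urls))
```

To run it on a real file, pass the lines of your external_metadata.cfg (for example, the result of open(...).readlines()) instead of the sample list. Each group it reports is a set of documents where one would need renaming.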
I've tried using search operators to find where metadata classes are empty. For example, I map every asset's id to the metadata class 'I' - makes testing a breeze. So every record should have a value for 'I' no matter what. I've been trying to find an operator that shows all results where 'I' is null, e.g. I:null, but no luck so far (perhaps better minds than mine have the know-how).
Currently this is an issue for me only because the site is not live yet. As explained above, once live the problem 'magically' goes away.
Funnelback Enterprise v12.2.1. Not sure if it's just our version, or if there's an indexing option I need to tweak. Mostly providing this info so if anyone has the same issue they don't spend the time I did clawing out my eyes.