Following on from this unsuccessful thread, I think there might not be a way for Funnelback to correctly parse/index an MM YYYY or MMMM YYYY format date, unless any Funnelback staff know differently?
So… I think I need to create a script to do some fancy parsing/conversion myself. Looking around the documentation, it seems I can create a Groovy script to run post_gather to convert the date. First question: is this the correct way to change the date?
However, I'm struggling to see where to start with this. The gathered file is XML, and the date (d) metadata is mapped to a specific field in the XML. I think I need to somehow write a Groovy script that accesses and changes that value in certain situations. Which leads me to my second question: any tips on where to start?
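For what it's worth, the kind of conversion I have in mind looks something like this (a rough sketch using java.time; the input formats are guesses at what's in our data):

```groovy
import java.time.YearMonth
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeParseException

// Normalise "MM yyyy" / "MMMM yyyy" values to an unambiguous full date,
// pinning the day to the 1st of the month.
String toLongForm(String raw) {
    def inputFormats = ["MMMM yyyy", "MM yyyy"].collect {
        DateTimeFormatter.ofPattern(it, Locale.ENGLISH)
    }
    def longForm = DateTimeFormatter.ofPattern("dd MMMM yyyy", Locale.ENGLISH)
    for (fmt in inputFormats) {
        try {
            return YearMonth.parse(raw.trim(), fmt).atDay(1).format(longForm)
        } catch (DateTimeParseException ignored) {
            // fall through and try the next input format
        }
    }
    return raw   // leave anything unrecognised untouched
}

assert toLongForm("January 2001") == "01 January 2001"
assert toLongForm("01 2001") == "01 January 2001"
```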
Funnelback currently supports dates in the following formats:
If Funnelback is interpreting the supplied dates in a less than ideal manner, I would recommend modifying the documents so that they use an unambiguous date format like the “Long form” (DD MMMM YYYY, e.g. 31 January 2001). This can be done at the source or by using filters.
If you go down the filter path, please see the following link for an example of how to manipulate documents:
The general approach for the filter will be something like:
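Something along these lines, assuming jsoup and a date-normalising helper like the toLongForm sketch above (the publishdate element is a placeholder for whatever field your date metadata is mapped to):

```groovy
import org.jsoup.Jsoup
import org.jsoup.parser.Parser

// Core of the filter: parse the record as XML, rewrite the date field to
// the long form, and hand back the modified document for indexing.
String fixDates(String recordXml) {
    def doc = Jsoup.parse(recordXml, "", Parser.xmlParser())
    doc.getElementsByTag("publishdate").each { el ->
        el.text(toLongForm(el.text()))   // e.g. "January 2001" -> "01 January 2001"
    }
    return doc.outerHtml()
}
```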
I'll give that a bash with jsoup, although I don't think I can use the document.select selector since the documents are XML. I might be able to use getElementsByTag instead, though.
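A quick sanity check suggests both may work in jsoup's XML parser mode (untested against the real files):

```groovy
import org.jsoup.Jsoup
import org.jsoup.parser.Parser

def doc = Jsoup.parse("<record><publishdate>January 2001</publishdate></record>",
        "", Parser.xmlParser())
assert doc.getElementsByTag("publishdate").text() == "January 2001"
assert doc.select("publishdate").text() == "January 2001"   // select works on XML too
```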
You should start by converting (or re-creating) the collection from a local collection into either a web collection or a custom collection (depending on what your pre-index workflow does). Use of local collections isn’t recommended anymore because of limitations such as the one you’ve described.
If it’s just downloading the XML file then use a web collection.
Once you’ve converted the collection you’ll be able to use filtering.
There's some general advice on which collection type to use for various things in the Funnelback documentation.
Note: you should use the SplitXml filter that Gioan mentioned in a previous post to split the XML into individual records, then chain it with a filter that does the manipulation.
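In collection.cfg the chaining looks something like this (CustomDateFilter is a placeholder for whatever your Groovy filter class is called):

```
# Split the XML into individual records, then run the date-normalising filter
filter.classes=SplitXml:CustomDateFilter
```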
Thanks @plevan, unfortunately I can't change this to a web collection due to the files not being available over http(s). The files are also >300MB each, and I was running into difficulty with Java heap space issues using any other collection type.
There are some other options. Usually a web collection will work for this, but a custom collection may be more appropriate in this instance: it can be used to connect to whatever the arbitrary source is and fetch the files, and it can then be hooked up to a standard filter chain as for other collections.
You may also be able to use a filecopy collection in place of the local collection (this will enable you to filter), but it's not ideal, as filecopy collections have some quirks that may need to be worked around depending on what your collection is doing. This is probably the quicker solution, but not ‘best practice’.
Heap space issues can often be worked around by increasing the relevant heap setting (e.g. gather.max_heap_size) if that's where you are running out of memory.
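For example, in collection.cfg (the value is in megabytes; 4096 here is just an illustration, size it to what the host can spare):

```
# Allow the gatherer a 4 GB Java heap
gather.max_heap_size=4096
```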
To give a bit more context on our setup:

1. The source system exports 6 XML files, each ~150-300 MB (210K records in total).
2. A script performs an XSLT transformation on the XML, then copies the files from the source server to /var/tmp on the Funnelback server.
3. Funnelback performs a local index on them from /var/tmp.

We ran into difficulty using the web crawler (the files were too big to load/crawl) and the filecopier (Java heap issues when copying the files, even after massively increasing the heap size).
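One thing we may try next, given the SplitXml suggestion above, is pre-splitting the exports ourselves with a streaming parser before Funnelback picks them up, so no step ever holds a whole file in memory. A rough, untested sketch, where "record" is a stand-in for our actual wrapper element:

```groovy
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.XMLOutputFactory
import javax.xml.stream.XMLEventWriter
import javax.xml.stream.events.XMLEvent

// Stream one large export and write each record element out to its own
// file, so a 300 MB document is never loaded in full.
void splitExport(File bigXml, File outDir, String recordTag = "record") {
    def reader = XMLInputFactory.newInstance()
            .createXMLEventReader(bigXml.newInputStream())
    def outFactory = XMLOutputFactory.newInstance()
    XMLEventWriter writer = null
    int depth = 0   // tracks nesting of recordTag elements
    int count = 0
    while (reader.hasNext()) {
        XMLEvent event = reader.nextEvent()
        if (event.isStartElement()
                && event.asStartElement().name.localPart == recordTag) {
            if (depth == 0) {
                def out = new File(outDir, "record-${++count}.xml")
                writer = outFactory.createXMLEventWriter(
                        new OutputStreamWriter(out.newOutputStream(), "UTF-8"))
            }
            depth++
        }
        writer?.add(event)   // only copy events while inside a record
        if (event.isEndElement()
                && event.asEndElement().name.localPart == recordTag) {
            if (--depth == 0) {
                writer.close()
                writer = null
            }
        }
    }
    reader.close()
}
```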
Best wishes,
Andrew