Groovy post_gather script - where do I start?

Hi

Following on from this unsuccessful thread, I think there may be no way for Funnelback to correctly parse/index a date in MM YYYY or MMMM YYYY format, unless any Funnelback staff know differently?

So I think I need to write a script to do some fancy parsing/conversion myself. Looking around the documentation, it seems I can create a Groovy script that runs post_gather to convert the date. First question: is this the correct way to change the date?

However, I’m struggling to see where to start with this. The gathered file is XML, and the date (d) metadata is mapped to a specific field in the XML. I think I need to write a Groovy script that accesses and changes that value in certain situations. Which leads me to my second question: any tips on where to start?

Edit - or is it a custom filter I need? https://docs.funnelback.com/15.24/develop/programming-options/document-filtering/writing-filters.html

Thanks!
Andrew

Hi Andrew,

Funnelback currently supports dates in the following formats:

If Funnelback is interpreting the supplied dates in a less than ideal manner, I would recommend modifying the documents so that they use an unambiguous date format like the “Long form” (DD MMMM YYYY, e.g. 31 January 2001). This can be done at the source or by using filters.

If you go down the filter path, please see the following link for an example of how to manipulate documents:

The general approach for the filter will be something like (a rough sketch follows the list):

  • “Select” the date using a CSS selector
  • Convert the date string into a date object. This webpage might be useful: Groovy Goodness: Working with Dates - Messages from mrhaki
  • Add the date back into the document and ensure that this field is mapped in Funnelback using the metadata mapping screen.
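
An untested sketch of the shape of it, assuming the StringDocumentFilter interface from the 15.x filter framework (per the writing-filters page you linked) - the meta[name=date] selector, the class name and the date patterns are all made up for illustration, so adjust them to your documents and double-check the API names against your version:

    import com.funnelback.filter.api.*
    import com.funnelback.filter.api.documents.*
    import com.funnelback.filter.api.filters.*
    import org.jsoup.Jsoup
    import java.text.SimpleDateFormat

    // Sketch only: rewrites "MMMM yyyy" dates into the unambiguous long form.
    // No error handling - a real filter should cope with unparseable values.
    public class DateNormaliserFilter implements StringDocumentFilter {

        PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
            return PreFilterCheck.ATTEMPT_FILTER
        }

        FilterResult filterAsStringDocument(StringDocument document, FilterContext context) {
            def doc = Jsoup.parse(document.getContentAsString())

            // Hypothetical location of the date - adjust to your markup.
            doc.select("meta[name=date]").each { el ->
                def parsed = new SimpleDateFormat("MMMM yyyy").parse(el.attr("content"))
                // Day-of-month defaults to 1, giving e.g. "01 January 2001".
                el.attr("content", new SimpleDateFormat("dd MMMM yyyy").format(parsed))
            }

            return FilterResult.of(document.cloneWithStringContent(
                document.getDocumentType(), doc.outerHtml()))
        }
    }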

Please note that you will need to kick off new crawls to test the filters.

Let me know if you need any more information.

Hope this helps.

Thanks

~Gioan

That’s great, thank you @gtran!

I’ll give that a bash with Jsoup, although I don’t think I can use the document.select selector because the documents are XML. I might be able to use getElementsByTag instead, though.
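
From a quick look, Jsoup seems happy to parse XML too if you pass it the XML parser explicitly, in which case select and getElementsByTag should both work. Something like this, maybe (the lastUpdated tag is just made up for illustration):

    import org.jsoup.Jsoup
    import org.jsoup.parser.Parser

    def xml = '<record><lastUpdated>January 2001</lastUpdated></record>'

    // Parse as XML rather than HTML so the document structure is preserved.
    def doc = Jsoup.parse(xml, "", Parser.xmlParser())

    // Both selection styles work on the resulting tree:
    assert doc.select("lastUpdated").text() == 'January 2001'
    assert doc.getElementsByTag("lastUpdated").text() == 'January 2001'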

Thanks again!

Ahh right. If it is XML, you might be able to take bits and pieces from the following filters:

In particular, KGMetadata.groovy does some processing around XML.

Hope this helps.

~Gioan


Had a look at this yesterday and hit a bit of a roadblock getting the Groovy script to run.

I think it’s because the filters run during the “gather” phase, but I’m using a local collection which doesn’t use the gather phase. The logs show:

> Phase: Gathering content. (GatherPhase)
>   	 skipped because Operation not supported on this collection type, in 0.2s

Is there a way to still run filters on local collections?
Thanks!

EDIT - Decided to just do the conversion in my XSLT before the files even hit Funnelback.

Hi Andrew,

You should start by converting (or re-creating) the collection from a local collection into either a web collection or a custom collection (depending on what your pre-index workflow does). Use of local collections isn’t recommended anymore because of limitations such as the one you’ve described.

If it’s just downloading the XML file, then use a web collection.

Once you’ve converted the collection you’ll be able to use filtering.

There’s some general advice on the collection type to use for various things here: Redirect Notice

There is also an exercise that covers the steps for configuring a web collection to download and split XML here: http://training-search.clients.funnelback.com/training/FUNL202.html#_exercise_8_creating_an_xml_collection

Note: you should use the SplitXml filter that Gioan mentioned in a previous post to split the XML into individual records, then chain this with a filter that does the manipulation (see the snippet below).
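
Chaining is configured in collection.cfg via the filter.classes option, where a colon runs filters in sequence. Illustrative only - DateNormaliserFilter stands in for whatever your date-manipulation filter ends up being called, and you should merge this with your existing chain rather than replacing it:

    # collection.cfg - split the XML into records, then normalise the dates
    filter.classes=SplitXml:DateNormaliserFilter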

Thanks @plevan, unfortunately I can’t change this to a web collection due to the files not being available over http(s). The files are also >300MB each, and I was running into Java heap space issues with every other collection type I tried.

How are the files currently obtained?

There are some other options - usually a web collection will work for this, but a custom collection may be more appropriate in this instance - it can be used to connect to whatever the arbitrary source is and fetch the file, and can then be hooked up to a standard filter chain as for other collections.

You may also be able to use a filecopy collection in place of the local collection (this will enable you to filter), but it’s not ideal, as filecopy collections have some quirks that may need to be worked around depending on what your collection is doing. This is probably the quicker solution, but not ‘best practice’.

Heap space issues can often be worked around by adjusting the relevant heap size (e.g. gather.max_heap_size) if that’s where you are running out of memory.
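
For example, in collection.cfg (the value here is illustrative - the setting takes a size in megabytes):

    # collection.cfg - give the gather phase a 4 GB heap (illustrative value)
    gather.max_heap_size=4096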

Thanks @plevan

Current workflow is as follows:

  1. The source system exports 6 XML files, each ~150-300MB (210K records in total).
  2. A script performs an XSLT transformation on the XML, then copies the files from the source server to /var/tmp on the Funnelback server.
  3. Funnelback performs a local index on them from /var/tmp.

We ran into difficulty using the web crawler (the files were too big to load/crawl) and the filecopier (Java heap issues when copying the files, even when we massively increased the heap size).
Best wishes,
Andrew