Unable to split XML file into multiple records

Vux 2017-08-02 03:25:22 UTC #1

I have set up a web collection pointing to an XML file and mapped the individual XPaths in xml.cfg. I have also mapped 'document' to the parent node of each row I want to split on, and 'docurl' to an element that serves as a unique identifier for each record.

The update runs without any issues. However, when a query is run, the .xml file is returned as the sole result, and each field I have mapped in xml.cfg comes back as a single blob containing the values from every record.

Here is the JSON output. I noticed that fileType is being returned as html, even though the file is XML and the response headers identify it as such. Prior to adding indexer_options=-forcexml to collection.cfg, the mapped XML rows weren't returned as metadata at all. Is there anything obvious I have missed? I can type out the whole process I have taken so far if that would help.

  "results": [
    {
      "rank": 1,
      "score": 1000,
      "title": "www.matrix01.act.gov.au/__data/assets/xml_file/0008/1088531/test.xml",
      "collection": "ac-builders",
      "component": 0,
      "collapsed": null,
      "liveUrl": "http://www.matrix01.act.gov.au/__data/assets/xml_file/0008/1088531/test.xml",
      "summary": "2XL CONSTRUCTIONS PTY LTD 2016247 Builder Class C 8.",
      "cacheUrl": "/s/cache?collection=ac-builders&doc=funnelback-web-crawl.warc&off=279&len=-1&url=http%3A%2F%2Fwww.matrix01.act.gov.au%2F__data%2Fassets%2Fxml_file%2F0008%2F1088531%2Ftest.xml&profile=_default",
      "date": 1500991200000,
      "fileSize": 3508,
      "fileType": "html",
      "tier": 1,
      "docNum": 0,
      "exploreLink": null,
      "kmFromOrigin": null,
      "metaData": {
        "expirydate": "8 April 2017|18 November 2018|8 April 2018|11 September 2018|3 September 2017|30 April 2018|13 May 2017|15 August 2018|5 December 2019",
        "occupation": "Builder|Builder|Builder|Builder|Builder|Builder|Builder|Builder|Builder",
        "cln": "2016247|20151084|2016204|2015931|19968065|2008265|19894869|2012975|20121515",
        "phone": "0439490500|0408994464|0435995300|0421337744|0262601611|0438686367|0417217192|0425404085|0421274542",
        "surname": "2XL CONSTRUCTIONS PTY LTD|35 DEGREES PTY LTD|3D CONCEPTS PTY LTD|A & A BUILDING SERVICES PTY LIMITED|A & A CONTRACTORS PTY LIMITED|A & B DAL CORTIVO PTY LTD|A & D CONSTRUCTIONS|A & J BATHROOMS AUSTRALIA PTY LTD|A & J INVESTMENTS (CANBERRA) PTY LTD",
        "description": "Class C|Class C|Class C|Class B|Class C|Class C|Class C|Class B|Class C",
        "class": "3|3|3|2|3|3|3|2|3"
      },
      "tags": [
        
      ],
      "quickLinks": null,
      "displayUrl": "http://www.matrix01.act.gov.au/__data/assets/xml_file/0008/1088531/test.xml",
      "clickTrackingUrl": "/s/redirect?collection=ac-builders&url=http%3A%2F%2Fwww.matrix01.act.gov.au%2F__data%2Fassets%2Fxml_file%2F0008%2F1088531%2Ftest.xml&index_url=http%3A%2F%2Fwww.matrix01.act.gov.au%2F__data%2Fassets%2Fxml_file%2F0008%2F1088531%2Ftest.xml&auth=OSBlHAoyk3FwHJlnttdQLA&profile=_default&rank=1&query=%21padrenull",
      "explain": null,
      "indexUrl": "http://www.matrix01.act.gov.au/__data/assets/xml_file/0008/1088531/test.xml",
      "gscopesSet": [
        
      ],
      "customData": {
        
      },
      "documentVisibleToUser": true
    }
  ],
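A quick way to confirm the symptom from the JSON above is to count how many pipe-separated values each metadata field holds. The following is only a sketch: the helper name count_joined_values is hypothetical, the sample is a trimmed copy of the result above, and it assumes "|" is the separator used when several values land in one field.

```python
import json


def count_joined_values(result, separator="|"):
    """Return {field: number of separator-joined values} for one result."""
    return {
        field: len(value.split(separator))
        for field, value in result.get("metaData", {}).items()
    }


# Trimmed copy of the result above, keeping two metadata fields.
sample = json.loads("""
{
  "fileType": "html",
  "metaData": {
    "cln": "2016247|20151084|2016204|2015931|19968065|2008265|19894869|2012975|20121515",
    "class": "3|3|3|2|3|3|3|2|3"
  }
}
""")

counts = count_joined_values(sample)
print(counts)
```

If the file had been split into one document per record, each field would be expected to hold a single value, so any count above 1 suggests the split did not happen.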

gordongrace 2017-08-02 08:06:47 UTC #2

Hi @Vux -

Pasting a copy of your xml.cfg file would help uncover the issue here. Assuming the 'cln' number is unique (and a reasonable approximation of a document's URL), you'll need your xml.cfg file to contain:

document,/response/row/row
docurl,/response/row/row/cola_licence_number

Looking at http://www.matrix01.act.gov.au/__data/assets/xml_file/0008/1088531/test.xml, it appears that the row element may be unnecessarily nested.
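To illustrate what those two mappings are meant to achieve, here is a rough Python sketch of splitting on the nested row element and taking the licence number as each record's identifier. It is not how Funnelback does it internally; the sample XML is a hypothetical two-record cut-down that assumes the /response/row/row nesting described above, and split_records is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that assumes the nested <row> structure noted above.
SAMPLE = """<response>
  <row>
    <row>
      <cola_licence_number>2016247</cola_licence_number>
      <surname>2XL CONSTRUCTIONS PTY LTD</surname>
    </row>
    <row>
      <cola_licence_number>20151084</cola_licence_number>
      <surname>35 DEGREES PTY LTD</surname>
    </row>
  </row>
</response>"""


def split_records(xml_text):
    """Split on /response/row/row and use the licence number as the identifier."""
    root = ET.fromstring(xml_text)  # root is the <response> element
    records = []
    for node in root.findall("./row/row"):
        records.append({
            "docurl": node.findtext("cola_licence_number"),
            "surname": node.findtext("surname"),
        })
    return records


records = split_records(SAMPLE)
for record in records:
    print(record)
```

Each inner row becomes its own record, which is the outcome the document and docurl mappings should produce in the index.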

Define which element to split documents on

document,/response/row/row

Define which element from a document should be used as a URL

docurl,/response/row/row/cola_licence_number

surname,1,,//surname
cln,0,,//cola_licence_number
occupation,0,,//occupation
description,0,,//description
expirydate,0,,//expiry_date
class,0,,//class
phone,0,,//phone_number

Thanks for that link. I've pored over https://docs.funnelback.com/15.10/more/extra/xml_cfg.html#entire-document more than a few times while trying to debug this, but have not had any luck with it.

aleks 2017-08-03 08:42:20 UTC #4

I didn't think you could index XML documents like this in a 'web collection'.

plevan 2017-08-03 22:54:27 UTC #5

Hi Vux,

If you're using a web collection and running Funnelback 15.6 or earlier, you'll need to remove the Jsoup filter from the filter chain. Alternatively, if you're only indexing the single XML file, you can disable filtering altogether, or use a custom collection with the XML custom gatherer (downloadable from GitHub: https://github.com/funnelback/custom-gatherer-xml). I believe this custom gatherer has already been used for some of the other XML-based collections that form part of the ACT Government search.

A number of the filters that run for the content auditor write additional content to the stored file, which breaks the document structure (so your document directive won't work).
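As a rough illustration of why extra content matters, the Python sketch below shows a strict XML parser rejecting a file once something has been added after the root element. The appended div is purely a hypothetical stand-in, since I haven't reproduced exactly what the filters write, and is_well_formed is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

CLEAN = "<response><row><row><surname>A</surname></row></row></response>"

# Hypothetical stand-in for content a filter might add to the stored file.
ALTERED = CLEAN + "\n<div>extra content added after gathering</div>"


def is_well_formed(xml_text):
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(CLEAN))
print(is_well_formed(ALTERED))
```

The unmodified text parses, while the altered copy fails because of the content after the root element. Once the stored copy is in that state, an XPath-based split has nothing reliable to work from.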

Give that a go and see how you go.

cheers,
Peter

Vux 2017-08-03 23:26:51 UTC #6

You're a legend Peter, thank you.

Removing the filter has fixed it. I'll give a custom collection a try, as both of you suggested, if we ever need to include multiple XML files in this setup. I did originally set one up to use the JSON custom gatherer; however, the JSON file I was crawling had spaces where it shouldn't, which caused errors in the crawl, so it's just sitting there ready to be worked on.