Indexing XML files

DarrenBradford 2014-08-27 12:40:36 UTC #1

Hi all,

Does anyone have any experience of indexing XML files with Funnelback?

I want to crawl and index a number of XML files that contain public academic staff data with the aim of using php to parse Funnelback's XML service and create an expert directory search, but I'm stumbling at the first hurdle.

We created a local collection last year with some XML files in, but when I do a keyword search on terms I definitely know should be in at least one of those files, I don't see the results I'm expecting.

I've read http://docs.funnelback.com/13.2/xml_documents.html but I'm none the wiser really. It says I need to include this

crawler.parser.mimeTypes=text/html,text/plain,text/xml in collection.cfg but I don't know where to put it?

Any advice on how to get started with this that expands on the details in the URL above would be great!

Thanks

Darren Bradford

University of Liverpool

gordongrace 2014-08-27 23:52:44 UTC #2

Hi there, Darren -

The documentation URL you've looked at is certainly a good starting point.

If you've created a local collection, and it's updating successfully (the collection update logs could confirm this), the issue is probably with the indexing and/or query processing configuration.

Before you begin, try some test queries on the collection in the Public UI:

'!nullquery' will run a query that will show you all items that have been indexed. If you're using v13+ and the default Public UI template, each result will have a green arrow next to its URL that will allow you to view the cached copy. This will help you confirm that the collection is at least gathering the content as expected.

Observe the XML structures contained within this cached copy. You'll probably be familiar with the XML already, but note which of these XML fields are likely to be useful for searching, displaying or faceting.

Using an example from Project Gutenberg's XML dump, I've got a local collection with each record contained as its own XML file:

<rdf:RDF xmlns:cc="http://web.resource.org/cc/" xmlns:dcam="http://purl.org/dc/dcam/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:marcrel="http://id.loc.gov/vocabulary/relators" xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xml:base="http://www.gutenberg.org/">
<cc:Work rdf:about="feeds/catalog.rdf">
<cc:license rdf:resource="http://www.gnu.org/licenses/gpl.html" />
</cc:Work>
<pgterms:ebook rdf:about="ebooks/1342">
<dcterms:creator rdf:resource="2009/agents/68" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/ebooks/1342.epub.noimages" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/ebooks/1342.kindle.noimages" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/ebooks/1342.plucker" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/ebooks/1342.qioo" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/ebooks/1342.txt.utf-8" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/files/1342/1342-h.zip" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/files/1342/1342-h/1342-h.htm" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/files/1342/1342-pdf.pdf" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/files/1342/1342-pdf.zip" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/files/1342/1342.txt" />
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/files/1342/1342.zip" />
<dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1998-06-01</dcterms:issued>
<dcterms:language rdf:datatype="http://purl.org/dc/terms/RFC4646">en</dcterms:language>
<dcterms:license rdf:resource="license" />
<dcterms:publisher>Project Gutenberg</dcterms:publisher>
<dcterms:rights>Public domain in the USA.</dcterms:rights>
<dcterms:subject>
<dcterms:subject>
<dcterms:title>Pride and Prejudice</dcterms:title>
<dcterms:type></pgterms:ebook>
<pgterms:agent rdf:about="2009/agents/68">
<pgterms:birthdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1775</pgterms:birthdate>
<pgterms:deathdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1817</pgterms:deathdate>
<pgterms:name>Austen, Jane</pgterms:name>
<pgterms:webpage rdf:resource="http://en.wikipedia.org/wiki/Jane_Austen" />
</pgterms:agent>
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Jane_Austen">
<dcterms:description>Wikipedia</dcterms:description>
</rdf:Description>
<pgterms:file rdf:about="http://www.gutenberg.org/files/1342/1342-h/1342-h.htm">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">821974</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-08-21T06:59:48</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/files/1342/1342-h.zip">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">270317</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-08-21T07:00:26</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/1342.epub.noimages">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">285830</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-08-21T23:27:58.417724</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/1342.kindle.noimages">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1198390</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-08-21T23:28:01.943507</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/files/1342/1342-pdf.pdf">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1570692</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2011-11-28T13:14:32</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/files/1342/1342-pdf.zip">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1232905</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2011-11-28T13:19:30</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/1342.plucker">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">422108</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-08-21T23:28:05.568284</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/1342.qioo">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">321760</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-08-21T23:27:57.496782</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/1342.txt.utf-8">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">717569</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-08-21T23:27:57.273790</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/files/1342/1342.txt">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">717597</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-08-21T06:58:42</dcterms:modified>
</pgterms:file>
<pgterms:file rdf:about="http://www.gutenberg.org/files/1342/1342.zip">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">259663</dcterms:extent>
<dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/1342" />
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-08-21T07:00:26</dcterms:modified>
</pgterms:file>
</rdf:RDF>

For portability purposes, I've downloaded the zipped dump and extracted it to my collection's configuration folder ($SEARCH_HOME/conf/$COLLECTION/source).

By default, Funnelback doesn't index its own installation folder for local collections, so I'll need to disable this behaviour for this collection via collection.cfg (Administer > Edit Collection Settings > Indexer):

-check_url_exclusion=off -ifb

Looking at the example XML record from my collection, I've determined that there's only a handful of fields that I need indexing. Updates to xml.cfg look like:

PADRE XML Mapping Version: 2
t,1,,//dcterms:title
s,1,,//dcterms:subject/rdf:Description/rdf:value
e,0,,//dcterms:type/rdf:Description/rdf:value
d,0,,//dcterms:issued
l,0,,//dcterms:language
r,0,,//dcterms:rights
a,1,,//pgterms:agent/pgterms:name
f,0,,//pgterms:file/dcterms:format/rdf:Description/rdf:value
o,0,,//dcterms:isFormatOf@rdf:resource
g,0,,//dcterms:hasFormat@rdf:resource
I,0,,//pgterms:ebook/pgterms:marc901

Any updates to the xml.cfg file will require a reindexing before they take effect (Update > Start Advanced Update > Reindex Live View).
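
Before reindexing, it can help to confirm that each path in xml.cfg actually matches something in a sample record. The following is a minimal Python sketch, not part of Funnelback: it runs the standard library's ElementTree over a cut-down version of the record above and checks a few of the mapped paths. The `SAMPLE` string and the `matches` helper are made up for illustration, and ElementTree's XPath subset is not the engine the indexer uses, so treat a match here as a sanity check only.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes copied from the sample record's root element.
NS = {
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# Cut-down record for illustration only; real records carry many more fields.
SAMPLE = """<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <pgterms:ebook rdf:about="ebooks/1342">
    <dcterms:hasFormat rdf:resource="http://www.gutenberg.org/files/1342/1342.txt" />
    <dcterms:issued>1998-06-01</dcterms:issued>
    <dcterms:title>Pride and Prejudice</dcterms:title>
  </pgterms:ebook>
  <pgterms:agent rdf:about="2009/agents/68">
    <pgterms:name>Austen, Jane</pgterms:name>
  </pgterms:agent>
</rdf:RDF>"""


def matches(root, path, attribute=None):
    """Return the text (or a named attribute) of every element matching path."""
    found = root.findall(path, NS)
    if attribute is None:
        return [(el.text or "").strip() for el in found]
    # ElementTree stores namespaced attributes as "{uri}local".
    prefix, local = attribute.split(":")
    return [el.get("{%s}%s" % (NS[prefix], local)) for el in found]


root = ET.fromstring(SAMPLE)
titles = matches(root, ".//dcterms:title")
authors = matches(root, ".//pgterms:agent/pgterms:name")
formats = matches(root, ".//dcterms:hasFormat", attribute="rdf:resource")
print(titles, authors, formats)
```

An empty list for any path suggests the corresponding xml.cfg line would index nothing for that record.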

A few more modifications to my collection.cfg file are required to ensure that these fields are returned in the data model when I conduct a query. Further detail is available in the Custom Summaries documentation.

My final collection.cfg file looks like:

#
# Filename: /opt/funnelback/conf/project-gutenberg/collection.cfg
# Last Update: Thu Apr 24 14:17:46 2014 
#
click_tracking.restrict_redirects_to_existing_urls_and_fps=true
collection=project-gutenberg
collection_type=local
data_root=$SEARCH_HOME/conf/$COLLECTION_NAME/source
indexer_options=-check_url_exclusion=off -ifb
query_processor_options=-stem=2 -SM=meta -SF=acdfgiotI -countgbits=all
service_name=Project Gutenberg
ui_cache_link=/s/cache.html

Looking at the JSON output for my search query for 'pride and prejudice', I can see that all indexed fields are now coming back in my results' summaries:

/s/search.json?collection=project-gutenberg&query=pride%20and%20prejudice

...
results: [
{
rank: 1,
score: 1000,
title: "Pride and Prejudice",
collection: "project-gutenberg",
component: 0,
collapsed: null,
liveUrl: "file:///opt/funnelback/conf/project-gutenberg/source/cache/epub/42671/pg42671.rdf",
summary: null,
cacheUrl: "/s/cache.html?collection=project-gutenberg&doc=cache/epub/42671/pg42671.rdf&off=0&len=-1&url=file%3A%2F%2F%2Fopt%2Ffunnelback%2Fconf%2Fproject-gutenberg%2Fsource%2Fcache%2Fepub%2F42671%2Fpg42671.rdf&profile=_default_preview",
date: 1368021600000,
fileSize: 11840,
fileType: "xml",
tier: 1,
docNum: 2966,
exploreLink: null,
kmFromOrigin: null,
metaData: {
f: "text/html|application/zip|text/html|application/epub+zip|application/epub+zip|application/x-mobipocket-ebook|application/x-mobipocket-ebook|application/prs.plucker|application/x-qioo-ebook|text/plain|text/plain;",
g: "http://www.gutenberg.org/ebooks/42671.epub.images|http://www.gutenberg.org/ebooks/42671.epub.noimages|http://www.gutenberg.org/ebooks/42671.kindle.images|http://www.gutenberg.org/ebooks/42671.kindle.noimages|http://www.gutenberg.org/ebooks/42671.pluc",
d: "2013-05-09",
t: "Pride and Prejudice",
a: "Austen, Jane|Chapman, R. W. (Robert William)",
o: "ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671|ebooks/42671"
},
tags: [ ],
quickLinks: null,
displayUrl: "/opt/funnelback/conf/project-gutenberg/source/cache/epub/42671/pg42671.rdf",
clickTrackingUrl: "/s/redirect?rank=1&collection=project-gutenberg&url=file%3A%2F%2F%2Fopt%2Ffunnelback%2Fconf%2Fproject-gutenberg%2Fsource%2Fcache%2Fepub%2F42671%2Fpg42671.rdf&index_url=file%3A%2F%2F%2Fopt%2Ffunnelback%2Fconf%2Fproject-gutenberg%2Fsource%2Fcache%2Fepub%2F42671%2Fpg42671.rdf&auth=hTZZnzKI8emfGclXIjPyxg&query=pride+and+prejudice&profile=_default_preview",
explain: null,
indexUrl: "file:///opt/funnelback/conf/project-gutenberg/source/cache/epub/42671/pg42671.rdf",
customData: { }
},
...
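
For anyone consuming this output from a script, here is a small Python sketch of reading the metadata fields back out. It works on a trimmed stand-in for the excerpt above rather than a live response: only the `results` list is modelled, since whatever wraps it is elided in the excerpt. The `split_field` and `summarise` helpers are hypothetical names, and splitting on `|` is an assumption based on how multi-valued fields appear in the output above.

```python
import json

# Trimmed stand-in for the excerpt above. Only the "results" list is modelled;
# the surrounding structure of a real response is not guessed here.
SAMPLE = """
{
  "results": [
    {
      "rank": 1,
      "title": "Pride and Prejudice",
      "fileType": "xml",
      "metaData": {
        "d": "2013-05-09",
        "t": "Pride and Prejudice",
        "a": "Austen, Jane|Chapman, R. W. (Robert William)"
      }
    }
  ]
}
"""


def split_field(value):
    """Split a pipe-separated metadata value into a list, dropping empty parts."""
    return [part.strip() for part in value.split("|") if part.strip()]


def summarise(result):
    """Pick out the mapped fields (t, a, d) from one result."""
    meta = result.get("metaData", {})
    return {
        "title": meta.get("t", result.get("title")),
        "authors": split_field(meta.get("a", "")),
        "issued": meta.get("d"),
    }


records = [summarise(r) for r in json.loads(SAMPLE)["results"]]
print(records)
```

The same approach should carry over to PHP or any other language with a JSON parser.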

Finally, some faceting by Author, Subject, etc. would be nice. The following ended up in faceted_navigation.cfg:

<Facets qpoptions=" -rmcf=elrafs -count_dates=d">
  <Data></Data>
  <Facet>
    <Data>Author</Data>
    <MetadataFieldFill>
      <Data>a</Data>
    </MetadataFieldFill>
  </Facet>
  <Facet>
    <Data>Subject</Data>
    <MetadataFieldFill>
      <Data>s</Data>
    </MetadataFieldFill>
  </Facet>
  <Facet>
    <Data>Release Date</Data>
    <DateFieldFill>
      <Data>d</Data>
    </DateFieldFill>
  </Facet>
  <Facet>
    <Data>Language</Data>
    <MetadataFieldFill>
      <Data>l</Data>
    </MetadataFieldFill>
  </Facet>
  <Facet>
    <Data>Format</Data>
    <MetadataFieldFill>
      <Data>f</Data>
    </MetadataFieldFill>
  </Facet>
  <Facet>
    <Data>Licence</Data>
    <MetadataFieldFill>
      <Data>r</Data>
    </MetadataFieldFill>
  </Facet>
  <Facet>
    <Data>Type</Data>
    <MetadataFieldFill>
      <Data>e</Data>
    </MetadataFieldFill>
  </Facet>
</Facets>

jebio 2014-09-10 14:47:25 UTC #3

Hi gordongrace,

I followed your instructions but I can't get Funnelback to parse the XML files properly. It works as a normal search (i.e. a normal search returns the XML files that contain the keyword).

However, instead of a general search, I want to index the XML files by just one element (for example, <FirstName>...</FirstName>). So if I search for the word 'John', it should only return the XML files with an element <FirstName>John</FirstName>. If the word 'John' appears anywhere else within the XML file, it should not be indexed and should not appear in the search results.

One thing I've noticed as well: in your JSON output above, the system classifies your XML/RDF as

fileType: "xml",

However, on mine, even though all the files within the local collections are all valid XML files, the system classifies it as

fileType: "html",

Could that be the reason why, no matter how I define it in xml.cfg, it won't map properly to metadata?

Sorry if I can't explain it in better terms. It's my first time using the FunnelBack system.

Regards,

Joen

jebio 2014-09-11 11:36:29 UTC #4

I found out what's wrong now. Even though the files had the .XML extension and contained valid XML, I was missing the following line at the top of my XML files.

<?xml version="1.0" encoding="UTF-8"?>

Without this, Funnelback was treating the XML files as HTML, so the metadata mapping didn't work.
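
For a folder with many files, a quick script can find the ones missing the declaration. This is a minimal Python sketch, not a Funnelback tool: `has_xml_declaration` and `with_declaration` are hypothetical helper names, and it assumes the files really are UTF-8 encoded, since the declaration it prepends says so.

```python
BOM = b"\xef\xbb\xbf"
DECLARATION = b'<?xml version="1.0" encoding="UTF-8"?>\n'


def strip_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark, if present."""
    return data[len(BOM):] if data.startswith(BOM) else data


def has_xml_declaration(data: bytes) -> bool:
    """True if the content begins with an XML declaration (ignoring a BOM)."""
    body = strip_bom(data)
    # Require whitespace after "<?xml" so "<?xml-stylesheet" does not count.
    return body.startswith(b"<?xml") and body[5:6].isspace()


def with_declaration(data: bytes) -> bytes:
    """Return the content without a BOM, prepending a declaration if it lacks one."""
    body = strip_bom(data)
    return body if has_xml_declaration(body) else DECLARATION + body


sample = b"<FirstName>John</FirstName>"
fixed = with_declaration(sample)
print(fixed.decode("utf-8"))
```

To apply it to real files, read each one in binary mode, pass the bytes through `with_declaration`, and write the result back, ideally after taking a backup.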

gordongrace 2014-09-17 16:31:57 UTC #5

Good pickup, jebio.

Even if that line wasn't present in your XML, you could also force Funnelback's indexer to treat all plaintext content as XML (and use xml.cfg instead of metamap.cfg):

#collection.cfg
indexer_options=-forcexml

It's also possible that the XML was retrieved from a web server that incorrectly reported the document's MIME type.

jebio 2014-09-22 14:55:23 UTC #6

See also:

http://docs.funnelback.com/indexer_options_collection_cfg.html#C.%20Controlling%20how%20things%20are%20indexed

Thanks for the suggestion, gordongrace. I've got another XML-related question.

Is there any way I can change the template for the generated search.xml?

What I want to do is, instead of parsing the metadata from search.xml as shown below,

..
<metaData>
<entry>
<string>T</string>
<string>Prof</string>
</entry>
<entry>
<string>W</string>
<string>john-smith</string>
</entry>
<entry>
<string>F</string>
<string>John</string>
</entry>
<entry>
<string>S</string>
<string>123456</string>
</entry>
<entry>
<string>L</string>
<string>Smith</string>
</entry>
<entry>
<string>J</string>
<string>Sample Job</string>
</entry>
</metaData>
..

I want to modify my search.xml output so the metadata will be displayed as

<title>Prof</title>
<webid>john-smith</webid>
<firstname>John</firstname>
<lastname>Smith</lastname>
<id>123456</id>
<job>Sample Job</job>

Note: meta data will be based on the one I defined within xml.cfg

Thanks for helping out. I hope my question makes sense.

gordongrace 2014-09-23 08:16:14 UTC #7

Hi Jebio -

Output produced by /s/search.xml will always conform to Funnelback's XML data model for a given version of Funnelback. Changes to xml.cfg will only affect which fields are indexed, rather than altering the structure of the XML output.

You may be better off using an approach like the following:

Ensure all indexable XML from your source data is mapped using xml.cfg (it sounds as though you've already done this)

Create a new search form (xml_output.ftl) that outputs XML in your desired syntax (some work with Freemarker templating will be required)

Test the output using /s/search.html?form=xml_output

Ensure that the 'xml_output' form is reported as an XML content type:
```
#collection.cfg
ui.modern.form.xml_output.content_type=text/xml
```

Note that your xml_output template would probably be a very stripped-back version of the simple.ftl form, possibly excluding facets, pagination, etc.

See:

http://docs.funnelback.com/ui_modern_form_content_type_collection_cfg.html

http://docs.funnelback.com/search_forms.html
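
If templating turns out to be awkward, a different route is to leave /s/search.xml as it is and reshape the metaData block on the client side. This is not the template approach described above, just a minimal Python sketch of that alternative. The letter-to-name mapping is taken from the desired output in the question; the `person` wrapper element and the `remap` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Metadata class letter -> element name, taken from the desired output in the question.
FIELD_NAMES = {
    "T": "title",
    "W": "webid",
    "F": "firstname",
    "L": "lastname",
    "S": "id",
    "J": "job",
}

# The metaData block as it appears in the question.
SAMPLE = """<metaData>
  <entry><string>T</string><string>Prof</string></entry>
  <entry><string>W</string><string>john-smith</string></entry>
  <entry><string>F</string><string>John</string></entry>
  <entry><string>S</string><string>123456</string></entry>
  <entry><string>L</string><string>Smith</string></entry>
  <entry><string>J</string><string>Sample Job</string></entry>
</metaData>"""


def remap(meta_xml, wrapper="person"):
    """Turn key/value <entry> pairs into named child elements of a wrapper."""
    out = ET.Element(wrapper)
    for entry in ET.fromstring(meta_xml).findall("entry"):
        strings = entry.findall("string")
        if len(strings) != 2:
            continue  # skip anything that is not a simple key/value pair
        key, value = strings[0].text, strings[1].text
        name = FIELD_NAMES.get(key)
        if name is not None:  # unmapped letters are ignored
            ET.SubElement(out, name).text = value
    return out


person = remap(SAMPLE)
print(ET.tostring(person, encoding="unicode"))
```

The trade-off is that every consumer of the XML needs this step, whereas a custom form does the reshaping once on the server.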

jebio 2014-09-23 11:12:14 UTC #8

indexer_options=-check_url_exclusion=off -ifb -forcexml -RSDTF2000 -RSTXT2000 -big10
Can you please advise? thanks :)

gordongrace 2014-09-23 12:57:45 UTC #9

By default, Funnelback will attempt to index very long words (see the -dilw indexer option).

If you're seeing the summary being truncated, you may need to adjust some of the presentation options at query time

I'd experiment with:

#collection.cfg
query_processor_options=-SBL=1024 -MBL=1024 ...

These will also work as CGI parameters.

See:

http://docs.funnelback.com/query_processor_options_collection_cfg.html#F.%20Presentation%20options

http://docs.funnelback.com/custom_summaries.html
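
To try different values without editing collection.cfg each time, the options can be appended to the request URL. Here is a minimal Python sketch of building such a URL. The host name is hypothetical, the collection name is borrowed from later in this thread, and passing the options as `SBL` and `MBL` without the leading dash is an assumption about how the CGI form of these options is spelled.

```python
from urllib.parse import urlencode

BASE = "https://search.example.com/s/search.json"  # hypothetical host


def search_url(collection, query, **display_options):
    """Build a search URL, appending any display options as extra parameters."""
    params = {"collection": collection, "query": query}
    params.update(display_options)
    return BASE + "?" + urlencode(params)


# Assumed parameter spelling: the option names without the leading dash.
url = search_url("test-directory", "john", SBL=1024, MBL=1024)
print(url)
```

Once a value works, it can be moved into query_processor_options so that every request picks it up.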

jebio 2014-09-23 15:14:35 UTC #10

Hi gordongrace,

Thanks for your support. Setting -SBL=1024 -MBL=1024 made a difference: the parsed metadata increased from 247 characters to 494 characters. However, that seems to be the limit. Even if I increase them to -SBL=2000 -MBL=2000, it doesn't exceed 494 characters. The metadata I'm trying to parse is 1483 characters long.

Here's my collection.cfg so far

crawler.parser.mimeTypes=text/xml
click_tracking.restrict_redirects_to_existing_urls_and_fps=true
collection=test-directory
collection_type=local
data_root=$SEARCH_HOME/test/test/test/test
indexer_options=-check_url_exclusion=off -ifb 
query_processor_options=-stem=2 -SM=meta -SF=TFLRUPDY -SBL=1024 -MBL=1024
service_name=Test Directory
ui_cache_link=/search/cache.cgi
ui.modern.form.simple.content_type=text/xml

Adding the following doesn't help either

indexer_options=-RSDTF2000 -RSTXT2000

Can you still think of any other thing that I need to set? Thanks a lot.

gordongrace 2014-09-23 19:10:29 UTC #11

If you can post the XML snippet that you're having trouble with, it might assist further.

It sounds as though you're having issues with the field length, rather than a word length?

Is there a non-standard character in the feed that might be causing invalid XML - have you tried enclosing the fields in CDATA tags?

jebio 2014-09-24 08:43:33 UTC #12

Hi gordongrace,

I've attached 3 files:

1. the xml i'm trying to parse

2. the xml_output.ftl

3. the test output using /s/search.html?form=xml_output

Thanks :)

Files.zip (2.21 KB)

gordongrace 2014-09-25 15:18:47 UTC #13

Thanks, Jebio -

I've been able to produce your desired behaviour for that field (using only the input XML file and the default Funnelback JSON output from /s/search.json).

The setting you'll need to alter is:

#collection.cfg
indexer_options=-check_url_exclusion=off -ifb -forcexml -mdsfml4096

Note the 'mdsfml' (MetaData Summary Field Maximum Length) setting, increased from its default to 4096 characters.

Your template looks fine - the query processor options for displaying that maximum metadata field length should probably be brought into alignment with the Maximum Metadata Field length.

jebio 2014-09-25 15:30:55 UTC #14

Thanks gordongrace! That worked perfectly! I almost devoured the whole documentation, trying nearly every logical parameter and config. For some strange reason, I never tried -mdsfml because the docs say "Set maximum length for strings in .mdsf file" and I'd never seen any .mdsf file, so I thought it was irrelevant. I didn't realise mdsfml means MetaData Summary Field Maximum Length. You're brilliant! Thanks a lot for your help and patience on this. :)

gordongrace 2016-12-01 11:25:19 UTC #15