Include a summary field in results with highlighted keyword matches?

Using the search.json API, I’d like each result to include a summary snippet and, if possible, have the keyword matches within the snippet highlighted.

For example:

GET search.json?query=physics

...
"summary": "This is a vocational course in applied **physics** for anyone with a background in the **physics** or engineering...
...

I’ve been checking this page in the docs - Search result summaries - Funnelback Documentation - Version 15.24.0

Here are the query processor options we have set for the collection:
-stem=2 -SM=meta -SF=[a,c,d,9,Z,A,B,S,M,D,T,C,L,E,l,U,V,X,j,k,m]

To my understanding, because we have -SM=meta, only metaData fields are shown, so with this setting each result has "summary": null
If I change it to something like -SM=qb I no longer see null, just an empty string, e.g. "summary": ""

Is there anything else I need to configure?

When indexing, we’re also using an XML file containing all the fields we want to crawl for this collection, and these are mapped in xml.cfg. These are all present in metaData.* properties.

There are two separate issues here.

  1. Highlighting is normally handled within the UI layer - so if you’re using the JSON endpoint you’d be expected to implement the highlighting yourself when you present the data, using the query field as your highlight terms.

There is a query processor option where strong tags can be returned in the data model, though I would recommend you handle the highlighting yourself as described above. - see the SHLM and mdsfhl options at: Padre Query Processor Options - Funnelback Documentation - Version 15.18.0

  2. If you wish to have summaries and also have access to the metadata fields, you should use SM=both (instead of SM=meta). You should check your metadata mappings though, as only fields that have a search behaviour of ‘searchable as content’ (15.14 or newer - see: Metadata classes and mapping - Funnelback Documentation - Version 15.18.0), or that are set up as type 1 metadata in the metamap.cfg (15.12 and earlier: metamap.cfg - Funnelback Documentation - Version 15.12.0) or xml.cfg (15.12 and earlier: xml.cfg - Funnelback Documentation - Version 15.12.0), are indexed as document content.

You also should check the special XML field for unmapped content. In 15.14 and newer this is controlled via the XML documents configuration.

In 15.12 and earlier this is the ‘-’ special field that’s included in the xml.cfg (xml.cfg - Funnelback Documentation - Version 15.12.0). This field controls the behaviour of how unmapped fields are treated (and if they are indexed as unfielded document content).
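
The client-side highlighting suggested in point 1 could look roughly like the sketch below. It is only an illustration: the function name `highlight` and the choice of `<strong>` tags are assumptions, and it treats the query as plain whitespace-separated words (no query operators, no stemming).

```python
import re


def highlight(text, query, open_tag="<strong>", close_tag="</strong>"):
    """Wrap whole-word, case-insensitive matches of each query term in tags."""
    terms = [re.escape(t) for t in query.split() if t]
    if not terms:
        return text
    # Longest terms first so the alternation prefers the longest match.
    terms.sort(key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)
    # Keep the original casing of the matched text inside the tags.
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, text)
```

The query string sent to search.json would be passed as `query`, and the text to decorate (a summary or a metadata field) as `text`.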


Thanks for the reply.

I’ve changed -SM to both (instead of meta) and this gives me a summary that’s not null, but just an empty string e.g. "summary": "" even though the search results appear relevant to the query so I imagine they’re being indexed properly on update.

Below is our xml.cfg file:

PADRE XML Mapping Version: 2
t,1,,//title
document,/courses/course
docurl,/courses/course/url
L,1,,//level
S,1,,//subject
D,1,,//distancelearning
j,1,,//parttime
k,1,,//degreepreparation
m,1,,//overview
c,1,,//content
M,1,,//faculty

Notice that all of these are set to searchable content (1) now.

Below is our XML schema:

<courses>
    <course>
        <title>Accounting</title>
        <url>
            https://www...ac.uk/courses/undergraduate/accounting/
        </url>
        <overview>
            <p>Studying accounting degree with us... blah blah blah</p>
        </overview>
        <content>
            <p>The Development Programme is a core element... blah blah blah</p>
        </content>
        <level>Undergraduate</level>
        <faculty>business school</faculty>
        <subject>Accounting and finance</subject>
        <distancelearning/>
        <parttime/>
        <degreepreparation>Y</degreepreparation>
    </course>
    <course>
        <title>Business Analysis</title>
        ...

I’m kinda expecting something from one of these fields to be in the summary, is that right?

I’m also seeing another strange thing that I maybe ought to open a new ticket for (not sure if it has anything to do with these changes): in the JSON, our content and overview metadata is being truncated and I’m not sure why:

{
	"question": {...},
	"response": {
		"resultPacket": {
			...
			"results": [{
						"rank": 1,
						"score": 1000,
						"title": "Art & Design",
						"collection": "uos-courses-xml",
						"component": 0,
						"collapsed": null,
						"liveUrl": "https://www...ac.uk/courses/pg/educationartdesign/",
						"summary": "",
						"cacheUrl": "/s/cache?collection=uos-courses-xml&doc=funnelback-web-crawl.warc&off=784948&len=7100&url=https%3A%2F%2Fwww...ac.uk%2Fcourses%2Fpostgraduatetaught%2Fsecondaryeducationartdesign%2F&profile=_default_preview",
						"date": null,
						"fileSize": 0,
						"fileType": "txt",
						"tier": 1,
						"docNum": 404,
						"exploreLink": null,
						"kmFromOrigin": null,
						"quickLinks": null,
						"displayUrl": "https://www...ac.uk/courses/pg/educationartdesign/",
						"clickTrackingUrl": "/s/redirect?collection=uos-courses-xml&url=https%3A%2F%2Fwww...ac.uk%2Fcourses%2Fpostgraduatetaught%2Fsecondaryeducationartdesign%2F&index_url=https%3A%2F%2Fwww...ac.uk%2Fcourses%2Fpostgraduatetaught%2Fsecondaryeducationartdesign%2F&auth=YgI0FkLuoXYncMjEfBW9cg&profile=_default_preview&rank=1&query=arts",
						"explain": null,
						"indexUrl": "https://www...ac.uk/courses/pg/educationartdesign/",
						"gscopesSet": [],
						"documentVisibleToUser": true,
						"promoted": false,
						"diversified": false,
						"metaData": {
							"c": "<p>As part of the course, you’ll have the opportunity to submit two of your assignments at Masters level. If you do this, you’ll be almost halfway towards a Masters in Education qualification. If you go on to do the Masters in Education, all",
							"S": "Education,Teaching",
							"k": "Y",
							"L": "Postgraduate taught",
							"m": "<p>As a teacher of art and design, you’ll be responsible for uncovering the creative spark in each and every student you teach. You’ll open young people’s eyes to endless creative possibilities and show them new means by which they can express"
						},
						"tags": [],
						"customData": {}
					},
					{
                        ...

The full text for these fields is present in the XML feeds, but I’m not sure why they are truncated by Funnelback. Anything obvious in my configuration that might be causing this?

Thanks

Funnelback will automatically truncate metadata fields over a certain size. This limit can be increased by setting the following indexer option:

-mdsfml<n>
Set the number of bytes used for MetaData Summary Field Maximum Lengths. Fields larger than this number will be truncated. Default is 2048.

There should be some messages in the Step-Index.log indicating that metadata has been truncated (and possibly providing the length encountered, depending on which version you’re running).

It’s also possible that the field is being fully indexed but truncated at display time - there is a metadata buffer and if this fills up metadata will be truncated (though it’s more likely to be the previous setting above).

The metadata buffer length is set using the MBL query processor option.

The lack of summary is a bit odd. I’ll ask around about that, but is there a reason why you’d prefer to use the summary value instead of targeting a specific metadata field? Summaries are generally a fallback if you don’t have a decent description to use because they will often make a lot less sense than presenting a properly crafted description (due to the fact that the summary text could come from a number of places).
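
Targeting a specific metadata field from the JSON could be sketched as follows. This assumes each result looks like the JSON posted earlier in the thread (a "summary" key plus a "metaData" object keyed by metadata class); the helper names `plain_text` and `describe`, and the preference order of fields m then c, are illustrative choices only.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def plain_text(value):
    """Strip markup such as <p>, decode entities and collapse whitespace."""
    return " ".join(html.unescape(TAG_RE.sub(" ", value)).split())


def describe(result, fields=("m", "c")):
    """Return the summary if it is non-empty, else the first non-empty metadata field."""
    summary = result.get("summary")
    if summary:
        return summary
    meta = result.get("metaData") or {}
    for name in fields:
        value = meta.get(name)
        if value:
            return plain_text(value)
    return ""
```

Each entry of response.resultPacket.results would be passed to `describe`, so a null or empty summary falls back to the overview (m) and then the content (c) field.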

  1. mdsfml

We certainly haven’t set these options so it ought to be using the defaults. However, just to see if it makes any difference, if I try setting this in config to -mdsfml=10000 then the update fails. If I simply try setting this to the default (2048) then it still fails. Below is a snippet from collection.cfg:

changeover_percent=0
...
query_processor_options=-stem=2 -SM=both -SF=[a,c,d,9,Z,A,B,S,M,D,T,C,L,E,l,U,V,X,j,k,m] -mdsfml=2048

(By the way, is -mdsfml=2048 the correct format? Looking at the docs it’s written as -mdsfml<n>, so I even tried -mdsfml2048, but again it fails.)

  2. MBL

BUT, I think your second suggestion works: when I set -MBL=1000 there is no longer any truncation on the content fields in the metadata. I can monitor this in case we need to alter the value. I can also see it taking effect when I add or remove the option and run Update. So progress there, thanks.

  3. Summary field

Regarding the summary field, what I ultimately want to display is a snippet of the text relevant to the search term. See this screenshot - http://rtyn.biz/staticfiles/summary.png - notice that it shows multiple snippets and inserts “…” between them; “arts” is highlighted, but so is “art” (singular), which would be slightly bothersome to handle in the UI. However, this is a screenshot of a Freemarker template, not the API which I intend to use. Looking at the template, I can see it’s using the summary field as well as the bolderize method, which we won’t have. Even so, it would be helpful if the API could provide strong tags, which the -SHLM option might do(?). So it would be really good if we can get this working. Anyway, I hope that explains what we’re aiming for.

mdsfml is an indexer setting that controls the size of index allocated to storing metadata. This is set in the collection.cfg as an indexer_option which sets various options that are applied when the index is built. e.g. indexer_options= -mdsfml10000

MBL is a query time setting that affects how large a buffer is used for the display of metadata. This is set in collection.cfg as a query_processor_option. If MBL has fixed your issue then that’s great.

I have spoken with the developers and it sounds like metadata isn’t used as a source for the query biased summaries, so your only option might be to try the SHLM query processor option and, failing that, write your own function to perform the snippeting of the desired field. I’ll raise a product improvement ticket about that but it’s unlikely to be looked at any time soon.
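
A hand-rolled snippeting function of the kind mentioned above might look something like this. It is a minimal sketch rather than how Funnelback builds its own summaries: the name `snippet`, its parameters, and the “ … ” separator are assumptions, matching is whole-word and case-insensitive with no stemming (so “arts” would not match “art”), and fragments are cut by character count so they can start or end mid-word.

```python
import html
import re


def snippet(field_value, query, width=60, max_fragments=3, sep=" … "):
    """Build a query-biased snippet from one metadata field value."""
    # Strip tags, decode entities and collapse whitespace.
    text = " ".join(html.unescape(re.sub(r"<[^>]+>", " ", field_value)).split())
    terms = [re.escape(t) for t in query.split() if t]
    if not terms:
        return text[: 2 * width].strip()
    pattern = re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)

    windows = []  # [start, end] character ranges around each match
    for m in pattern.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding one.
            windows[-1][1] = max(windows[-1][1], end)
        else:
            if len(windows) == max_fragments:
                break
            windows.append([start, end])

    if not windows:
        # No term found: fall back to the start of the field.
        return text[: 2 * width].strip()
    return sep.join(text[s:e].strip() for s, e in windows)
```

For example, `snippet(result["metaData"]["c"], "arts")` would return up to three fragments of the content field around each match, which the UI layer can then highlight.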