Sep 2017

What would be the best way to grab the lang attribute from an HTML document? i.e.
<html lang="en">

I'm guessing that metamap.cfg classes don't pick up on the HTML tag, so
l,0,lang wouldn't work.

I've also tried the following settings for the metadata scraper, pointing it at the HTML tag:

[{
"urlRegex": "http://www\.site\.org",
"metadataName": "dc.language",
"elementSelector": "html",
"applyIfNoMatch": false,
"extractionType": "attr",
"attributeName": "lang"
}]

But no luck with this either.

Hey Tim,

The metadata scraper can definitely be used for this.

For example, the language is correctly extracted from:
https://www.theguardian.com/au

into a field called fb.language using the following metadata scraper configuration:

[{
  "urlRegex": "theguardian\\.com/",
  "metadataName": "fb.language",
  "elementSelector": ":root",
  "attributeName": "lang",
  "applyIfNoMatch": false,
  "extractionType": "attr",
  "processMode": "regex",
  "value": "(.+)",
  "description": "Get language from document"
}]
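
As a rough illustration of what this rule asks for (not Funnelback's actual implementation), the Python sketch below reads the lang attribute from the root html element and passes it through the (.+) pattern. The class and function names are hypothetical, and it assumes the first capture group of "value" is what ends up in the metadata field.

```python
import re
from html.parser import HTMLParser


class RootLangParser(HTMLParser):
    """Record the lang attribute of the <html> element, if present."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling this.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def extract_lang(html, value_regex=r"(.+)"):
    """Return the captured lang value, or None if there is nothing to extract."""
    parser = RootLangParser()
    parser.feed(html)
    if parser.lang is None:
        return None
    # Assumption: the first capture group is the extracted value.
    match = re.search(value_regex, parser.lang)
    return match.group(1) if match else None


sample = '<!DOCTYPE html><html lang="en"><head><title>t</title></head><body></body></html>'
print(extract_lang(sample))  # en
```

Running something like this against a saved copy of a page is a quick way to confirm the lang attribute is really present in the HTML that gets crawled.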

Your code might not be working for the following reasons:

  1. I'm not sure, but I think the processMode/value params are needed to define what's extracted.
  2. Check that the URL is indeed matching (note the double escaping of backslashes), and make sure the URL matches what you are storing.
  3. Check to make sure you're actually calling the metadata scraper (you have to add it to your Jsoup filter chain).
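
Point 2 can be checked outside Funnelback. The Python sketch below is only a sanity check under a couple of assumptions: that the scraper file is parsed as strict JSON, and that urlRegex is applied as a partial match against the stored URL. The helper names are made up for the example.

```python
import json
import re

# Raw strings, so the text is exactly what would sit in the config file.
single_escaped = r'[{"urlRegex": "theguardian\.com/"}]'
double_escaped = r'[{"urlRegex": "theguardian\\.com/"}]'


def load_url_regex(config_text):
    """Return the urlRegex of the first rule, or None if the JSON is invalid."""
    try:
        return json.loads(config_text)[0]["urlRegex"]
    except json.JSONDecodeError:
        # A lone backslash before "." is not a legal JSON escape sequence.
        return None


def url_matches(config_text, url):
    """True if the first rule's urlRegex is found somewhere in the URL."""
    pattern = load_url_regex(config_text)
    # Assumption: the regex is searched for anywhere in the URL.
    return pattern is not None and re.search(pattern, url) is not None


print(load_url_regex(single_escaped))  # None
print(load_url_regex(double_escaped))
print(url_matches(double_escaped, "https://www.theguardian.com/au"))  # True
```

With a single backslash the file does not parse as strict JSON at all, so the rule would never get as far as URL matching.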

Thanks Pete!

We've updated our scraper settings. This is how we're referencing the scraper in the filter.classes setting in collection.cfg:

filter.classes=FAChecker:CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider:com.funnelback.services.filter.MetaDataScraperFilter:com.funnelback.services.filter.TypeFilter

Is this correct?

I get the feeling the scraper is not being properly referenced. Here are our scraper settings:

[{
"urlRegex": "truebluenaturalgas\.org/",
"metadataName": "dc.language",
"elementSelector": ":root",
"attributeName": "lang",
"applyIfNoMatch": false,
"extractionType": "attr",
"processMode": "regex",
"value": "(.+)",
"description": "Get language from document",
"applyIfNoMatch": true,
"value": "en"
}]

It should be applying en to the l:dc.language class metadata, but nothing appears to be coming through.
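
One thing worth ruling out with a rule like this is the repeated keys: "applyIfNoMatch" and "value" each appear twice. The Python sketch below flags duplicate keys in a rule. It is a hypothetical helper, the urlRegex is written with doubled backslashes so the text is valid JSON, and how Funnelback's own parser treats duplicates is not something this shows (Python's json module simply keeps the last value).

```python
import json
from collections import Counter


def duplicate_keys(config_text):
    """Return the keys that appear more than once within any JSON object."""
    dupes = []

    def check_pairs(pairs):
        # object_pairs_hook receives every key/value pair, duplicates included.
        counts = Counter(key for key, _ in pairs)
        dupes.extend(key for key, n in counts.items() if n > 1)
        return dict(pairs)

    json.loads(config_text, object_pairs_hook=check_pairs)
    return dupes


rule_text = r'''
[{
  "urlRegex": "truebluenaturalgas\\.org/",
  "metadataName": "dc.language",
  "elementSelector": ":root",
  "attributeName": "lang",
  "applyIfNoMatch": false,
  "extractionType": "attr",
  "processMode": "regex",
  "value": "(.+)",
  "description": "Get language from document",
  "applyIfNoMatch": true,
  "value": "en"
}]
'''

print(duplicate_keys(rule_text))         # ['applyIfNoMatch', 'value']
print(json.loads(rule_text)[0]["value"])  # en
```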

Hi @ttran -

If you're using Funnelback v15.8+, you'll need to ensure you have a line that looks like the following in collection.cfg (note the addition of MetadataScraper at the end):

filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,MetadataScraper

This assumes that the JSoupProcessingFilterProvider has already been referenced in the filter.classes line in collection.cfg. The default value is:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider

Refer:
https://docs.funnelback.com/15.8/more/extra/metadata-scraper.html
https://docs.funnelback.com/15.8/more/extra/filter_jsoup_classes_collection_cfg.html
https://docs.funnelback.com/15.8/develop/programming-options/custom-filters.html#built-in-filters

11 days later

Thanks GG! Thanks Pete!

The final filter settings I ended up using were:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider:com.funnelback.services.filter.TypeFilter:com.funnelback.services.filter.MetaDataScraperFilter
filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,MetadataScraper

And the metadata_scraper.json ended up being:

[{
"urlRegex": "www\\.truebluenaturalgas\\.org/",
"metadataName": "fb.lang",
"elementSelector": ":root",
"attributeName": "lang",
"applyIfNoMatch": false,
"extractionType": "attr",
"processMode": "regex",
"value": "(.+)",
"description": "Get language from document"
}]

I ran into a few hiccups trying to use a metadata name that was already being indexed, i.e. "dc.language", but using a new, separate fb.lang one did the trick.

So in my metamap.cfg, I had:

fbLang,0,fb.lang

and I reference the HTML lang property using fbLang via a query gscope.