ttran
September 11, 2017, 4:34pm
1
What would be the best way to grab the lang attribute from an HTML document? For example:
<html lang="en">
I’m guessing that metamap.cfg classes don’t pick up on the HTML tag, so l,0,lang wouldn’t work.
I’ve also tried the following metadata scraper settings, pointing the scraper at the HTML tag:
[{
"urlRegex": "http://www\.site\.org",
"metadataName": "dc.language",
"elementSelector": "html",
"applyIfNoMatch": false,
"extractionType": "attr",
"attributeName": "lang"
}]
But no luck with this either.
plevan
September 11, 2017, 10:23pm
2
Hey Tim,
The metadata scraper can definitely be used for this.
The language is correctly extracted from https://www.theguardian.com/au into a field called fb.language using the following metadata scraper configuration:
[{
"urlRegex": "theguardian\\.com/",
"metadataName": "fb.language",
"elementSelector": ":root",
"attributeName": "lang",
"applyIfNoMatch": false,
"extractionType": "attr",
"processMode": "regex",
"value": "(.+)",
"description": "Get language from document"
}]
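To make it easier to see what a rule like this is asking for, here is a minimal Python sketch that mimics it outside Funnelback: take the root element, read its lang attribute, and keep whatever the regex captures. It is not the scraper's actual implementation; RootAttrParser and extract_root_attr are hypothetical names, and only the standard library is used.

```python
import re
from html.parser import HTMLParser


class RootAttrParser(HTMLParser):
    """Records the attributes of the first start tag, i.e. the root element."""

    def __init__(self):
        super().__init__()
        self.root_attrs = None

    def handle_starttag(self, tag, attrs):
        # Only the first start tag (normally <html>) is of interest.
        if self.root_attrs is None:
            self.root_attrs = dict(attrs)


def extract_root_attr(html, attribute_name, value_regex):
    """Return group 1 of value_regex applied to the root attribute, or None."""
    parser = RootAttrParser()
    parser.feed(html)
    raw = (parser.root_attrs or {}).get(attribute_name)
    if raw is None:
        return None
    match = re.search(value_regex, raw)
    return match.group(1) if match else None


page = '<!DOCTYPE html><html lang="en"><head><title>t</title></head><body></body></html>'
print(extract_root_attr(page, "lang", "(.+)"))
```

Running the same function over a saved copy of one of your own pages is a quick way to confirm the lang attribute is really present on the root element before debugging the Funnelback side.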
Your code might not be working for the following reasons:
I’m not sure but I think the processMode/value params are needed to define what’s extracted.
Check that the URL is indeed matching (note the double escaping of backslashes), and make sure the URL matches what you are storing.
Check to make sure you’re actually calling the metadata scraper (you have to add it to your Jsoup filter chain).
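The double-escaping point can be checked outside Funnelback. The sketch below uses Python's json and re modules as stand-ins (Funnelback itself runs on Java, and whether it does a partial or full match on the URL is an assumption here), but the JSON escaping rule is the same in any parser: two backslashes in the file decode to the single backslash the regex needs, while a lone backslash before a dot is not valid JSON. url_matches is a hypothetical helper.

```python
import json
import re

# The rule text exactly as it would sit in the JSON file: two backslashes before the dot.
rule_text = r'[{"urlRegex": "theguardian\\.com/"}]'
pattern = json.loads(rule_text)[0]["urlRegex"]  # decodes to a single backslash


def url_matches(pattern, url):
    """True if the pattern is found anywhere in the URL (partial match assumed)."""
    return re.search(pattern, url) is not None


print(pattern)
print(url_matches(pattern, "https://www.theguardian.com/au"))

# A single backslash before the dot is rejected by a strict JSON parser.
single_backslash_ok = True
try:
    json.loads(r'[{"urlRegex": "theguardian\.com/"}]')
except json.JSONDecodeError as err:
    single_backslash_ok = False
    print("invalid JSON:", err)
```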
ttran
September 13, 2017, 9:12pm
3
Thanks Pete!
We’ve updated our scraper settings. This is how we’re referencing the scraper in the filter.classes setting in collection.cfg:
filter.classes=FAChecker:CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider:com.funnelback.services.filter.MetaDataScraperFilter :com.funnelback.services.filter.TypeFilter
Is this correct?
I get the feeling the scraper is not being properly referenced. Here are our scraper settings:
[{
"urlRegex": "truebluenaturalgas\.org/",
"metadataName": "dc.language",
"elementSelector": ":root",
"attributeName": "lang",
"applyIfNoMatch": false,
"extractionType": "attr",
"processMode": "regex",
"value": "(.+)",
"description": "Get language from document",
"applyIfNoMatch": true,
"value": "en"
}]
It should be applying en to the l:dc.language class metadata, but nothing appears to be coming through.
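One thing worth noticing in the rule above is that applyIfNoMatch and value each appear twice in the same object. How Funnelback's JSON parser treats duplicate keys isn't stated anywhere in this thread; many parsers silently keep the last one. As a rough sketch, Python's json module can flag the duplicates (find_duplicate_keys is a hypothetical helper, and the rule is trimmed to the relevant keys):

```python
import json


def find_duplicate_keys(json_text):
    """Return keys that occur more than once within a single JSON object."""
    duplicates = []

    def record_pairs(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=record_pairs)
    return duplicates


rule_text = """[{
"metadataName": "dc.language",
"applyIfNoMatch": false,
"value": "(.+)",
"applyIfNoMatch": true,
"value": "en"
}]"""

print(find_duplicate_keys(rule_text))
# Python's parser keeps the last occurrence of a repeated key.
print(json.loads(rule_text)[0])
```

If the intent is a fallback of en when nothing matches, it may be clearer to keep one value and one applyIfNoMatch per rule so the behaviour doesn't depend on the parser.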
Hi @ttran -
If you’re using Funnelback v15.8+, you’ll need to ensure you have a line that looks like the following in collection.cfg (note the addition of MetadataScraper at the end):
filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,MetadataScraper
This assumes that the JSoupProcessingFilterProvider has already been referenced in the filter.classes line in collection.cfg. The default value is:
filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
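As a quick sanity check for the two settings above, a small script can read collection.cfg and confirm both pieces are in place. This is only a sketch: parse_cfg and scraper_enabled are hypothetical helpers, the key=value parsing (with # comments skipped) is a simplification of the real file format, and the class names are the ones quoted in this thread.

```python
import re


def parse_cfg(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def scraper_enabled(cfg_text):
    """True if the Jsoup provider is in the filter chain and the scraper is a Jsoup class."""
    settings = parse_cfg(cfg_text)
    # filter.classes separates entries with both ',' and ':'.
    chain = [c.strip() for c in re.split(r"[,:]", settings.get("filter.classes", ""))]
    jsoup = [c.strip() for c in settings.get("filter.jsoup.classes", "").split(",")]
    return "JSoupProcessingFilterProvider" in chain and "MetadataScraper" in jsoup


sample_cfg = """
filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,MetadataScraper
"""

print(scraper_enabled(sample_cfg))
```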
ttran
September 25, 2017, 3:31pm
5
Thanks GG! Thanks Pete!
The final filter settings I ended up using were:
filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider:com.funnelback.services.filter.TypeFilter:com.funnelback.services.filter.MetaDataScraperFilter
filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,MetadataScraper
And the metadata_scraper.json ended up being:
[{
"urlRegex": "www\\.truebluenaturalgas\\.org/",
"metadataName": "fb.lang",
"elementSelector": ":root",
"attributeName": "lang",
"applyIfNoMatch": false,
"extractionType": "attr",
"processMode": "regex",
"value": "(.+)",
"description": "Get language from document"
}]
I ran into a few hiccups trying to use a metadata name that was already being indexed (i.e. “dc.language”), but using a new, separate fb.lang name did the trick.
So in my metamap.cfg I had:
fbLang,0,fb.lang
and I reference the HTML lang property using fbLang via a query gscope.