Thanks GG! Thanks Pete!
The final filter settings I ended up using were:
filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider:com.funnelback.services.filter.TypeFilter:com.funnelback.services.filter.MetaDataScraperFilter
filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,MetadataScraper
And the metadata_scraper.json ended up looking like this:
[{
  "urlRegex": "www\.truebluenaturalgas\.org/",
  "metadataName": "fb.lang",
  "elementSelector": ":root",
  "attributeName": "lang",
  "applyIfNoMatch": false,
  "extractionType": "attr",
  "processMode": "regex",
  "value": "(.+)",
  "description": "Get language from document"
}]
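For anyone following along, here is a rough Python sketch of what that rule is meant to do: on pages whose URL matches urlRegex, take the lang attribute of the root element, run it through the "value" regex, and store the captured group under fb.lang. This is only an illustration, not Funnelback's actual implementation. The scrape function and RootAttrParser class are names I made up, and I'm assuming the URL regex is a "find anywhere" match rather than a full match.

```python
import re
from html.parser import HTMLParser

# Hypothetical mirror of the rule above, for illustration only.
RULE = {
    "urlRegex": r"www\.truebluenaturalgas\.org/",
    "metadataName": "fb.lang",
    "attributeName": "lang",
    "value": r"(.+)",
}


class RootAttrParser(HTMLParser):
    """Records the attributes of the first start tag, i.e. <html> on a normal page."""

    def __init__(self):
        super().__init__()
        self.root_attrs = None

    def handle_starttag(self, tag, attrs):
        # Only the first element seen is treated as the root (what ":root" selects).
        if self.root_attrs is None:
            self.root_attrs = dict(attrs)


def scrape(url, html, rule=RULE):
    """Return (metadataName, value) for a matching page, or None if nothing applies."""
    # Assumption: the URL regex may match anywhere in the URL.
    if not re.search(rule["urlRegex"], url):
        return None
    parser = RootAttrParser()
    parser.feed(html)
    raw = (parser.root_attrs or {}).get(rule["attributeName"])
    if raw is None:
        return None
    match = re.search(rule["value"], raw)
    if match is None:
        return None
    return rule["metadataName"], match.group(1)


if __name__ == "__main__":
    page = '<!DOCTYPE html><html lang="en-AU"><head></head><body></body></html>'
    print(scrape("https://www.truebluenaturalgas.org/about", page))
    # prints ('fb.lang', 'en-AU')
```

A page without a lang attribute, or one outside the matching domain, gives None here, which lines up with "applyIfNoMatch": false in the rule.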
I ran into a few hiccups trying to use a metadata name that was already being indexed ("dc.language"), but switching to a new, separate name, fb.lang, did the trick.
So in my metamap.cfg, I had:
fbLang,0,fb.lang
I then reference the HTML lang attribute using fbLang via a query gscope.