Assign to metaclass based on a html's id attribute

rbilney 2016-03-09 14:51:02 UTC #1

Hi,

I have created a web collection that indexes another site's news stories, but their pages only really have one <meta> tag, for the description. Can I use Metamap.cfg to assign the content of an HTML tag to a metadata class based on that tag's id attribute?

HTML:

<title>News headline</title>

<h1>News headline</h1>

<div id="article_date">Published Date</div>

Metamap.cfg:

t,1,title

c,1,description

d,1,<div@article_date>

I,1,<img@article_image>

Thanks,

Robin

***Edit: should be Assign in title not Assig***

Dani 2016-03-21 10:14:11 UTC #2

Hi Robin, what version of Funnelback are you using?

rbilney 2016-03-21 10:17:12 UTC #3

Hi Dani,

We are on 13.2.0

Dani 2016-03-22 14:29:36 UTC #4

Hello Robin,

From the documentation (https://docs.funnelback.com/13.2/metamap_cfg.html) it does indeed seem logical that one would be able to do:

d,1,<div#article_date>
I,1,<img#article_image>

however, I don't think this is possible. I will look into this further and see what I can find.

Dani 2016-03-23 09:45:23 UTC #5

Hi Robin,

To get this working please do the following:

1) Save the MetaDataScraperFilter.groovy file which I've pasted in this Gist.

2) Place that file in $SEARCH_HOME/lib/java/groovy/com/funnelback/services/filter/

3) In your collection.cfg file, reference the filter by changing the filter.classes to

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:com.funnelback.services.filter.MetaDataScraperFilter:TextMiner

4) Next, create a file called filter.metadata-scraper.cfg under the collection's configuration folder (i.e. $SEARCH_HOME/conf/$COLLECTION_NAME/filter.metadata-scraper.cfg)

5) You will then need to populate the comma-delimited file with rules in the following format:

<url>,<meta name>,<selector>,<extraction type>,<attribute name>,<meta value type>,<value>

where

<url>: A regex matching the URLs that the rule applies to.

<meta name>: The name of the new metadata field.

<selector>: The CSS-style selector identifying the element(s) to extract content from.

<extraction type>: One of text, attr or html.
- text - Instructs the script to extract all content between the tags, e.g. <div>text to be extracted</div>. Note: if text is selected, you will need to leave <attribute name> blank.
- attr - Instructs the script to extract the contents of an attribute, e.g. <a href="text to be extracted">
- html - Instructs the script to extract the HTML contents of the tag. This is useful for assigning full HTML markup to metadata. Please note that all single and double quotes will be replaced with their HTML entity equivalents.

<attribute name>: Required if attr is selected as the extraction type. It determines which attribute the script will read the content from.

<meta value type>: Either regex or constant.
- regex: Specifies that <value> is a regular expression, and that the extracted content is all of its capturing groups joined together. E.g. given the text "I am human and canine" and the regex "I am (human) and (canine)", the extracted value will be humancanine. You can also ignore specific groups by using the non-capturing syntax "?:". E.g. given the text "I am human and canine" and the regex "I am (?:human) and (canine)", the extracted value will be canine.
- constant: A hardcoded value

<value>: either a regex or constant depending on what is specified for <meta value type>
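
If you want to check a regex before putting it in the config, here is a rough Python sketch of the group-joining behaviour described above. It is only an illustration: the filter itself is Groovy, so it uses Java regex syntax, and extract_groups is a made-up helper name rather than anything in the filter. For simple patterns like these the two regex flavours should behave the same.

```python
import re


def extract_groups(pattern: str, text: str) -> str:
    """Join every capturing group of the first match (hypothetical helper).

    Returns an empty string when the pattern does not match at all.
    Non-capturing groups, written (?:...), never appear in match.groups().
    """
    match = re.search(pattern, text)
    if match is None:
        return ""
    # Optional groups that did not take part in the match come back as None.
    return "".join(group for group in match.groups() if group is not None)


# The two examples from the description above.
print(extract_groups(r"I am (human) and (canine)", "I am human and canine"))    # humancanine
print(extract_groups(r"I am (?:human) and (canine)", "I am human and canine"))  # canine

# The price rule from the sample config strips the leading dollar sign.
print(extract_groups(r"\$(.+)", "$19.99"))  # 19.99
```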

Example filter.metadata-scraper.cfg

#Super cheap auto scraper configurations

supercheapauto\.com\.au/online-store/products,fb.name,h1[itemprop=name],tExt,,regex,(.+)

supercheapauto\.com\.au/online-store/products,fb.price,span[itemprop=price],Text,,regex,\$(.+)

supercheapauto\.com\.au/online-store/products,fb.product.id,div[itemprop=productID],text,,regex,(.+)

supercheapauto\.com\.au/online-store/products,fb.img,img[itemprop=image ad aa sda],attr,src,regex,(.+)

supercheapauto\.com\.au/online-store/products,fb.description,ul[itemprop=description] li,Text,,regex,(.+)

supercheapauto\.com\.au/online-store/products,fb.category1,div#breadcrumbContainer a.breadcrumbNoLink,Text,,regex,(.+)

supercheapauto\.com\.au/online-store/products,fb.category2,"#breadcrumbContainer .breadcrumb:not(.breadcrumbNoLink, .breadcrumbHome)",Text,,regex,(.+)

supercheapauto\.com\.au/online-store/products,fb.type,meta[name=fb.price],Text,,constant,product

Produces:

The rules specified in filter.metadata-scraper.cfg are executed in the order that they are specified. This makes it possible for rules to refer to any meta data that was added by a preceding rule.
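
To sanity-check a rules file before running an update, something like the Python sketch below can split each line into the seven fields described above, in file order. This is not part of the filter. The ScraperRule and parse_rules names, the example\.com URL pattern and the article.date / article.image meta names are all made up for illustration, and treating lines that start with # as comments is an assumption based on the sample config.

```python
import csv
from typing import List, NamedTuple


class ScraperRule(NamedTuple):
    """One rule line, with fields in the order they appear in the file."""
    url: str
    meta_name: str
    selector: str
    extraction_type: str
    attribute_name: str
    value_type: str
    value: str


def parse_rules(text: str) -> List[ScraperRule]:
    """Parse rule lines in file order, skipping blank and '#' lines (assumed comments)."""
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    rules = []
    # csv.reader copes with selectors wrapped in double quotes because they contain commas.
    for row in csv.reader(lines):
        if len(row) != 7:
            raise ValueError(f"expected 7 fields, got {len(row)}: {row!r}")
        rules.append(ScraperRule(*row))
    return rules


# Hypothetical rules for the two elements in the original question.
SAMPLE = r"""
# news article rules (hypothetical URL pattern and meta names)
example\.com/news,article.date,div#article_date,text,,regex,(.+)
example\.com/news,article.image,img#article_image,attr,src,regex,(.+)
"""

for rule in parse_rules(SAMPLE):
    print(rule.meta_name, "<-", rule.selector, rule.extraction_type, rule.attribute_name)
```

The new meta names would still need to be mapped to metadata classes in metamap.cfg in the usual way.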

Hope this helps!