Assign to a metadata class based on an HTML element's id attribute

Hi,

 

I have created a web collection that indexes another site's news stories. However, the pages only really have one <meta> tag, for the description. Can I use Metamap.cfg to assign content to a metadata class based on an HTML tag's id attribute?

 

HTML:

 

<title>News headline</title>

<meta name="description" content="Description of the news story">

 

<h1>News headline</h1>

<div id="article_date">Published Date</div>

<img id="article_image" />

 

 

Metamap.cfg:

 

t,1,title

c,1,description

d,1,<div@article_date>

I,1,<img@article_image>

 

Thanks,

Robin

 

***Edit: should be Assign in title not Assig***

Hi Robin, what version of Funnelback are you using?

Hi Dani,

 

We are on 13.2.0

Hello Robin,

 

From the documentation (https://docs.funnelback.com/13.2/metamap_cfg.html) it does seem logical that one would be able to do:

d,1,<div#article_date>
I,1,<img#article_image>

However, I don't think this is possible. I will look into this further and see what I can find.

Hi Robin,

 

To get this working please do the following:

 

1) Save the MetaDataScraperFilter.groovy file which I've pasted in this Gist.

2) Place that file in $SEARCH_HOME/lib/java/groovy/com/funnelback/services/filter/

3) In your collection.cfg file, reference the filter by setting filter.classes to:

 

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:com.funnelback.services.filter.MetaDataScraperFilter:TextMiner

 

4) Create a file called filter.metadata-scraper.cfg in the collection's configuration folder (i.e. $SEARCH_HOME/conf/$COLLECTION_NAME/filter.metadata-scraper.cfg).

5) Populate this comma-delimited file with rules in the following format:

 

<url>,<meta name>,<selector>,<extraction type>,<attribute name>,<meta value type>,<value>
 

where

  • <url>: The URL pattern to apply the rule to, as a regex.
  • <meta name>: The name of the new meta data.
  • <selector>: The CSS-style selector identifying the element(s) to extract content from.
  • <extraction type>: One of text, attr or html.
    • text - Instructs the script to extract all content between the tags, e.g. <div>text to be extracted</div>. Note: if text is selected, you will need to leave <attribute name> blank.
    • attr - Instructs the script to extract the contents of an attribute, e.g. <a href="text to be extracted">.
    • html - Instructs the script to extract the HTML contents of the tag. This is useful for assigning full HTML markup to meta data. Please note that all single and double quotes will be replaced with &#39; and &#34; respectively.
  • <attribute name>: Required if attr is selected as the extraction type. It determines which attribute the script reads in order to extract the content.
  • <meta value type>: Either regex or constant.
    • regex: Specifies that <value> is a regular expression, and that the extracted content is the concatenation of all capturing groups. E.g. given the text "I am human and canine" and the regex "I am (human) and (canine)", the extracted value will be humancanine. You can also ignore specific groups by using the non-capturing syntax "?:". E.g. given the text "I am human and canine" and the regex "I am (?:human) and (canine)", the extracted value will be canine.
    • constant: A hardcoded value.
  • <value>: Either a regex or a constant, depending on what is specified for <meta value type>.
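
To make the regex behaviour concrete, here is a small Python sketch of the group concatenation described above. It is only an illustration of the rule, not the filter's actual Groovy code: the function name extract_value is made up, and using a search (rather than a full match) against the first occurrence is an assumption.

```python
import re


def extract_value(text, pattern):
    """Join all capturing groups of the first match (hypothetical helper).

    Assumption: the filter concatenates groups in order and ignores
    non-capturing (?:...) groups, as the description above states.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    # groups() only contains capturing groups; unmatched optional groups are None.
    return "".join(group for group in match.groups() if group is not None)


print(extract_value("I am human and canine", r"I am (human) and (canine)"))    # humancanine
print(extract_value("I am human and canine", r"I am (?:human) and (canine)"))  # canine
print(extract_value("$49.99", r"\$(.+)"))                                      # 49.99
```

The last call mirrors the price rule style of \$(.+), where the dollar sign sits outside the group and is therefore dropped from the extracted value.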

 

Example filter.metadata-scraper.cfg

 

#Super cheap auto scraper configurations
supercheapauto\.com\.au/online-store/products,fb.name,h1[itemprop=name],tExt,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.price,span[itemprop=price],Text,,regex,\$(.+)
supercheapauto\.com\.au/online-store/products,fb.product.id,div[itemprop=productID],text,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.img,img[itemprop=image ad aa sda],attr,src,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.description,ul[itemprop=description] li,Text,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.category1,div#breadcrumbContainer a.breadcrumbNoLink,Text,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.category2,"#breadcrumbContainer .breadcrumb:not(.breadcrumbNoLink, .breadcrumbHome)",Text,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.type,meta[name=fb.price],Text,,constant,product
 
Produces:
<meta name="fb.name" content="SCA Trolley Jack - Hydraulic, 1400kg" />
<meta name="fb.price" content="49.99" />
<meta name="fb.product.id" content="PLU 215072" />
<meta name="fb.description" content="1400kg Working load limit|Height range: 140 - 335mm|Meets AS/NZS 2615:2004 standard" />
<meta name="fb.type" content="product" />

 

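The rules in filter.metadata-scraper.cfg are executed in the order in which they are specified, so a rule can refer to any meta data that was added by a preceding rule. In the example above, the fb.type rule selects meta[name=fb.price], which only exists because the earlier fb.price rule added it.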
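
For the news pages in the original question, the rules might look something like the ones embedded in the Python sketch below. Treat all of it as assumptions: the URL pattern example\.com/news, the meta names article.date, article.image and article.related, and the #related selector are placeholders, and the parser is just a way to check that each line splits into the seven fields in file order. It is not how the filter itself reads the file, and skipping lines that start with # is inferred from the comment line in the example config.

```python
import csv

# Field names follow the rule format given earlier in this thread.
FIELDS = ["url", "meta_name", "selector", "extraction_type",
          "attribute_name", "value_type", "value"]

# Hypothetical rules; URL pattern, meta names and the third selector are placeholders.
RULES = r'''
# News article scraper configuration
example\.com/news,article.date,div#article_date,text,,regex,(.+)
example\.com/news,article.image,img#article_image,attr,src,regex,(.+)
example\.com/news,article.related,"#related a:not(.ad, .promo)",text,,regex,(.+)
'''


def parse_rules(text):
    """Return the rules as a list of dicts, preserving file order."""
    lines = [line for line in text.splitlines()
             if line.strip() and not line.lstrip().startswith("#")]
    rules = []
    for row in csv.reader(lines):
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}: {row}")
        rules.append(dict(zip(FIELDS, row)))
    return rules


for rule in parse_rules(RULES):
    print(rule["meta_name"], "<-", rule["selector"])
```

Note that a selector containing a comma has to be wrapped in double quotes, as in the fb.category2 line of the example config, otherwise the line splits into too many fields.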

 

Hope this helps!