Hi Robin,
To get this working, please do the following:
1) Save the MetaDataScraperFilter.groovy file which I've pasted in this Gist.
2) Place that file in $SEARCH_HOME/lib/java/groovy/com/funnelback/services/filter/
3) In your collection.cfg file, reference the filter by changing filter.classes to:
filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:com.funnelback.services.filter.MetaDataScraperFilter:TextMiner
4) Create a file called filter.metadata-scraper.cfg under the collection root folder (i.e. $SEARCH_HOME/conf/$COLLECTION_NAME/filter.metadata-scraper.cfg)
5) Populate that comma-delimited file with rules in the following format:
<url>,<meta name>,<selector>,<extraction type>,<attribute name>,<meta value type>,<value>
where
- <url>: The URL pattern to apply the rule to, as a regex.
- <meta name>: The name of the new meta data.
- <selector>: The CSS-style selector identifying the element(s) from which to obtain the contents.
- <extraction type>: One of text, attr or html.
  - text: Instructs the script to extract all content between the tags, e.g. <div>text to be extracted</div>. Note: if text is selected, you will need to leave <attribute name> blank.
  - attr: Instructs the script to extract the contents of an attribute, e.g. <a href="text to be extracted">.
  - html: Instructs the script to extract the HTML contents of the tag. This is useful for assigning full HTML markup to meta data. Please note that all single and double quotes will be replaced with their HTML entity equivalents.
- <attribute name>: Required if attr is selected as the extraction type. It determines which attribute the script will read in order to extract the content.
- <meta value type>: Either regex or constant.
  - regex: Specifies that <value> is a regular expression, and that the extracted content is the concatenation of all capturing groups. E.g. given the text "I am human and canine" and the regex "I am (human) and (canine)", the extracted value will be humancanine. You can ignore specific groups by using the non-capturing syntax "?:", e.g. given the text "I am human and canine" and the regex "I am (?:human) and (canine)", the extracted value will be canine.
  - constant: A hardcoded value.
- <value>: Either a regex or a constant, depending on what is specified for <meta value type>.
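To make the regex behaviour concrete, here is a small Python sketch of the "join all capturing groups" idea described above. It is only an illustration: the real filter is written in Groovy, and the function name extract_value, the use of a "first match anywhere in the text" search, and the handling of unmatched groups are my assumptions rather than something taken from the filter source.

```python
import re
from typing import Optional


def extract_value(pattern: str, text: str) -> Optional[str]:
    """Hypothetical helper: join every capturing group of the first match."""
    match = re.search(pattern, text)
    if match is None:
        return None  # the rule simply produces no value
    # groups() only contains capturing groups, so (?:...) groups are skipped;
    # optional groups that did not take part in the match come back as None.
    return "".join(group for group in match.groups() if group is not None)


print(extract_value(r"I am (human) and (canine)", "I am human and canine"))    # humancanine
print(extract_value(r"I am (?:human) and (canine)", "I am human and canine"))  # canine
print(extract_value(r"\$(.+)", "$49.99"))                                      # 49.99
```

The last call mirrors the fb.price rule in the example configuration, where the leading dollar sign sits outside the group and is therefore dropped from the value.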
Example filter.metadata-scraper.cfg
#Super cheap auto scraper configurations
supercheapauto\.com\.au/online-store/products,fb.name,h1[itemprop=name],tExt,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.price,span[itemprop=price],Text,,regex,\$(.+)
supercheapauto\.com\.au/online-store/products,fb.product.id,div[itemprop=productID],text,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.img,img[itemprop=image ad aa sda],attr,src,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.description,ul[itemprop=description] li,Text,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.category1,div#breadcrumbContainer a.breadcrumbNoLink,Text,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.category2,"#breadcrumbContainer .breadcrumb:not(.breadcrumbNoLink, .breadcrumbHome)",Text,,regex,(.+)
supercheapauto\.com\.au/online-store/products,fb.type,meta[name=fb.price],Text,,constant,product
Produces:
<meta name="fb.name" content="SCA Trolley Jack - Hydraulic, 1400kg" />
<meta name="fb.price" content="49.99" />
<meta name="fb.product.id" content="PLU 215072" />
<meta name="fb.description" content="1400kg Working load limit|Height range: 140 - 335mm|Meets AS/NZS 2615:2004 standard" />
<meta name="fb.type" content="product" />
The rules in filter.metadata-scraper.cfg are executed in the order in which they are listed. This makes it possible for a rule to refer to any meta data that was added by a preceding rule, as the fb.type rule above does by selecting meta[name=fb.price].
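If you want to sanity-check a rules file before running an update, something along these lines may help. It is a rough Python sketch, not part of the filter: the Rule and parse_rules names are made up for illustration, and I am assuming that lines starting with # are comments, that a field containing a comma is wrapped in double quotes (as in the fb.category2 rule), and that the extraction type and value type are case-insensitive (the example mixes tExt, Text and text).

```python
import csv
from typing import Iterable, List, NamedTuple


class Rule(NamedTuple):
    """One line of filter.metadata-scraper.cfg (field names are illustrative)."""
    url: str
    meta_name: str
    selector: str
    extraction_type: str
    attribute_name: str
    value_type: str
    value: str


def parse_rules(lines: Iterable[str]) -> List[Rule]:
    """Parse rule lines in file order, skipping blank lines and # comments."""
    content = [ln for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]
    rules = []
    # csv.reader keeps a quoted field such as "a, b" together as one value.
    for row in csv.reader(content):
        if len(row) != 7:
            raise ValueError(f"expected 7 fields, got {len(row)}: {row!r}")
        rule = Rule(*row)
        # Assumption: type keywords are matched case-insensitively.
        rules.append(rule._replace(extraction_type=rule.extraction_type.lower(),
                                   value_type=rule.value_type.lower()))
    return rules


sample = [
    "#Super cheap auto scraper configurations",
    r"supercheapauto\.com\.au/online-store/products,fb.price,span[itemprop=price],Text,,regex,\$(.+)",
    r'supercheapauto\.com\.au/online-store/products,fb.category2,"#breadcrumbContainer .breadcrumb:not(.breadcrumbNoLink, .breadcrumbHome)",Text,,regex,(.+)',
]
for rule in parse_rules(sample):
    print(rule.meta_name, "|", rule.selector, "|", rule.extraction_type)
```

The returned list keeps the rules in the same order as the file, which matters because later rules can depend on earlier ones. A line that does not split into exactly seven fields raises an error, which usually points at an unquoted comma in a selector or regex.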
Hope this helps!