Prevent Facebook comments from being indexed

Hi. We have what I’d call a standard Funnelback collection set up for a Facebook page. The issue we’re having is that user comments are being indexed for each post on the Facebook page. Ideally, only the contents of each post would be indexed, and not the user comments, names of commenters, etc. How can we prevent these from being indexed, given that we can’t control the Facebook HTML structure and so can’t add the noindex tags ourselves?

Hi @bewilderbeest -

The scenario you’re describing (selectively indexing regions from external web resources) isn’t unique to Facebook content - the built-in InjectNoIndexFilterProvider should assist you here:

Refer:

A full regather/refilter will be required once you’ve got your URL RegEx and CSS selector correct.
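Before kicking off the regather, it can be worth sanity-checking the settings offline. The following is only a rough sketch, not a Funnelback tool: it assumes each filter.noindex.N value is a URL regex, then whitespace, then a CSS selector list, and that InjectNoIndexFilterProvider has to appear in the filter.classes chain to run at all. Both assumptions should be confirmed against the documentation for your version. The helper names are made up for illustration, and Python's regex dialect is not identical to Java's, so a pattern that compiles here is only a hint.

```python
import re


def check_noindex_rule(value):
    """Return a list of problems found in one filter.noindex.N value.

    Assumed format (not confirmed here): '<url regex> <css selector>'.
    """
    parts = value.strip().split(None, 1)
    if len(parts) != 2:
        return ["expected '<url regex> <css selector>', got %r" % value]
    url_regex, _selector = parts
    problems = []
    try:
        # Python's regex dialect differs from Java's; this is a rough check only.
        re.compile(url_regex)
    except re.error as exc:
        problems.append("URL regex does not compile: %s" % exc)
    return problems


def provider_listed(filter_classes, provider="InjectNoIndexFilterProvider"):
    """True if the provider name appears in a filter.classes chain.

    Assumes the chain is separated by ',' and ':' characters.
    """
    return provider in re.split(r"[,:]", filter_classes)


if __name__ == "__main__":
    for rule in (".* script,form", "div.UFIComment"):
        print(rule, "->", check_noindex_rule(rule) or "looks OK")
```

A value with no whitespace in it, such as a bare CSS selector, is reported as incomplete under these assumptions, and provider_listed returns False for any chain that does not name the provider.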

Thanks @gordongrace for the info.

This is what I’ve added to the collection.cfg file:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
filter.noindex.1=.* script,form
filter.noindex.2=div.UFIComment

I’ve updated the collection but am still having problems with comments being included in the search results.

Is there anything I’ve missed or do you have any documentation specific to Facebook collections?

If this is just a web collection (standard seed URLs, include/exclude patterns), then the InjectNoIndexFilters will be run against that gathered HTML content. You can view the cached copies of filtered content to confirm whether (and where) the NOINDEX tags are being injected.
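If you save a cached copy to disk, a quick script can tell you whether the comment markup is actually ending up inside the injected regions. This is a minimal sketch, assuming the injected markers are the `<!--noindex-->` and `<!--endnoindex-->` comment pair; the function names and the sample HTML are hypothetical, and the regex approach does not handle nested regions.

```python
import re

# Assumed marker pair: <!--noindex--> ... <!--endnoindex-->
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*endnoindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def visible_to_indexer(html):
    """Return the HTML with every noindex...endnoindex region removed."""
    return NOINDEX_BLOCK.sub("", html)


def leaks(html, marker):
    """True if the marker text still appears outside all noindex regions."""
    return marker in visible_to_indexer(html)


if __name__ == "__main__":
    # Hypothetical stand-in for a cached copy of a filtered page.
    sample = (
        '<div class="post">Post text</div>'
        '<!--noindex--><div class="UFIComment">A comment</div><!--endnoindex-->'
    )
    print(leaks(sample, "UFIComment"))  # False: the comment is wrapped
    print(leaks(sample, "Post text"))   # True: the post body is still indexable
```

If leaks returns True for your comment class on a real cached copy, the filter either is not running or the selector is not matching the gathered HTML.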

If you’re using a Facebook Collection, there are additional options, controlled by the xml.cfg file, that indicate which of the fields gathered from the Facebook API are to be indexed.

See also: