Jun 2020

Hi,

I'm currently specifying a build and I'm wondering whether it is possible to crawl an RSS feed. I've previously crawled XML, and I just want to make sure the same method can be used, since RSS is a similar format to XML.

Thanks,
Michael

Hi Michael,

RSS is a dialect of XML, so Funnelback treats it the same as any other XML. You can follow the same process you used for XML, using the 'XML Processing' and 'Configure Metadata Mappings' options in the Admin Interface.
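
To illustrate why the XML approach carries over, here is a minimal Python sketch, independent of Funnelback, that reads an RSS 2.0 document with a generic XML parser. The sample feed, its URLs, and the `items_from_rss` helper are all made up for illustration; the actual Funnelback setup is done through the Admin Interface options mentioned above and is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, used only for illustration.
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <description>Made-up feed</description>
    <item>
      <title>First post</title>
      <link>https://example.com/post-1</link>
      <description>Summary of the first post</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/post-2</link>
      <description>Summary of the second post</description>
    </item>
  </channel>
</rss>"""


def items_from_rss(xml_text):
    """Return one dict per <item>, keyed by child element name."""
    root = ET.fromstring(xml_text)  # a plain XML parser is enough for RSS
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("item")
    ]


items = items_from_rss(RSS_SAMPLE)
for entry in items:
    print(entry["title"], entry["link"])
```

Each `<item>` becomes one record whose fields come from its child elements, which is roughly the shape you would expect when treating each feed entry as its own XML document with mapped metadata.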

In addition to the previous comments:

  • Funnelback often detects RSS feeds as HTML because the web server delivers them with incorrect headers. If this is the case, you will need to make sure your web server delivers the correct Content-Type header.
  • If you want links to be extracted from the RSS feed and followed, you'll need to ensure that crawler.parser.mimeTypes includes the appropriate MIME types for your RSS, and update the crawler.link_extraction_regular_expression and crawler.link_extraction_group settings so that they also identify URLs stored in the RSS (in RSS 2.0, usually in a <link> element). However, I'm not sure how easy it will be to write a link regex that matches both standard links and the ones in an RSS feed.
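
The two points above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Funnelback's own behaviour: the set of XML media types, the helper names, and the regular expression are all hypothetical, the pattern is written for Python's `re` module rather than tested against the crawler, and it assumes RSS 2.0 item URLs appear as plain text inside `<link>` elements. The idea is to keep a single capturing group so that one group number works for both HTML and RSS links.

```python
import re

# Assumption: media types a server might reasonably send for a feed.
XML_FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
    "text/xml",
}


def is_xml_content_type(header_value):
    """True if a Content-Type header value names an XML feed type."""
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in XML_FEED_TYPES


# Hypothetical pattern: the non-capturing prefix matches either an HTML
# href="..." attribute or an RSS <link> opening tag; group 1 is the URL.
LINK_PATTERN = re.compile(
    r"""(?:href\s*=\s*["']|<link>\s*)([^"'<>\s]+)""",
    re.IGNORECASE,
)
LINK_GROUP = 1


def extract_links(text):
    """Return every URL found by the combined HTML/RSS pattern."""
    return [m.group(LINK_GROUP) for m in LINK_PATTERN.finditer(text)]


html_sample = '<a href="https://example.com/page.html">Page</a>'
rss_sample = (
    "<item><title>Post</title>"
    "<link>https://example.com/post-1</link></item>"
)

print(is_xml_content_type("text/html; charset=UTF-8"))
print(is_xml_content_type("application/rss+xml; charset=utf-8"))
print(extract_links(html_sample))
print(extract_links(rss_sample))
```

A feed served as `text/html` fails the first check, which is the symptom described in the first bullet. The pattern does not handle URLs wrapped in CDATA sections, so treat it as a starting point to adapt and test rather than a drop-in value for the crawler settings.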