Crawling paginated JSON files

This is actually possible with the original setup. The key points are:

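  • Link extraction runs before the filter step, so the regular expression has to match the raw JSON, not any filtered output
  • Allow for whitespace around the colon; pretty-printed JSON often has spaces or newlines there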

The settings that worked for me in a local test (mimicking your JSON layout):

crawler.link_extraction_group=1
crawler.link_extraction_regular_expression="next"\s*:\s*"(.*?)"
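
To check the pattern outside the crawler, here is a minimal Python sketch. The sample pages and the example.com URLs are made up for illustration, and the helper name extract_next_link is hypothetical. Python's regex engine is not necessarily the one the crawler uses, but the pattern only relies on common syntax, so it should behave the same way.

```python
import re

# Same pattern as crawler.link_extraction_regular_expression above;
# group 1 (crawler.link_extraction_group=1) holds the URL.
NEXT_LINK = re.compile(r'"next"\s*:\s*"(.*?)"')

def extract_next_link(raw_json):
    """Return the URL captured by group 1, or None if no quoted "next" value exists."""
    match = NEXT_LINK.search(raw_json)
    return match.group(1) if match else None

# Hypothetical pages: compact, pretty-printed, and a last page without a link.
compact = '{"items":[1,2],"next":"https://example.com/api?page=2"}'
pretty = '{\n  "items": [1, 2],\n  "next" : "https://example.com/api?page=2"\n}'
last = '{"items": [3], "next": null}'

for page in (compact, pretty, last):
    print(extract_next_link(page))
```

The \s* on both sides of the colon is what lets the compact and the pretty-printed page match alike. A page whose "next" value is null yields no link, so the crawl stops there. One caveat: the pattern captures the text as it appears in the file, so a URL written with JSON escapes (for example \/) would be captured with the backslashes intact.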