I currently have a web collection that crawls an XML file of records which are then split into individual results using XML processing and Metadata mapping.
This works for all the content that is included in the XML.
However, each XML record contains a link to a PDF file. That link is used as the URL of the linked result, but the content of the PDF itself is not currently indexed.
Is there a way for Funnelback to crawl each PDF and match it to the XML record it is linked from, so that the contents of the PDFs are also searchable and return the matching XML record as the result?
The documentation I found refers to inner HTML or inner XML files but doesn't mention inner PDF files.
Example XML file below with linked PDF files:
<records>
  <record>
    <name>Name of document</name>
    <pdf>http://www.domain.com/path/to/file.pdf</pdf>
    <access>public</access>
  </record>
  <record>
    <name>Name of another document</name>
    <pdf>http://www.domain.com/path/to/file2.pdf</pdf>
    <access>private</access>
  </record>
</records>
The best solution would probably be to remove the XML file from your start URLs list and write a pre-gather workflow script that fetches the XML file and generates two things:
- a seed list containing all the PDF URLs in the file, saved as collection.cfg.start.urls in the collection's conf folder
- an external metadata file that maps each PDF URL to the metadata from its XML record, so each crawled PDF result carries the record's fields
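As a rough sketch of what that pre-gather script could look like (Python here; the source URL, conf path, and the exact external metadata line format are assumptions to verify against your Funnelback version's documentation):

```python
import xml.etree.ElementTree as ET

def build_gather_config(xml_text):
    """Parse the records feed; return (seed URLs, external metadata lines)."""
    root = ET.fromstring(xml_text)
    seeds, meta = [], []
    for record in root.findall("record"):
        pdf = record.findtext("pdf")
        if pdf:
            seeds.append(pdf)
            # Attach the record's metadata to the PDF URL so the crawled
            # PDF result carries the same fields as the XML record.
            name = record.findtext("name", default="")
            meta.append(f'{pdf} name:"{name}"')
    return seeds, meta

if __name__ == "__main__":
    import urllib.request
    # Hypothetical feed URL and conf folder -- replace with your own.
    xml_text = urllib.request.urlopen("http://www.domain.com/records.xml").read()
    seeds, meta = build_gather_config(xml_text)
    conf = "/opt/funnelback/conf/my-collection"
    with open(f"{conf}/collection.cfg.start.urls", "w") as f:
        f.write("\n".join(seeds) + "\n")
    with open(f"{conf}/external_metadata.cfg", "w") as f:
        f.write("\n".join(meta) + "\n")
```

The parsing is pulled out into a function so the same logic can be reused or extended later (for example, to copy additional record fields into the metadata lines).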
The XML file is dynamically generated from the documents themselves, which are managed in a SharePoint site.
To make matters more complex, not all of the PDF files are publicly accessible, so some of them we would not want crawled; for those, only the metadata in the private-tagged XML records should be searchable. For public PDF files, both the metadata and the associated file should be crawled and searchable.
Are there any alternative solutions you would recommend that can make use of the existing XML structure and metadata?
I think you can probably still use a similar solution.
I would probably use a pre-gather workflow script to fetch and process the XML, applying different logic to public and private documents.
The workflow would implement the logic from my previous post (generating a seed list and external metadata) for the items flagged as public.
For the private items it would generate an XML file (or individual XML files) that just contain the private records.
You would then crawl the PDFs in the seed list and index everything together: the private files via the XML records, which contain only metadata, and the public files via the crawled PDFs plus the metadata attached using external metadata.
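A sketch of that split logic (Python again; the <access> values follow your example XML, and the external metadata line format is an assumption to check against your Funnelback documentation):

```python
import xml.etree.ElementTree as ET

def split_records(xml_text):
    """Partition records by <access>.

    Public records yield PDF seed URLs plus external metadata lines;
    private records are collected into a metadata-only XML document
    that gets indexed instead of the (uncrawlable) PDF.
    """
    root = ET.fromstring(xml_text)
    seeds, meta = [], []
    private_root = ET.Element("records")
    for record in root.findall("record"):
        if record.findtext("access") == "public":
            pdf = record.findtext("pdf")
            seeds.append(pdf)
            meta.append(f'{pdf} name:"{record.findtext("name")}"')
        else:
            # Keep the whole record, so its metadata is still searchable.
            private_root.append(record)
    return seeds, meta, ET.tostring(private_root, encoding="unicode")
```

The returned private XML can be written out as the document the collection indexes for the private items, while the seeds and metadata lines feed the public crawl as before.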
It may be easier to come up with a solution that uses separate collections for private and public documents: a private collection that generates the seed list and external metadata for the public collection (and also discards the public XML items), and a public collection that crawls the PDFs. You could then chain these collections via workflow: updating the private collection generates the config for the public collection, and a post-update workflow step starts the public collection's update.
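For the chaining step, a workflow hook in the private collection's configuration could start the public collection's update. The key name below follows classic Funnelback collection.cfg workflow options, but verify it against your version's documentation, and the script path is purely illustrative:

```
# Hypothetical entry in the private collection's collection.cfg.
# start-public-update.sh would call your installation's update command
# for the public collection once the private update completes.
post_update_command=/opt/funnelback/conf/private-collection/start-public-update.sh
```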