I think you can probably still use a similar solution.
I would probably use a pre-gather workflow script to fetch and process the XML, applying different logic to public and private documents.
The workflow would implement the logic from my previous post (generating a seed list and external metadata) for the items flagged as public.
For the private items, it would generate an XML file (or individual XML files) containing only the private records.
You would then crawl the PDFs in the seed list and index everything together: the private files via their XML records, which contain only metadata, and the public files via the crawled PDFs plus the metadata attached using external metadata.
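As a rough illustration of the splitting step in the pre-gather script, here is a minimal Python sketch. It is only a sketch under assumptions, not a tested workflow script: the XML layout (a root element holding record elements, each with url and access children) is hypothetical, as is the function name split_records, and the external metadata line format (the URL followed by field:"value" pairs) should be checked against the documentation for your version. Records without an access flag are treated as private so nothing is exposed by accident.

```python
import xml.etree.ElementTree as ET


def split_records(xml_text):
    """Split records into a seed list, external metadata lines and private-only XML.

    Assumes a hypothetical layout: <records><record><url/><access/>...</record></records>.
    """
    root = ET.fromstring(xml_text)
    seeds = []            # URLs of public PDFs for the crawler to fetch
    metadata_lines = []   # one external metadata line per public URL
    private_root = ET.Element(root.tag)  # container for the private records

    for record in root.findall("record"):
        access = record.findtext("access", default="private").strip().lower()
        if access != "public":
            # Private (or unflagged) items stay as XML records: metadata only.
            private_root.append(record)
            continue

        url = record.findtext("url", default="").strip()
        if not url:
            continue  # a public record without a URL cannot be crawled
        seeds.append(url)

        # Attach every other non-empty field to the URL as metadata.
        fields = []
        for child in record:
            text = (child.text or "").strip()
            if child.tag in ("url", "access") or not text:
                continue
            value = text.replace('"', "'")  # keep the quoted value well-formed
            fields.append(f'{child.tag}:"{value}"')
        metadata_lines.append(" ".join([url] + fields))

    private_xml = ET.tostring(private_root, encoding="unicode")
    return seeds, metadata_lines, private_xml
```

The script would then write the three results to wherever your collection expects its seed list, external metadata file and XML input. Those paths depend on your setup, so they are left out here.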
It may be easier to use separate collections for the private and public items. The private collection would generate the seed list and external metadata for the public collection, and also discard the public XML items. You could then chain these collections via workflow: updating the private collection generates the config for the public collection, and a post-update workflow step starts the update of the public collection.