Duplicates and Meta Collections

Hi - hopefully a quick answer to this one.

I have a meta collection linked to 2 web collections (A and B)

Collection A has 150 pages
Collection B has 10,000 pages

The pages in collection A are also in collection B

Pages that are in BOTH collections show up twice in the results when I search the meta collection.

Is there any way to have the meta collection remove duplicates automatically - or is my only solution to ensure each web collection has unique documents?

Thanks in advance

Karl

Hi Karl,

Unfortunately, this is expected. There is no way to remove the duplicates from the meta collection itself when the same URLs are present in separate indexes.

Is there a reason for having a separate collection for the 150 pages? If it’s just to be able to search over the small subset of pages separately, you can achieve this within the single larger collection.

I would recommend that you try to ensure there is no overlap between the separate collections. If that is not possible, you should be able to use result collapsing (see "Result collapsing" in the Funnelback documentation, version 15.16.0) to suppress the duplicates by collapsing on the URL.
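To illustrate what collapsing on the URL means, here is a rough Python sketch. It is not Funnelback's implementation or configuration; the result dictionaries, the `collapse_on_url` function and the example URLs are all made up for illustration. It assumes the merged results arrive in rank order, so the first copy of each URL is the one to keep.

```python
def collapse_on_url(results):
    """Keep only the first (highest-ranked) result for each URL.

    `results` is a hypothetical rank-ordered list of dicts that each
    have a "url" key; this is not a Funnelback data structure.
    """
    seen = set()
    collapsed = []
    for result in results:
        url = result["url"]
        if url in seen:
            # A copy of this URL was already returned by another collection.
            continue
        seen.add(url)
        collapsed.append(result)
    return collapsed


# Hypothetical merged results from collections A and B.
merged = [
    {"url": "https://example.com/news/1", "collection": "A"},
    {"url": "https://example.com/news/1", "collection": "B"},
    {"url": "https://example.com/about", "collection": "B"},
]

unique = collapse_on_url(merged)
print([r["url"] for r in unique])
```

Note that collapsing of this kind only hides the extra copies at query time; both collections still hold the duplicate documents.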

regards,
Peter

Thanks for the clarification Peter.

The separate collection is so I can update recent content without having to crawl the 20,000 URLs in the main website.

Big crawls can take a long time and I need recent content to be available as soon as it is published.

We used to do instant update feeds - but often ran into issues with lock files while the main crawl was taking place, or when multiple instant updates tried to run at once.

I don’t really want to use a push collection - as that’s a big workload to get the API up and running and configure Matrix to act on all of the different events.

I should be able to work out a way to ensure the content in each collection is unique - and will also use the result collapsing option as a failsafe backup.

Thanks again for your help

Karl

Hi Karl,

Ensuring there isn’t overlap in the include patterns is probably the only way you’ll achieve what you want in this instance without a messy workflow.
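As a rough way to check for overlap before crawling, something like the Python sketch below could be run over a list of known URLs. It assumes plain substring matching for include and exclude patterns, which may not match how your collections are actually configured, and the patterns, URLs and function names are hypothetical.

```python
def in_scope(url, include, exclude=()):
    """True if the URL matches an include pattern and no exclude pattern.

    Assumes simple substring matching; adjust if your patterns are regexes.
    """
    return any(p in url for p in include) and not any(p in url for p in exclude)


def overlapping(urls, scope_a, scope_b):
    """Return the URLs that fall inside both (include, exclude) scopes."""
    return [u for u in urls if in_scope(u, *scope_a) and in_scope(u, *scope_b)]


# Hypothetical URLs and patterns.
urls = [
    "https://example.com/news/2024/launch",
    "https://example.com/about",
]
recent = (["example.com/news/"], [])                       # small, frequent crawl
main_before = (["example.com/"], [])                       # overlaps with `recent`
main_after = (["example.com/"], ["example.com/news/"])     # news excluded

print(overlapping(urls, recent, main_before))
print(overlapping(urls, recent, main_after))
```

The idea is that whatever the small collection includes, the large collection excludes, so each URL can only ever land in one index.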

Push collections with web content are not a good idea unless you can guarantee that each document is always pushed under the same URL (otherwise you’ll get duplicates) and you have a mechanism to remove expired content.

cheers,
Peter