Duplicate detection - problems with sparse content?

Does anyone know if there is a minimum number of characters that have to be indexed for the automatic duplicate detection to work?

Or does it work on a percentage of similar content?

I have these pages:

http://www.pmlive.com/pmhub/healthcare_advertising/taylor_james
http://www.pmlive.com/pmhub/healthcare_digital_communications/taylor_james

These have almost everything wrapped in noindex apart from a couple of small text areas, which are identical bar the different URL (which appears in two or three places on each page).
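For context, the noindexing uses Funnelback's noindex comment markers - everything between them is kept out of the index. Sketched roughly below (the markup is invented for illustration, not the real templates):

    <!--noindex-->
      <!-- header, navigation, footer etc. - kept out of the index -->
      ...
    <!--endnoindex-->

    <!-- only the small text areas outside the markers get indexed -->
    <div class="profile-text">...</div>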

I have similar pages with a lot more content where the duplicate detection works fine - only one URL shows. For example:

http://www.pmlive.com/pmhub/healthcare_advertising/HAVAS_LYNX
http://www.pmlive.com/pmhub/healthcare_creative_and_design/HAVAS_LYNX

But on the pages with a lot less content I can't get the duplicate detection to work.

I can't find anything on this in the manuals, so I'm guessing either:

1. There just isn't enough indexable content to trigger the duplicate removal, or
2. The different URLs in the indexable content mean there's a higher percentage of difference between the pages.

I.e. the pages with less indexable content are 95% similar (as 5 out of every 100 words are different), while the pages with more content are 99% similar (as 1 out of every 100 words are different). So the differing URLs affect the pages with less indexable content disproportionately. And the duplicate removal process does not disregard the current URL where it appears on the page (which I kind of thought it would).
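To put rough numbers on guess 2 (the word counts below are my own guesses, not measured from the real pages):

    # Back-of-envelope sketch: the same 3 differing URL occurrences
    # cost far more "similarity" on a sparse page than on a rich one.

    def similarity(total_words: int, differing_words: int) -> float:
        """Fraction of words the two pages have in common."""
        return 1 - differing_words / total_words

    print(f"sparse page (~60 words):  {similarity(60, 3):.1%}")   # 95.0%
    print(f"rich page   (~300 words): {similarity(300, 3):.1%}")  # 99.0%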

Any help on how this works is greatly appreciated, as usual.

Karl

PS - Is there a way to see exactly what has been indexed for a specific URL?

Well, I couldn't get it to work in the end.

I spent time noindexing this and that, so the two pages were 100% identical in terms of content outside of noindex blocks.

But that made not the slightest difference.

So I took a different tack: I created a new asset listing page in Matrix and set it as the start page in Funnelback, then used include patterns to pull in just the profile pages themselves and no other listing pages.

The idea was to narrow Funnelback's view down to a single listing page, and therefore a single URL for each profile page.
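In collection.cfg terms it's roughly the following - I believe start_url and include_patterns are the relevant settings, though the URLs here are placeholders rather than the real listing page:

    # collection.cfg (placeholder values)
    start_url=http://www.pmlive.com/pmhub/profile-listing
    include_patterns=www.pmlive.com/pmhub/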

It seems to have worked for now.

Hmmm - it stopped working after a bit, and was worse than ever: lots of duplicates instead of one or two.

I've now added crawler.max_link_distance=1 to the config file, so the crawl should start at the listing page as above and only go one link deep from there, therefore only indexing a single URL for each page.
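So the relevant collection.cfg lines now look something like this (same placeholder URLs as above):

    # collection.cfg (placeholder values)
    start_url=http://www.pmlive.com/pmhub/profile-listing
    include_patterns=www.pmlive.com/pmhub/
    # only follow links one hop from the start page
    crawler.max_link_distance=1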

Seems to work fine after the last re-index.