Excluding content in collection settings is excluding incorrect content

Hi, I am a relatively new Funnelback (FB) user and still finding my way around collection admin, so I’m hoping someone here might have some pointers to help me troubleshoot this issue.

When searching our main site, results are also showing up for a sub-site. I want to exclude all content from the sub-site from our main site collection.
The URLs are structured like this:
http://mainsite.net.au
http://sub-site.mainsite.net.au

I tried adding the sub-site to the field ‘Exclude content from’ in Edit collection settings. I tried using ‘sub-site’ and ‘sub-site.mainsite.net.au’.

This worked to hide those results from the main site search. However, in doing so, it has also excluded other pages from the main site that aren’t part of the sub-site.

Can anyone suggest what might cause other pages to be excluded that don’t match the exclude pattern that I entered? This may be too general a question, but if anyone can identify common causes that I could look into, that would be much appreciated.
Thanks

Hi eaustin,

You can try adding the protocol to your include pattern, which will prevent Funnelback from going off and gathering sub-domains. Something like this:

If your requirements get a bit more complicated, both the include and exclude patterns also support regular expressions.

However, I would try to avoid that if possible.
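
To illustrate why adding the protocol can matter, here is a minimal Python sketch. It assumes patterns are compared as plain substrings of the URL, which is a simplification and not Funnelback's actual implementation. The function name `is_crawlable` and the example paths are hypothetical.

```python
def is_crawlable(url, include_patterns, exclude_patterns):
    """Simplified model (assumption): keep a URL if it contains at least one
    include pattern and none of the exclude patterns, using substring checks."""
    included = any(pattern in url for pattern in include_patterns)
    excluded = any(pattern in url for pattern in exclude_patterns)
    return included and not excluded


# Hypothetical example pages on the two hosts from the question.
main_page = "http://mainsite.net.au/about"
sub_page = "http://sub-site.mainsite.net.au/about"

# Without the protocol, the include pattern also matches the sub-domain.
print(is_crawlable(sub_page, ["mainsite.net.au"], []))
# With the protocol, the sub-domain no longer matches but the main site does.
print(is_crawlable(sub_page, ["http://mainsite.net.au"], []))
print(is_crawlable(main_page, ["http://mainsite.net.au"], []))

# A short exclude pattern can over-match: this hypothetical main-site URL
# contains "sub-site" in its path, so it would be dropped as well.
print(is_crawlable("http://mainsite.net.au/news/sub-site-launch",
                   ["http://mainsite.net.au"], ["sub-site"]))
```

Under this model, a short exclude pattern is worth checking against the full list of main-site URLs, since it applies to the whole URL and not only to the host name.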

Hope this helps.

Thanks,

~Gioan

Hi gtran, thanks for the tip. I did find that suggestion on another post ( Funnelback include/exclude pattern no subdomains - #3 by quimby ) and I tried it but it didn’t seem to make a difference. The same page from the main site was still excluded.

Hmm, that’s strange.

Would you be able to provide more examples of the urls you would like to crawl and exclude? I can then try and get it working in our test environments so that I can provide you with some sample configurations.

If the content you are trying to crawl is not public, would you be able to provide examples using another site which mimics the same structure?

Thanks,

~Gioan

Hi gtran,

The URLs are all public, so I can share some examples here. We want to crawl everything that starts with /raisingchildren.net.au, and exclude anything that starts with /birthchoices.raisingchildren.net.au

As I mentioned, adding the exclude pattern ‘birthchoices’ seems to have worked to remove the birthchoices URLs from the collection, but at least one article that doesn’t have birthchoices in the URL was also excluded. If I can work out what caused this, I can see if there are other examples that I just haven’t found yet.
As a new user I can only add 2 links to a post. I’ll see if I can add a few more to a subsequent post.

Some Include/Exclude examples:
INCLUDE:

EXCLUDE:

The example that I found that was excluded was this page (which is served up at two possible URLs):

INCLUDE:

https://raisingchildren.net.au/guides/a-z-health-reference/doula

EXCLUDE:

https://birthchoices.raisingchildren.net.au/compare_care_options/birth_centre/doulas/index.html

Hi eaustin,

Thanks heaps for the info. After some testing, I believe the problem stems from the use of JavaScript.

Currently, Funnelback does not support crawling content that is generated by JavaScript.

For example, https://raisingchildren.net.au/pregnancy/labour-birth/preparing-for-birth

JavaScript on:

JavaScript off:

Because the links are not visible without JavaScript, Funnelback is unable to discover these URLs, which explains why https://raisingchildren.net.au/pregnancy/labour-birth/preparing-for-birth/baby-is-overdue is not being returned.
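
The link discovery problem described above can be sketched in Python. This is only an illustration of how a crawler that reads raw HTML without running scripts sees a page. The two page snippets are hypothetical markup, not the real site's source, and `LinkCollector` and `extract_links` are names made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags found in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the links visible in the HTML without executing any scripts."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page where the link is present in the served HTML.
STATIC_PAGE = (
    '<ul><li><a href="/pregnancy/labour-birth/preparing-for-birth/'
    'baby-is-overdue">Baby is overdue</a></li></ul>'
)

# Hypothetical page where the same link is only created by a script.
SCRIPTED_PAGE = """<ul id="articles"></ul>
<script>
  var a = document.createElement("a");
  a.href = "/pregnancy/labour-birth/preparing-for-birth/baby-is-overdue";
  document.getElementById("articles").appendChild(a);
</script>"""

print(extract_links(STATIC_PAGE))    # one link found
print(extract_links(SCRIPTED_PAGE))  # no links found
```

The second page contains the same URL as text inside the script, but no `<a>` tag exists until a browser runs the script, so a parser working on the served HTML finds nothing to follow.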

A workaround you could use is to instruct Funnelback to consume the sitemap.xml using the following setting:

However, I noticed that the sitemap at https://raisingchildren.net.au/sitemap.xml is producing errors. I would suggest getting that working properly, as it would also benefit other search engines like Google and Bing.
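
As a rough way to check a sitemap before pointing a crawler at it, the Python sketch below parses sitemap XML and lists the URLs it declares. It assumes the standard sitemaps.org `urlset` format. The sample documents are made up for illustration, and `sitemap_urls` is a hypothetical helper, not part of Funnelback.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> URLs of a sitemap; raise ValueError if the XML is malformed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"sitemap is not well-formed XML: {exc}") from exc
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sample that follows the sitemap format.
GOOD_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://raisingchildren.net.au/guides/a-z-health-reference/doula</loc></url>"
    "</urlset>"
)

# Made-up broken sample: the <url> element is never closed.
BROKEN_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://raisingchildren.net.au/</loc>"
    "</urlset>"
)

print(sitemap_urls(GOOD_SITEMAP))
try:
    sitemap_urls(BROKEN_SITEMAP)
except ValueError as err:
    print("error:", err)
```

To test a live sitemap, the same function could be given the text downloaded from the sitemap URL. A parse error here suggests crawlers that rely on the sitemap may also fail to read it.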

Hope this helps.

For completeness, I have included the configuration and a demo link for the test collection I used, which crawls about 1060 pages on https://raisingchildren.net.au/:

Configuration:

Demo Link:

Thanks,

~Gioan