Jan 2020

We are exploring web collections and would like to understand the following: the customer will have a set of pages for their branch locations on the public site. These will be in the same domain but will not be 'linked' from any page in the IA / site hierarchy. What is the mechanism to add these to the search index? And is there a way to automate it, i.e. will they be deleted every time a crawl happens, and/or is there any way to persist them?

One way I can think of is to have a push collection, and make this collection and the existing web collection part of a 'meta' collection. We would keep updating (or deleting) the locator pages in the push collection via the API. This way the meta collection will always be up to date.
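A minimal sketch of that push-collection approach. Note the base URL, port, collection name, and auth header below are all placeholders/assumptions, not values from this thread — check the Push API documentation for your Funnelback version for the exact endpoint shape:

```python
import urllib.parse
import urllib.request

# Assumed Push API base URL and collection name -- placeholders only.
PUSH_API = "https://funnelback-admin.example.com:8443/push-api/v1"
COLLECTION = "branch-locators"

def build_put_request(page_url: str, html: bytes, token: str) -> urllib.request.Request:
    """Build a PUT request adding/updating one locator page in the push collection."""
    endpoint = (f"{PUSH_API}/collections/{COLLECTION}/documents"
                f"?key={urllib.parse.quote(page_url, safe='')}")
    req = urllib.request.Request(endpoint, data=html, method="PUT")
    req.add_header("Content-Type", "text/html")
    req.add_header("X-Security-Token", token)  # assumed auth scheme
    return req

def build_delete_request(page_url: str, token: str) -> urllib.request.Request:
    """Build a DELETE request removing one locator page (e.g. a closed branch)."""
    endpoint = (f"{PUSH_API}/collections/{COLLECTION}/documents"
                f"?key={urllib.parse.quote(page_url, safe='')}")
    req = urllib.request.Request(endpoint, method="DELETE")
    req.add_header("X-Security-Token", token)
    return req

if __name__ == "__main__":
    # Add or refresh a locator page; the meta collection picks it up automatically.
    req = build_put_request("https://www.example.com/branches/springfield",
                            b"<html><body>Springfield branch</body></html>",
                            token="secret")
    # urllib.request.urlopen(req)  # uncomment to actually send the request
```

Since the push collection and the web collection are both components of the meta collection, updates pushed this way appear in search without waiting for a crawl.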

Any thoughts?

created Jan '20 · last reply Dec '22 · 7 replies · 5.2k views · 3 users · 2 likes · 2 links

The simplest way is probably to create a sitemap.xml that lists all of these pages, and link to it from your robots.txt file. Funnelback will then fetch these pages if you enable sitemap XML support (https://docs.funnelback.com/15.24/administer/reference-documents/collection-options/crawler.use_sitemap_xml.html).
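The moving parts of that setup are small — a sitemap.xml listing the unlinked pages, and a Sitemap: directive in robots.txt pointing at it (all URLs below are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/branches/springfield</loc></url>
  <url><loc>https://www.example.com/branches/shelbyville</loc></url>
</urlset>
```

and in robots.txt:

```
User-agent: *
Sitemap: https://www.example.com/sitemap.xml
```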

If the website is within a CMS you might be able to create a hidden page that links to all of the URLs. Once you have this you could add the hidden page's URL to your start URLs, and then also add it to kill_exact.cfg so the hidden page itself is removed from the index after your update is complete.
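For that second option, kill_exact.cfg is just a list of exact URLs to drop from the index, one per line (the URL below is a placeholder for your hidden page):

```
https://www.example.com/hidden/all-branches
```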

In option #1, will sitemap.xml itself be part of the search results? In other words, how will we make sure sitemap.xml is not itself searchable?

With regards to option #2 - yes, on a CMS this makes sense. We could even configure the web server to 301 redirect requests for the 'hidden' page to the locator landing page.
Thanks.

sitemap.xml won't show up in the search results, but will expose the pages to Funnelback (and potentially to other things that index your site, like Google). See https://www.sitemaps.org/index.html for more info about sitemaps if you don't already know what they are.

Thanks @plevan.
Is there an optimal number of URLs that should be in sitemap.xml from Funnelback's perspective? For example, is 6000 URLs listed in a sitemap a good number?
I understand that it will also depend on what is inside these URLs, i.e. how many actual pages are linked from them.
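For scale context, the sitemaps.org protocol itself allows up to 50,000 URLs (and 50MB uncompressed) per sitemap file, so 6000 URLs fit comfortably in a single file. A quick sketch of generating such a file with the Python standard library (domain and paths are made up):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file limit from the sitemaps.org protocol

def build_sitemap(urls):
    """Serialise a list of URLs into sitemap.xml bytes; raises if over the limit."""
    if len(urls) > MAX_URLS_PER_SITEMAP:
        raise ValueError("split into multiple sitemaps and use a sitemap index")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    # 6000 hypothetical branch-locator pages -- well under the 50,000 cap.
    pages = [f"https://www.example.com/branches/{i}" for i in range(6000)]
    with open("sitemap.xml", "wb") as f:
        f.write(build_sitemap(pages))
```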

1 year later

How does one set the "crawler.use_sitemap_xml" key using the Funnelback Admin dashboard? From the dashboard I first selected the "Administer" tab, and then selected "Edit Collection Configuration". When I type in "crawler.use_sitemap_xml", however, the key does not autocomplete, so I cannot select it from the list of keys. What am I doing wrong?

1 year later

You are editing in the correct place - you add this to the collection configuration of the web collection where you want to crawl the sitemap.

As to why it's not showing up in the auto-complete - I'm not sure. There could be some error happening when it goes to fetch the matching keys (this might show up in the javascript console or network tab). Either way you should still be able to enter the key directly, even without the autocomplete working.
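Entering the key directly amounts to adding one line to the web collection's configuration (assuming the usual boolean true/false form for this option — verify against the docs for your version):

```
crawler.use_sitemap_xml=true
```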

Thanks, @plevan. There was an error in our permissions that prevented me from using the key. Once that was resolved, I could add the "crawler.use_sitemap_xml" key.