I’m attempting to crawl a Matrix 6 HTTPS site that is currently under construction but Funnelback is failing. I’ve read through a lot of the documentation and there is little information on the process for doing so.
I’ve set the following parameters:
crawler.protocols set to HTTPS
crawler.form_interaction.pre_crawl.*.url: URL of the site as it has the login form
For authentication I have tested both:
matrix_user matrix_password and
http_user http_password
These are not working.
I’ve successfully crawled a live site in our system, but I’m coming unstuck crawling one that is under construction.
Under construction pages in Matrix can only be viewed by a logged-in user.
To index these in Funnelback you’ll need to crawl your Matrix site as an authenticated user (that has read permissions to view the pages you expect to see in the search results).
When configuring authentication with Matrix you normally need to configure form authentication - with something like this in your data source configuration:
HTTP Basic authentication will only work in Matrix if that has explicitly been configured on the server (and in the DXP this probably isn’t an option).
Don’t forget to remove this form interaction configuration once your site is live, though; otherwise you’ll continue to index unpublished content.
Note also that you’ll index anything that the configured user can see, so if you personalise the page for that user, those personalisations will be included in the index.
Yes, this is for testing a new site. We would like to know that FB is working and how the results appear before we go live.
The matrixlogin parameters you listed are not an option for the data source configuration of type Web. The following is available: crawler.form_interaction.pre_crawl.*.url but the .*. is replaced by the compulsory Group ID which has to be a number.
Ah sorry, I wrote those instructions for interacting with Matrix quite a while back, and at the time the documentation indicated that the * could be any identifier. Just update the keys in what I sent to use a number instead.
The value used for the * isn’t important; it just needs to be the same for the four linked keys, as this is what groups the options together.
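As a rough sketch, the four linked keys with a numeric group ID might look something like the following. The group ID of 1, the example.com URL, the form number and the field names SQ_LOGIN_USERNAME and SQ_LOGIN_PASSWORD are all assumptions here, so check them against the HTML of your own login form and the Funnelback form interaction documentation before relying on them.

```
# Sketch only: group ID, URL, form number and field names are assumptions.
# Page that contains the Matrix login form (hypothetical URL).
crawler.form_interaction.pre_crawl.1.url=https://www.example.com/
# Which form on that page to submit (assumed here to be the first one).
crawler.form_interaction.pre_crawl.1.form_number=1
# Username field, named as it appears in the login form's HTML (assumed).
crawler.form_interaction.pre_crawl.1.cleartext.SQ_LOGIN_USERNAME=<crawl user name>
# Password field (assumed name); the encrypted key keeps it out of plain text.
crawler.form_interaction.pre_crawl.1.encrypted.SQ_LOGIN_PASSWORD=<crawl user password>
```

The same number has to appear in all four keys, since that is what ties them together as one form interaction. The values in angle brackets are placeholders for a Matrix user that has read permission on the under construction pages.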
Double-check your credentials and ensure they’re correct.
Make sure your login form script handles all necessary fields.
Ensure your crawler is set to handle cookies.
Verify the site’s SSL certificate is valid and trusted by the crawler.
Enable detailed logging to identify any specific issues.