Crawling a Matrix site that is under construction

Hello,

I’m attempting to crawl a Matrix 6 HTTPS site that is currently under construction, but the crawl in Funnelback is failing. I’ve read through a lot of the documentation and there is little information on the process for doing so.

I’ve set the following parameters:

crawler.protocols set to HTTPS
crawler.form_interaction.pre_crawl.*.url: URL of the site as it has the login form

For authentication I have tested both:

matrix_user matrix_password and
http_user http_password

These are not working.

I’ve successfully crawled a live site in our system, but I’m coming unstuck crawling one that is under construction.

Am I missing something obvious here?

Cheers,
Paul


Under construction pages in Matrix can only be viewed by a logged-in user.

To index these in Funnelback you’ll need to crawl your Matrix site as an authenticated user (that has read permissions to view the pages you expect to see in the search results).

When configuring authentication with Matrix you normally need to configure form authentication - with something like this in your data source configuration:

crawler.form_interaction.pre_crawl.matrixlogin.url: http://<MATRIX-SERVER>/home
crawler.form_interaction.pre_crawl.matrixlogin.form_number: 1
crawler.form_interaction.pre_crawl.matrixlogin.cleartext.sq_username: <USER-NAME>
crawler.form_interaction.pre_crawl.matrixlogin.encrypted.password: <PASSWORD>

HTTP Basic authentication will only work in Matrix if that has explicitly been configured on the server (and in the DXP this probably isn’t an option).

Don’t forget to remove this form interaction configuration once your site is live, though; otherwise you’ll continue to index unpublished content.

Note also that you’ll index anything that the configured user can see, so if you personalise the page for that user, those personalisations will be included in the index.

There’s a bit more information on form interaction here in the documentation: Form interaction :: Squiz DXP Help Center

Yes, this is for testing a new site. We would like to know that FB is working and how the results appear before we go live.

The matrixlogin parameters you listed are not an option for the data source configuration of type Web. The following is available: crawler.form_interaction.pre_crawl.*.url but the .*. is replaced by the compulsory Group ID which has to be a number.

Am I missing something here?

Hi Paul,

Ah sorry, I wrote those instructions for interacting with Matrix quite a while back, and at the time the documentation indicated that the * could be any identifier. Just update the keys in what I sent to use a number instead.

e.g.

crawler.form_interaction.pre_crawl.0.url: http://<MATRIX-SERVER>/home
crawler.form_interaction.pre_crawl.0.form_number: 1
crawler.form_interaction.pre_crawl.0.cleartext.sq_username: <USER-NAME>
crawler.form_interaction.pre_crawl.0.encrypted.password: <PASSWORD>

The value used for the * isn’t important in itself; it just needs to be the same for the four linked keys (as this is what groups the options together).
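
If it helps to sanity-check the keys before saving them, here is a rough Python sketch of how the ID groups the settings together. It is not part of Funnelback: the function names are made up for illustration, it assumes the settings are written as `key: value` lines as in this thread, and treating all four linked keys as required is my assumption. The host name and user name in the sample are placeholders.

```python
from collections import defaultdict

PREFIX = "crawler.form_interaction.pre_crawl."

# The four linked keys used in this thread. Treating all four as
# required is an assumption made for this sketch.
EXPECTED_SUFFIXES = {
    "url",
    "form_number",
    "cleartext.sq_username",
    "encrypted.password",
}


def group_form_interaction_keys(lines):
    """Group pre_crawl settings by the ID that follows the prefix."""
    groups = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line.startswith(PREFIX):
            continue
        # Split at the first colon only, so URLs in the value stay intact.
        key, _, value = line.partition(":")
        group_id, _, suffix = key[len(PREFIX):].partition(".")
        groups[group_id][suffix] = value.strip()
    return dict(groups)


def check_groups(groups):
    """Report non-numeric IDs and groups missing one of the linked keys."""
    problems = []
    for group_id, settings in sorted(groups.items()):
        if not group_id.isdigit():
            problems.append(f"group '{group_id}': ID is not a number")
        for suffix in sorted(EXPECTED_SUFFIXES - set(settings)):
            problems.append(f"group '{group_id}': missing {suffix}")
    return problems


if __name__ == "__main__":
    # Placeholder values; the password key deliberately uses a different ID.
    sample = [
        "crawler.form_interaction.pre_crawl.0.url: https://matrix.example.com/home",
        "crawler.form_interaction.pre_crawl.0.form_number: 1",
        "crawler.form_interaction.pre_crawl.0.cleartext.sq_username: crawl-user",
        "crawler.form_interaction.pre_crawl.1.encrypted.password: not-a-real-password",
    ]
    for problem in check_groups(group_form_interaction_keys(sample)):
        print(problem)
```

In the sample, the password key uses ID 1 while the other three use ID 0, so both groups are reported as incomplete. That is the kind of mismatch that would stop the four options being treated as one login form.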

Hello

You can check the steps below:

Double-check your credentials and ensure they’re correct.
Make sure your login form script handles all necessary fields.
Ensure your crawler is set to handle cookies.
Verify the site’s SSL certificate is valid and trusted by the crawler.
Enable detailed logging to identify any specific issues.

Thank you