Indexing Redirect URLs

I have a scenario where I am indexing a number of external websites, but some of the start URLs redirect to other URLs. We need to be able to populate the start URLs with a URL such as “mysite.com”, but when the crawler runs it is redirected to a different URL (e.g. “redirect.com”). This is fine, but when it comes to the external_metadata configuration we would ideally like to populate this file with the known URL mysite.com rather than the redirected URL. Is there any way to configure the crawler to handle redirects?

For example - if my external_metadata.cfg contains:

mysite.com disclaimer:this is a third party site

I would want this disclaimer to be applied to mysite.com and also any sites to which the crawler is redirected (rather than linked). Is this possible or would we need to explicitly list each redirect URL in the external_metadata.cfg?

Hi Jon,

There isn’t a lot you can do to control the crawler’s redirect behaviour.

There is a server_alias.cfg you can use to set a preferred server name when a server has aliases, but it seems to me that the easiest solution here might be to invert the check: apply an external metadata field to pages on your internal site and have your template check whether that field exists. If it doesn’t exist, print out your disclaimer message.

So you might have something like

internalsite.com disclaimer:"this is your internal website"

then in your FreeMarker template you can do something like

<#if !s.result.metaData["disclaimer"]??>
  Disclaimer: This is a third party site
</#if>

Thanks - I’ll maybe just work with the redirected URLs instead.
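
For anyone taking the same route, here is a rough Python sketch of how the redirected URLs could be collected so they don't have to be looked up by hand. It resolves each start URL, then prints one line per start URL and per distinct redirect target. The function names (resolve_final_url, metadata_lines) are made up for this example, and the output format simply mirrors the "URL field:value" line shown earlier in this thread, so check it against what your version of external_metadata.cfg expects. Note that urllib needs full URLs with a scheme (e.g. "http://mysite.com/"), unlike the bare hostname used above.

```python
import urllib.request
from typing import Callable, Iterable, List


def resolve_final_url(url: str, timeout: float = 10.0) -> str:
    """Follow HTTP redirects and return the URL that was finally served."""
    # urllib follows 3xx redirects by default; geturl() reports the final URL.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def metadata_lines(
    start_urls: Iterable[str],
    metadata: str,
    resolve: Callable[[str], str] = resolve_final_url,
) -> List[str]:
    """Build one 'URL metadata' line per start URL and per distinct redirect target."""
    lines: List[str] = []
    seen = set()
    for start in start_urls:
        # Emit the known start URL first, then wherever it redirects to.
        for url in (start, resolve(start)):
            if url not in seen:
                seen.add(url)
                lines.append(f"{url} {metadata}")
    return lines


if __name__ == "__main__":
    # Hypothetical start URL; replace with your own list.
    for line in metadata_lines(
        ["http://mysite.com/"], "disclaimer:this is a third party site"
    ):
        print(line)
```

The resolver is passed in as a parameter so the line-building logic can be tried out without hitting the network. This only captures the redirect seen at the time the script runs, so it would need re-running if the target sites change their redirects.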