Is it possible to find out how a particular page was added to the crawl?

I’ve been asked why a certain URL is being crawled. It is publicly accessible and within the allowed domain, so it is correctly being indexed. But the URL shouldn’t have been discoverable.

So we’re wondering: how did the crawler find it? Looking at crawl.log, I can see the URL, and I don’t see any links in the URLs preceding it, but there are a lot of them and I’m aware it could have come from any of them.

So is there a way to report on the referrer of a particular URL? Or a log of where each URL was first found?
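In the meantime, one rough way to narrow things down is to scan the crawl log yourself for the URL and look at the entries fetched just before it. This is only a sketch: the log path and line format below are illustrative assumptions, not the actual crawl.log layout, so adjust the parsing to match your log.

```python
# Hedged sketch: show a URL's position in a crawl log, with a few lines of
# leading context, to eyeball which recently fetched pages might link to it.
# The log format here (timestamp + "fetched" + URL) is an assumption.
from collections import deque

def context_before(lines, needle, n=3):
    """Return up to n log lines preceding the first line containing needle."""
    window = deque(maxlen=n)
    for line in lines:
        if needle in line:
            return list(window) + [line]
        window.append(line)
    return []

# Illustrative sample standing in for a real crawl.log:
sample_log = [
    "2024-01-01 10:00:00 fetched http://site.com/index.html",
    "2024-01-01 10:00:01 fetched http://site.com/other.html",
    "2024-01-01 10:00:02 fetched http://site.com/path/to/page.html",
]
for line in context_before(sample_log, "path/to/page.html", n=2):
    print(line)
```

With a real log you would pass `open("crawl.log")` instead of the sample list. Of course, as you say, proximity in the log doesn’t prove a referrer relationship, so this only narrows the candidates.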

Is sitemap.xml support enabled? If so, have you checked the site’s robots.txt to see if there is a linked sitemap (and whether the URL is listed in it)?
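If it helps, the robots.txt/sitemap check can be done with a few lines of parsing. This is a minimal sketch using only the standard library; the sample strings are stand-ins for the fetched robots.txt and sitemap.xml bodies.

```python
# Hedged sketch: find Sitemap: declarations in robots.txt and check whether a
# given URL appears as a <loc> entry in a sitemap.xml body.
import re
import xml.etree.ElementTree as ET

def sitemaps_from_robots(robots_txt):
    """Return any sitemap URLs declared via 'Sitemap:' lines in robots.txt."""
    return re.findall(r"(?im)^sitemap:\s*(\S+)", robots_txt)

def url_in_sitemap(sitemap_xml, url):
    """True if `url` is listed as a <loc> entry in the sitemap XML."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", ns) if el.text]
    return url in locs

# Illustrative inputs (you would fetch these from the site):
robots = "User-agent: *\nSitemap: http://site.com/sitemap.xml\n"
sitemap = (
    '<?xml version="1.0"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://site.com/path/to/page.html</loc></url>"
    "</urlset>"
)
print(sitemaps_from_robots(robots))
print(url_in_sitemap(sitemap, "http://site.com/path/to/page.html"))
```

Fetching the two files is left out deliberately; any HTTP client will do, and the parsing above is the part that answers the question.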

You could also try searching the index for the URL in the h: metadata field.

e.g. if the URL you are interested in is http://site.com/path/to/page.html, try searching the index for h:site.com/path/to/page.html

I think this should show the pages within the index that link to that page.

Thanks Peter, I haven’t tried, but that h field sounds like what I was after.

But I did forget that a sitemap was being used, and yes, this ‘undiscoverable’ URL was in fact in the sitemap. Well, there’s the answer! I’ve fixed the problem there now.

Glad the solution was fairly straightforward. I’m not entirely sure how the h metadata works or if that would indeed give you the answer that you wanted.