Indexing pages with Google Tag Manager code

Hello everyone.

Has anyone ever had any issues getting the Funnelback spider to work with Google Tag Manager snippets?

Specifically, we're seeing an issue on one of our pages where the spider doesn't seem to know what to do with the src element of the JavaScript.

    <noscript><iframe src="//www.googletagmanager.com/ns.html?id=CODE"
    height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
    new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
    j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
    '//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
    })(window,document,'script','dataLayer','CODE');</script>

What happens is we start seeing errors where the spider tries to access URLs like http://oursite.co.uk/courses/courseid/gtm.start and oursite.co.uk/courses/courseid/gtm.js while it is crawling oursite.co.uk/courses/courseid/.

This doesn't happen when browsing to oursite.co.uk/courses/courseid/ in a normal browser.

Anyone got any ideas?

Funnelback may be attempting to extract URLs from that JavaScript, producing corresponding 404 errors in your web logs (and also in $SEARCH_HOME/data/$COLLECTION/live/logs/url_errors.log). These aren't showstoppers, but you could prevent the errors from occurring by:

  1. Adding gtm.start to your exclude patterns for the collection (the default exclude patterns for web collections already contain several Google Analytics URL snippets)
  2. Disabling Funnelback's JavaScript link extraction behaviour in collection.cfg (crawler.extract_links_from_javascript=false; see the sketch below)

A full update is required for these changes to take effect.
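
For reference, here's roughly what those two options look like in collection.cfg. The crawler.extract_links_from_javascript key is as above; the exclude_patterns key name is from memory, so double-check it against your collection's existing configuration:

    # Option 1: add gtm.start to the collection's exclude patterns
    # (append to the existing comma-separated list rather than replacing it)
    exclude_patterns=<your existing patterns>,gtm.start

    # Option 2: switch off JavaScript link extraction entirely
    crawler.extract_links_from_javascript=false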

Gordon's assessment is correct - if you see lots of repeated errors in your url_errors.log that are not real URLs, the likelihood is that the JS link extractor is the culprit.

This is definitely the case when you see gtm.js errors in your log file. Turning off the JS link extractor and running an update will indeed solve your problem.

Thanks for your help, both.

We’ll turn off the JS link extractor and run the update.

Thanks again.

Just thought of something.

If we've got some JavaScript that creates URLs on the fly that we want indexed, then switching this off will stop that working, won't it?

I'm not sure we have any JavaScript like that, but I'll need to check whether my suspicions are correct.

If you've got client-side scripting that's generating URLs on the fly, those URLs won't be detected by the Funnelback web crawler.

When visiting a URL, Funnelback only sees the page as it would be generated for a user agent with JavaScript disabled.
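
To illustrate the distinction (a contrived sketch; the links and URLs here are made up, not from your site): a link that only exists after script runs is invisible to the crawler, while a link in the delivered HTML is picked up as normal.

    <!-- This static link is in the HTML the crawler downloads, so it can be followed -->
    <a href="/courses/archive/">Course archive</a>

    <script>
    // This link only exists once the script has run in a browser;
    // a crawler that doesn't execute JavaScript never sees it.
    var a = document.createElement('a');
    a.href = '/courses/2016-intake/';
    a.textContent = '2016 intake';
    document.body.appendChild(a);
    </script>

If URLs like that need indexing, the usual workaround is to expose them somewhere crawlable as well, e.g. a plain HTML sitemap or a noscript fallback.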

Thanks for your help with this Gordon.

That's correct - the Funnelback crawler doesn't actually know how to process or run JavaScript, and will see a web page in essentially the same way as it looks when you view it with JavaScript disabled in your browser.

The JavaScript link extractor is just some parsing code that attempts to find anything in JavaScript blocks that looks a bit like a link, but mostly this just generates false links that result in 404s when they are tried.
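
As a rough illustration of why strings like gtm.start and gtm.js get picked up (a contrived toy, not Funnelback's actual extraction code):

    // Naively scan a script block for anything shaped like "name.ext", then
    // resolve each hit against the page URL - roughly what a heuristic link
    // extractor does. This is a toy sketch, not Funnelback's implementation.
    var scriptBlock = "w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});";
    var pageUrl = 'http://oursite.co.uk/courses/courseid/';

    var matches = scriptBlock.match(/[\w-]+\.\w+/g) || [];
    matches.forEach(function (hit) {
        // Prints http://oursite.co.uk/courses/courseid/gtm.start and
        // http://oursite.co.uk/courses/courseid/gtm.js - the same
        // 404s showing up in url_errors.log.
        console.log(new URL(hit, pageUrl).href);
    });

Anything that happens to look like a relative file reference inside the script ends up resolved against the page being crawled, which is why the bogus URLs sit under /courses/courseid/.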