Indexing pages with Google Tag Manager code

Hello everyone.

Has anyone ever had any issues getting the Funnelback spider to work with Google Tag Manager snippets?

Specifically, we're seeing an issue on one of our pages where the spider doesn't seem to know what to do with the src element of the JavaScript.

    <noscript><iframe src="//www.googletagmanager.com/ns.html?id=CODE"
    height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
    new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
    j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
    '//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
    })(window,document,'script','dataLayer','CODE');</script>

What happens is we start seeing errors where the spider tries to access URLs like http://oursite.co.uk/courses/courseid/gtm.start and oursite.co.uk/courses/courseid/gtm.js while it is crawling oursite.co.uk/courses/courseid/.

This doesn't happen when browsing to oursite.co.uk/courses/courseid/ in a normal browser.

Anyone got any ideas?

Funnelback may be attempting to extract URLs from that JavaScript, producing corresponding 404 errors in your web logs (and also in $SEARCH_HOME/data/$COLLECTION/live/logs/url_errors.log). These aren't showstoppers, but you could prevent the errors from occurring by:

  1. Adding gtm.start to your exclude patterns for the collection (the default exclude patterns for web collections already contain several Google Analytics URL snippets)
  2. Disabling Funnelback's JavaScript link extraction behaviour in collection.cfg (crawler.extract_links_from_javascript=false; see the sketch below)

A full update is required for these changes to take effect.
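
For reference, here's roughly what those two options look like in collection.cfg. The crawler.extract_links_from_javascript key is as above; the exclude_patterns key name is from memory, so double-check it against your collection's existing configuration:

    # Option 1: add gtm.start to the collection's exclude patterns
    # (append to the existing comma-separated list rather than replacing it)
    exclude_patterns=<your existing patterns>,gtm.start

    # Option 2: switch off JavaScript link extraction entirely
    crawler.extract_links_from_javascript=false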

Gordon's assessment is correct - if you see lots of repeated errors in your url_errors.log that are not real URLs, the likelihood is that the JS link extractor is the culprit.

This is definitely the case when you see gtm.js errors in your log file. Turning off the JS link extractor and running an update will indeed solve your problem.

Thanks for your help, both.

We’ll turn off the JS link extractor and run the update.

Thanks again.

Just thought of something.

If we've got some JavaScript that creates URLs on the fly that we want indexed, then switching this off will stop that working, won't it?

I'm not sure we have any JavaScript like that, but I'll need to check whether my suspicions are correct.

If you've got client-side scripting that's generating URLs on the fly, those URLs won't be detected by the Funnelback web crawler.

When visiting a URL, Funnelback only sees the page as it would be generated for a user agent with JavaScript disabled.
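
To illustrate the distinction (a contrived sketch; the links and URLs here are made up, not from your site): a link that only exists after script runs is invisible to the crawler, while a link in the delivered HTML is picked up as normal.

    <!-- This static link is in the HTML the crawler downloads, so it can be followed -->
    <a href="/courses/archive/">Course archive</a>

    <script>
    // This link only exists once the script has run in a browser;
    // a crawler that doesn't execute JavaScript never sees it.
    var a = document.createElement('a');
    a.href = '/courses/2016-intake/';
    a.textContent = '2016 intake';
    document.body.appendChild(a);
    </script>

If URLs like that need indexing, the usual workaround is to expose them somewhere crawlable as well, e.g. a plain HTML sitemap or a noscript fallback.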

Thanks for your help with this Gordon.

That's correct - the Funnelback crawler doesn't actually know how to process or run JavaScript, and will see a web page in essentially the same way as it looks when you view it with JavaScript disabled in your browser.

The JavaScript link extractor is just some parsing code that attempts to find anything in JavaScript blocks that looks a bit like a link, but mostly this just generates false links that result in 404s when they are tried.
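
As a rough illustration of why strings like gtm.start and gtm.js get picked up (a contrived toy, not Funnelback's actual extraction code):

    // Naively scan a script block for anything shaped like "name.ext", then
    // resolve each hit against the page URL - roughly what a heuristic link
    // extractor does. This is a toy sketch, not Funnelback's implementation.
    var scriptBlock = "w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});";
    var pageUrl = 'http://oursite.co.uk/courses/courseid/';

    var matches = scriptBlock.match(/[\w-]+\.\w+/g) || [];
    matches.forEach(function (hit) {
        // Prints http://oursite.co.uk/courses/courseid/gtm.start and
        // http://oursite.co.uk/courses/courseid/gtm.js - the same
        // 404s showing up in url_errors.log.
        console.log(new URL(hit, pageUrl).href);
    });

Anything that happens to look like a relative file reference inside the script ends up resolved against the page being crawled, which is why the bogus URLs sit under /courses/courseid/.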