Funnelback include/exclude pattern no subdomains

Hi there, I’m wondering if there is a way to exclude all subdomains from a Funnelback crawl/index. We have a situation where the main site is the top-level domain, but spin-off sites get created semi-regularly on subdomains of our main one without anyone letting us know. Since they’re subdomains, they are included in the crawl and therefore in the site search results. I know how to exclude specific subdomains, but since subdomains can get spun up at any time I’d like a catch-all.

E.g.
We currently include all of ‘economicdevelopment.vic.gov.au’. We want to exclude ‘[anything].economicdevelopment.vic.gov.au’.

Cheers


Hi quimby -

A regular expression exclude pattern sounds like the solution to this particular problem. Exclude patterns are defined in collection.cfg:

exclude_patterns=regexp:(?<!www)\.economicdevelopment\.vic\.gov\.au

Assuming you wanted to include www.economicdevelopment.vic.gov.au, but not any other sub-domains, this should get you pretty close.

See also:
https://docs.funnelback.com/include_and_exclude_patterns.html#regularexpressionsinincludeexcludepatterns
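
A pattern like this is easy to sanity-check against sample URLs before it goes into collection.cfg. The sketch below uses Python's re module purely as a stand-in; Funnelback's own regex engine may differ in details, and the sample URLs, the is_excluded helper and the "match anywhere in the URL" behaviour are assumptions for illustration. It compares a lookahead form with a lookbehind form, and adds a catch-all that matches any subdomain, as asked in the original question.

```python
import re

# Lookahead form: the (?!www) check runs at the position of the dot,
# where the next character is always ".", so it never rejects a match.
LOOKAHEAD = re.compile(r"(?!www)\.economicdevelopment\.vic\.gov\.au")

# Lookbehind form: rejects the match when the dot is preceded by "www".
LOOKBEHIND = re.compile(r"(?<!www)\.economicdevelopment\.vic\.gov\.au")

# Catch-all: any host label followed by ".economicdevelopment.vic.gov.au".
ANY_SUBDOMAIN = re.compile(r"[^/.]+\.economicdevelopment\.vic\.gov\.au")

# Hypothetical sample URLs.
BARE = "http://economicdevelopment.vic.gov.au/about"
WWW = "http://www.economicdevelopment.vic.gov.au/"
SPINOFF = "http://spinoff.economicdevelopment.vic.gov.au/"


def is_excluded(pattern, url):
    """Assumed model: a URL is excluded if the pattern matches anywhere in it."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for name, pattern in [("lookahead", LOOKAHEAD),
                          ("lookbehind", LOOKBEHIND),
                          ("any subdomain", ANY_SUBDOMAIN)]:
        for url in (BARE, WWW, SPINOFF):
            print(name, url, is_excluded(pattern, url))
```

In this model the bare domain is never excluded, because its URL has no dot in front of the domain name. Only the lookbehind form spares www; the lookahead form and the catch-all both exclude it along with every other subdomain.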

Thanks Gordon. I actually needed it to exclude all subdomains, but using regex like that looks like a very good starting point.

In my case I solved it (just then) even more easily by changing the include pattern from:

economicdevelopment.vic.gov.au

to
http://economicdevelopment.vic.gov.au

Usually I would leave out the protocol, but in this case including it seems to have the effect that the crawler ignores all the subdomains, which is what I was after. Nice to have the regex option up my sleeve though.
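
One way to see why adding the protocol has this effect is to model a plain include pattern as a substring test on the URL. That model is an assumption about how the crawler treats non-regex patterns, and the is_included helper and sample URLs below are made up for illustration. Here is a minimal Python sketch:

```python
def is_included(include_pattern: str, url: str) -> bool:
    """Assumed model: a plain include pattern matches if it is a substring of the URL."""
    return include_pattern in url


WITHOUT_PROTOCOL = "economicdevelopment.vic.gov.au"
WITH_PROTOCOL = "http://economicdevelopment.vic.gov.au"

# Hypothetical sample URLs.
MAIN_SITE = "http://economicdevelopment.vic.gov.au/about"
SPINOFF = "http://spinoff.economicdevelopment.vic.gov.au/"
MAIN_SITE_HTTPS = "https://economicdevelopment.vic.gov.au/about"

if __name__ == "__main__":
    for pattern in (WITHOUT_PROTOCOL, WITH_PROTOCOL):
        for url in (MAIN_SITE, SPINOFF, MAIN_SITE_HTTPS):
            print(pattern, url, is_included(pattern, url))
```

Under this model the pattern without a protocol matches the subdomain URL too, because the domain name appears inside it. With "http://" in front, the subdomain URL no longer matches, since the host label sits between the protocol and the domain. The same model would also reject https:// URLs for the main site, so that may be worth checking if the site is served over HTTPS.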

Cheers

Nice one. Include patterns certainly support protocols, and things can get pretty gnarly if you’re wrangling regex exclude patterns for dozens of TLDs over time.

I think you’ve picked the appropriate solution to your problem there, @quimby.