Lately the volume of spam/robot submissions to our search has increased significantly. By spam submissions I mean queries like http://likespromotion.com/, https://www.youtube.com/watch?v=... , http://www.clashofclanshackforfree.com … They don’t put much load on us yet, but they create a lot of noise in our reports and logs.
I am already using reporting-blacklist.cfg, but unfortunately it won’t work in this case. I would like to exclude all queries starting with http(s)://, and to my knowledge the blacklist doesn’t support regular expressions.
Does anyone know a way we could filter those robot/spam submissions, ideally before the queries are processed, or at least at reporting time?
Thanks,
Vitali
I'd start by focusing on preventing those queries from ever reaching Funnelback - HTML5 input patterns might be useful, but there's no guarantee that a spam bot would respect them.
If you're using the Modern UI, pre-process hook scripts would be worth investigating - a query can be examined before processing and compared against undesirable patterns. You could then set the query to empty, or generate a state similar to the initial search form.
See the example at: http://docs.funnelback.com/user_interface_hook_scripts.html#Processing%20additional%20input%20parameters
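To illustrate the kind of check a pre-process hook could run: a minimal sketch of the filtering logic, in Python for readability (the real hook would be a Groovy script using the hook API from the page linked above). The pattern list and function name here are hypothetical.

```python
import re

# Patterns that flag a query as spam; the URL-prefix check covers the
# submissions described above. The list itself is a hypothetical example.
SPAM_PATTERNS = [re.compile(r"^https?://", re.IGNORECASE)]

def clean_query(query):
    """Blank out a query that matches a spam pattern, mirroring what a
    pre-process hook could do before the query is run."""
    if any(p.match(query) for p in SPAM_PATTERNS):
        return ""
    return query

print(clean_query("http://likespromotion.com/"))  # prints an empty line
print(clean_query("annual report"))               # prints: annual report
```

Setting the query to empty (rather than rejecting the request) keeps the user on something close to the initial search page, which is usually the least surprising behaviour.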
Alternatively, you could configure your search forms to submit a benign URL key/value pair that is generated only by user agents that support client-side scripting. A pre-process hook can then check for this value in the submitted parameters and terminate processing if it is absent.
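The token check itself is trivial; a sketch in Python, again standing in for the Groovy hook. The parameter name `js_token` and its value are hypothetical - a real form would add them via JavaScript on submit, so a non-scripting bot never sends them.

```python
def is_browser_submission(params):
    """Return True when the benign token added by client-side scripting
    is present and correct; submissions without it are assumed to come
    from bots. 'js_token' and its expected value are made-up names."""
    return params.get("js_token") == "human"

print(is_browser_submission({"query": "annual report", "js_token": "human"}))  # True
print(is_browser_submission({"query": "http://spam.example/"}))                # False
```

Note this only raises the bar - a bot that executes JavaScript, or that replays a captured request, will still pass, so it works best combined with the pattern check above.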
IP address filtering in the reporting blacklist might also be useful, assuming these spam bots come from a fixed set of IP addresses.
Thanks Gordon,
Unfortunately, in our environment we don’t have full control over the web pages where the search form is used, so I will need to look into pre-process hooks then.
I filter some of them by IP, but the list keeps growing.
Vitali