
How to block site grabber bots even if they are using the Tor network

If you manage websites that display data from non-free third-party APIs, such as stock exchange, geolocation or travel agency data, sooner or later you will notice site grabbers trying to pull that data from your website. Even more so if the data changes on a daily basis. Blocking them can turn into a never-ending task.

Not only do those grabbers cause additional costs for API access, traffic and CPU time, they may also distort your server-side website statistics. JavaScript-based analysis tools are usually not affected, because grabber scripts rarely execute JavaScript.

Usually you would just block the IP in your firewall appliance or iptables setup. But serverless setups switch addresses easily, and even rented servers can get a new IP address quite fast. Even more so if the grabber script is using the Tor network, which means the requests will come from thousands of different IP addresses.

Managing all those IPs in a firewall appliance or even in iptables can get quite complicated and time-consuming. This is where ipset in combination with iptables comes in: it keeps all those IPs neatly in one place, where they are easy to manage.

Ipset

Ipset is an iptables extension that comes with its own command line tool.

On Debian/Ubuntu installing it is quite easy:

apt-get install ipset

New lists can be created using ipset -N. For now we create a new list called “grabber”, a hash table in which we will store the IPs to be blocked.

ipset -N grabber iphash

Now we need to tell iptables that it should use our new list:

iptables -A INPUT -m set --match-set grabber src -j DROP

You should adapt this to your current iptables setup if there already is one. This line drops all traffic from source IP addresses that are found in the ipset list “grabber”. To keep this working after a reboot, both the list and that line have to be recreated by a script run at boot time.
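A boot-time script along those lines could look like the following. This is only a sketch: the function name setup_grabber_block and the --apply switch are made up for this example, and it uses the long-form spelling "create ... hash:ip" of the ipset -N grabber iphash command shown above. It is written so that running it twice does not create a duplicate rule.

```shell
#!/bin/bash
# Sketch of a boot-time script. The set name "grabber" comes from the article;
# the function name and the --apply switch are hypothetical choices.

setup_grabber_block() {
  # -exist turns "create" into a no-op if the set is already there.
  ipset -exist create grabber hash:ip || return 1

  # iptables -C only checks whether an identical rule exists.
  # Append the DROP rule only if that check fails.
  if ! iptables -C INPUT -m set --match-set grabber src -j DROP 2>/dev/null; then
    iptables -A INPUT -m set --match-set grabber src -j DROP
  fi
}

# Do nothing unless explicitly asked to, e.g. "setup-grabber.sh --apply" as root.
if [ "${1:-}" = "--apply" ]; then
  setup_grabber_block
fi
```

Note that the list itself starts out empty after a reboot, so the IPs have to be added again afterwards.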

Tor network

If the grabbers are using the Tor network to launch their requests, blocking every single IP manually won’t suffice. In that case we can fetch the currently known list of Tor exit node IPs automatically and add them to the list.

Please note: This also blocks regular Tor users from accessing your server. If your website should remain reachable via the Tor network, this method won’t work for you.

The following bash script fetches all the IPs and adds them to our list:

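#!/bin/bash
# Address passed to the exit list service; replace the placeholder with your server's public IP.
IP="127.0.0.1"
# Fetch the exit list, drop comment lines and add each remaining address to the list.
wget -q "https://check.torproject.org/cgi-bin/TorBulkExitList.py?ip=$IP" -O - | sed '/^#/d' | while read -r EXIT_IP
do
  # -q keeps ipset quiet about addresses that are already in the list
  ipset -q -A grabber "$EXIT_IP"
done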

This script should be run on a regular basis, maybe even every couple of minutes.
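If the script runs that often, starting one ipset process per address adds up. One possible variation is to turn the downloaded list into input for ipset restore, which loads many entries in a single call. The sketch below assumes the list contains one IPv4 address per line; the function name ips_to_restore_lines is made up for this example. As a side effect it also drops anything that does not look like an IPv4 address, such as comments or an error page.

```shell
#!/bin/bash
# Sketch: turn a list of addresses on stdin into "ipset restore" input.
# The function name is hypothetical; $1 is the name of the ipset list.

ips_to_restore_lines() {
  # Keep only lines that look like a dotted IPv4 address,
  # then prefix each of them with "add <list name> ".
  grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$' | sed "s/^/add $1 /"
}
```

It could then be used in place of the while loop:

wget -q "https://check.torproject.org/cgi-bin/TorBulkExitList.py?ip=$IP" -O - | ips_to_restore_lines grabber | ipset -exist restore

The pattern is only a rough shape check, so it would still let through something like 999.1.1.1, which ipset itself then rejects.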

Enhancements

With this setup you can easily create your own script that checks the server logs for suspicious requests and blocks the offending IPs automatically using the ipset command. Just be sure not to add your own IP by accident 😉
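As a starting point, such a script could count requests per client IP and print every address above a limit, skipping the addresses you never want to block. This is a rough sketch under a few assumptions: the client IP is the first field of each log line (as in the common and combined log formats), "suspicious" simply means "too many requests", and the function name heavy_hitters and its arguments are made up for this example.

```shell
#!/bin/bash
# Sketch: print client IPs with more than a given number of requests in an
# access log, skipping addresses on an allow list.
# Arguments (hypothetical): $1 log file, $2 request limit,
# $3 comma-separated list of IPs that must never be blocked.

heavy_hitters() {
  awk -v limit="$2" -v allow="$3" '
    BEGIN {
      # Build a lookup table from the comma-separated allow list.
      n = split(allow, parts, ",")
      for (i = 1; i <= n; i++) skip[parts[i]] = 1
    }
    # Count one request per log line, keyed by the first field.
    { count[$1]++ }
    END {
      for (ip in count)
        if (count[ip] > limit && !(ip in skip)) print ip
    }
  ' "$1" | sort
}
```

The output can be fed to ipset in the same way as the Tor list. The log path, the limit and the allowed address below are placeholders:

heavy_hitters /var/log/nginx/access.log 1000 "203.0.113.10" | while read -r BAD_IP; do ipset -q -A grabber "$BAD_IP"; done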

Sources:

  • http://mikhailian.mova.org/node/194
