Alright, so AI scrapers? I’m sure you’ve heard: they’re kinda awful. Go figure. To give you an idea, I recently had a couple of scrapers hit my Forgejo server at the same time. I saw a peak of 33 requests/second, which just isn’t sustainable on a site like that - they were crawling git history, which is an expensive operation, eating up ridiculous amounts of memory and, more importantly, CPU time. The i7-6700K in that box was stuck at 100% load for hours before I noticed.
At that point I decided enough was enough, and went to dig up how to set up Anubis on Nix. It’s not… particularly well documented, as far as I could find, so I ended up having to work it out myself. So here’s what I came up with!
I’m going to use a fake service here; fill in your own:
let
  address = "127.0.0.1:1337";
in
{
  services.fakeServer = {
    enable = true;
    address = address;
    # That could be a port, an address, whatever your service takes.
  };
  services.nginx = {
    # nginx boilerplate
    virtualHosts = {
      "fakeServer.krutonium.ca" = {
        # proxyPass wants a full URL, not just host:port
        locations."/".proxyPass = "http://${address}";
      };
    };
  };
}
So what we have here is nginx, directly proxying traffic destined for fakeServer.krutonium.ca to the service, and honestly that’s usually just fine, especially if it’s a static site.
In our case, though, we know it isn’t: Fake Server has to do something expensive on certain pages.
So! How do we defend it? Anubis!
What we need to do is stick Anubis into the middle of the chain and let it screen the traffic. Even with mostly default settings, it’s pretty solid.
let
  address = "127.0.0.1:1337";
in
{
  services.fakeServer = {
    enable = true;
    address = address;
    # That could be a port, an address, whatever your service takes.
  };
  services.nginx = {
    # nginx boilerplate
    virtualHosts = {
      "fakeServer.krutonium.ca" = {
        # Point nginx at Anubis' unix socket instead of the service itself.
        locations."/".proxyPass = "http://unix:/run/anubis/anubis-fakeService/anubis.sock:/";
      };
    };
  };
  # And now for the Anubis magic: it's going to sit in between fakeServer and nginx.
  services.anubis.instances = {
    fakeService = {
      enable = true;
      # IMPORTANT - This must match nginx's group, or nginx won't be able to read the socket!
      group = "nginx";
      settings = {
        # How hard the proof-of-work challenge is (higher = harder for bots)
        DIFFICULTY = 5;
        # Where Anubis forwards legitimate traffic - this needs to be a full URL
        TARGET = "http://${address}";
        # Where to point nginx
        BIND = "/run/anubis/anubis-fakeService/anubis.sock";
        # Where to send statistics - you can plug this into Grafana or whatever you use.
        # In theory these can also be TCP, but from what I can tell that mode is being deprecated.
        METRICS_BIND = "/run/anubis/anubis-fakeService/anubis-metrics.sock";
        # You should also let it serve your robots.txt, as it'll help well-behaved bots behave.
        # Looking at you, Facebook/Meta, who, and I'm making a point here, ignored it.
        SERVE_ROBOTS_TXT = true;
      };
    };
  };
}
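This pattern scales to as many services as you like: one Anubis instance per backend, each with its own socket. Here’s a rough sketch of adding a second instance - note that `gitService`, the domain, and the port here are all made up for illustration, not anything the Anubis module provides:

```nix
# Hypothetical second service - the name, domain, and port are illustrative.
services.anubis.instances.gitService = {
  group = "nginx";
  settings = {
    DIFFICULTY = 5;
    TARGET = "http://127.0.0.1:3000";
    BIND = "/run/anubis/anubis-gitService/anubis.sock";
    METRICS_BIND = "/run/anubis/anubis-gitService/anubis-metrics.sock";
    SERVE_ROBOTS_TXT = true;
  };
};

# And the matching vhost, pointed at that instance's socket:
services.nginx.virtualHosts."git.example.com".locations."/".proxyPass =
  "http://unix:/run/anubis/anubis-gitService/anubis.sock:/";
```

The only thing that has to line up is the socket path: whatever you put in `BIND` is exactly what the vhost’s `proxyPass` has to point at.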
And there you have it! Your site is now protected by Anubis!… But are you done? If your site doesn’t require Javascript to work, or you want your users to be able to opt out of it without issues, then you should modify the Anubis setup like this. It replaces the Javascript-based proof-of-work challenge with a meta refresh challenge - easier for bots to bypass, but just as effective at the moment, and importantly, it depends not on Javascript but on your browser understanding meta tags.
services.anubis.instances = {
  fakeService = {
    enable = true;
    # IMPORTANT - This must match nginx's group, or nginx won't be able to read the socket!
    group = "nginx";
    botPolicy.bots = [
      {
        # This segment looks for clients with Mozilla or Opera in their User-Agent, which
        # honestly covers basically every consumer browser, and switches them to the
        # `metarefresh` challenge instead.
        name = "generic-browser";
        user_agent_regex = "Mozilla|Opera";
        action = "CHALLENGE";
        challenge = {
          difficulty = 5;
          algorithm = "metarefresh";
        };
      }
    ];
    settings = {
      # How hard the proof-of-work challenge is (higher = harder for bots)
      DIFFICULTY = 5;
      # Where Anubis forwards legitimate traffic - this needs to be a full URL
      TARGET = "http://${address}";
      # Where to point nginx
      BIND = "/run/anubis/anubis-fakeService/anubis.sock";
      # Where to send statistics - you can plug this into Grafana or whatever you use.
      # In theory these can also be TCP, but from what I can tell that mode is being deprecated.
      METRICS_BIND = "/run/anubis/anubis-fakeService/anubis-metrics.sock";
      # You should also let it serve your robots.txt, as it'll help well-behaved bots behave.
      # Looking at you, Facebook/Meta, who, and I'm making a point here, ignored it.
      SERVE_ROBOTS_TXT = true;
    };
  };
};
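Since we’re already writing a bot policy, you can also use it to flat-out deny crawlers that ignore robots.txt, instead of bothering to challenge them. A rough sketch - the User-Agent strings in the regex are my assumption (pulled from what Meta’s crawlers report, but check your own access logs), not a vetted blocklist:

```nix
# Illustrative only - the regex is an assumption, not a vetted blocklist.
services.anubis.instances.fakeService.botPolicy.bots = [
  {
    # Hard-deny crawlers that ignore robots.txt anyway.
    name = "bad-scrapers";
    user_agent_regex = "facebookexternalhit|meta-externalagent";
    action = "DENY";
  }
  # Your CHALLENGE rule for generic browsers still goes here, after the DENY rule.
];
```

Rules are matched in order, so put the DENY rules before the broad browser challenge.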
And just like that, fakeService is now transparently handled by Anubis, and it’s even Javascript-free friendly!
Feel free to drop me a message on Mastodon if this helped you out, or if you need help or clarification on anything.
Have a great day, and happy Nixing!