Prevent LLM scrapers/trawlers?

Thank you so much for sharing this @Josue! I just added this to my YunoHost configuration.

Nevertheless, I’d really like to go beyond simply copy-pasting and understand what this script does. I see that it matches known AI bots with a regex, but where does that configuration end up, and how is it applied?

Imho, two improvements could be added:


Why choose this error? To fool the AI and make fun of it, I’d choose 418 instead :zany_face:

How would you suggest doing this?

444 is NGINX-specific: the server closes the connection without sending any response at all. Imho, the less information you give to attackers/abusers, the better.
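For what it’s worth, switching between the two is a one-line change in the generated conf — a sketch showing the alternatives:

```nginx
# 444: NGINX closes the connection without sending anything back
return 444;

# ...versus the RFC 2324 teapot joke, which still reveals a live,
# configured server answering requests:
return 418;
```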

I stupidly made a bash script which wgets the file and added it to my crontab, then used the file in the hook with cat (in this example, the file is /home/ai-robots_list):

[...]
cat << EOF >> $nginx_conf

# Some really bad bot with legacy user agent
if (\$http_user_agent ~* "(iPod|MSIE|Trident/|Presto/|PPC Mac OS X|Gecko/\\d{4}-|C(?:riOS|hrome)/(?:\\d{1,2}|1[0-1]\\d|12[0-4])\\.|F(?:irefox|xiOS)/(?:[0-9]{1,2}|1[1-2][0-9]|130)\\.|Version/(?:[4-9]|1[0-6]).*Safari/)") {
    return 444;
}

# List from https://github.com/ai-robots-txt/ai.robots.txt/blob/main/nginx-block-ai-bots.conf
EOF
cat /home/ai-robots_list >> $nginx_conf
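For reference, the cron side can stay a one-liner. This is just a sketch: the cron file name and daily schedule are arbitrary, and the raw URL is my guess from the repo link in the comment above:

```
# /etc/cron.d/ai-robots-list (hypothetical): refresh the list daily at 04:00
0 4 * * * root wget -q -O /home/ai-robots_list https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/nginx-block-ai-bots.conf
```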

(Removed while debugging an issue)


Oops, something is wrong… Now the diagnosis is showing tons of NGINX errors in the log…

I deleted the file and successfully regenerated the conf, but everything is still unreachable via the web, and I can only access the server via SSH… Any ideas, @tituspijean?

I cannot replicate your issue :confused:
What’s the output of sudo nginx -t and sudo systemctl status nginx?


I panicked so I reset everything and got back to @Josue’s script without wget. It works well enough for me, and I don’t understand the details well enough to experiment with more elaborate settings.

Thank you for your help :sunflower:

Hey,

Sorry to dig up the discussion, but if you just create the script like that, don’t you have to set permissions on it?


thanks! I want to implement this… or will this become a default in YunoHost?
If I want to implement it, should I just create a
/usr/share/yunohost/hooks/conf_regen/97-nginx_rebots-block
with that code?

is this up to date?

thanks!
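Concretely, I was thinking of something like this (a sketch from reading the thread — whether the hook needs the execute bit, and the regen-conf step, are my guesses):

```shell
# Put the hook in place (execute bit set just in case), then regenerate
sudo install -m 755 97-nginx_rebots-block /usr/share/yunohost/hooks/conf_regen/
sudo yunohost tools regen-conf nginx
sudo nginx -t   # sanity-check the generated configuration before trusting it
```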


I haven’t applied this fix yet. Are folks satisfied with it? Is it cutting down scraper traffic significantly?

Are folks combining this with any rate-limiting approaches?

It would be nice if this went into the default YunoHost config!
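By rate limiting I mean something along the lines of NGINX’s limit_req module — a sketch, where the zone name, size, and rates are arbitrary:

```nginx
# In the http context: track clients by IP, allow ~10 requests per second
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location / {
        # Allow short bursts of 20 requests, reject the rest immediately
        limit_req zone=perip burst=20 nodelay;
    }
}
```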


I second this. It is a great piece of software, but I’m unsure how well it is supported by YunoHost.

I think it is working for me, at least.

Hi,
Did you manage to install it without any problem?

Yeah, I think so. I just edited the file.

Just came across this fork of nginx bad bots blocker specifically for Fediverse servers

Why this fork?

  • The default configuration for this blocker prevents fedi software such as Mastodon/GoToSocial/IceShrimp/Akkoma from federating correctly.
  • It also blocks a lot of Tor exit nodes as a result of them getting caught up in bad traffic.
  • This semi-hard fork of the project exists to solve this, so it’s suitable for fedi admins and people who wish to have their services available to Tor users. This is achieved by keeping a list of keywords for removal, along with retrieving the list of all Tor exit nodes from TorProject to remove matches.
  • In addition to the above, I’ve made the deny.conf compatible with running Anubis or go-away behind this blocker.
  • And lastly, this is a semi-hard fork which is able to stay working and updated even when upstream is broken. I used to just merge and comment out matches; now I generate the blocklist independently using the lists provided by upstream, plus my own, and, most importantly, retrieve the 10,000 top reported IPs directly from AbuseIPDB’s API. You should still use the instructions from upstream for installation, though.
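The Tor-exit filtering step can be sketched as a simple grep pipeline — every file name here is hypothetical, and the sample data is made up just to show the idea:

```shell
#!/bin/sh
# Sketch: drop deny.conf entries whose IP appears in a saved Tor exit-node
# list, so Tor users are not caught by the bad-bot rules.
set -eu

# Sample inputs. In reality tor-exits.txt would be fetched from TorProject
# and deny.conf would be the list generated from upstream.
printf 'deny 192.0.2.10;\ndeny 198.51.100.7;\n' > deny.conf
printf '192.0.2.10\n' > tor-exits.txt

# Keep only the deny lines that do NOT mention a Tor exit IP
grep -v -F -f tor-exits.txt deny.conf > deny.filtered.conf
cat deny.filtered.conf   # -> deny 198.51.100.7;
```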