Prevent LLM scrapers/trawlers?

Thank you so much for sharing this @Josue! I just added this to my YunoHost configuration.

Nevertheless, I’d really like to go beyond simply copy-pasting and understand what this script does. I see that it matches known AI bots with a regex, but where does that configuration end up, and how is it applied?

Imho, two improvements could be added:


Why choose this error? To fool the AI and make fun of it, I’d choose 418 instead :zany_face:

How would you suggest doing this?

444 is NGINX-specific: the server closes the connection without sending any response at all. Imho, the less information you give to attackers/abusers, the better.
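For what it’s worth, switching between the two is a one-line change in the generated conf — a sketch showing the alternatives:

```nginx
# 444: NGINX closes the connection without sending anything back
return 444;

# ...versus the RFC 2324 teapot joke, which still reveals a live,
# configured server answering requests:
return 418;
```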

I stupidly made a bash script which wgets the file and added it to my crontab, then used the file in the hook with cat (in this example, the file is /home/ai-robots_list):

[...]
cat << EOF >> $nginx_conf

# Some really bad bot with legacy user agent
if (\$http_user_agent ~* "(iPod|MSIE|Trident/|Presto/|PPC Mac OS X|Gecko/\\d{4}-|C(?:riOS|hrome)/(?:\\d{1,2}|1[0-1]\\d|12[0-4])\\.|F(?:irefox|xiOS)/(?:[0-9]{1,2}|1[1-2][0-9]|130)\\.|Version/(?:[4-9]|1[0-6]).*Safari/)") {
    return 444;
}

# List from https://github.com/ai-robots-txt/ai.robots.txt/blob/main/nginx-block-ai-bots.conf
EOF
cat /home/ai-robots_list >> $nginx_conf
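For reference, the cron side can stay a one-liner. This is just a sketch: the cron file name and daily schedule are arbitrary, and the raw URL is my guess from the repo link in the comment above:

```
# /etc/cron.d/ai-robots-list (hypothetical): refresh the list daily at 04:00
0 4 * * * root wget -q -O /home/ai-robots_list https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/nginx-block-ai-bots.conf
```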

(Removed while debugging an issue)


Oops, something is wrong… Now the diagnosis is showing tons of NGINX errors in the log…

I deleted the file and successfully regenerated the conf, but everything is still unreachable via the web, and I can only access the server via SSH… Any ideas, @tituspijean?

I cannot replicate your issue :confused:
What’s the output of sudo nginx -t and sudo systemctl status nginx?


I panicked so I reset everything and got back to @Josue’s script without wget. It works well enough for me, and I don’t understand the details well enough to experiment with more elaborate settings.

Thank you for your help :sunflower:

Hey,

Sorry to dig up the discussion, but if you just create the script like that, don’t you have to set permissions on it?


thanks! I want to implement this… or will this become a default in YunoHost?
If I want to implement it, should I just create a
/usr/share/yunohost/hooks/conf_regen/97-nginx_rebots-block
with that code?

is this up to date?

thanks!
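Concretely, I was thinking of something like this (a sketch from reading the thread — whether the hook needs the execute bit, and the regen-conf step, are my guesses):

```shell
# Put the hook in place (execute bit set just in case), then regenerate
sudo install -m 755 97-nginx_rebots-block /usr/share/yunohost/hooks/conf_regen/
sudo yunohost tools regen-conf nginx
sudo nginx -t   # sanity-check the generated configuration before trusting it
```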


I haven’t applied this fix yet. Are folks satisfied with it? Is it cutting down scraper traffic significantly?

Are folks combining this with any rate-limiting approaches?

It would be nice if this went into the default YunoHost config!
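By rate limiting I mean something along the lines of NGINX’s limit_req module — a sketch, where the zone name, size, and rates are arbitrary:

```nginx
# In the http context: track clients by IP, allow ~10 requests per second
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location / {
        # Allow short bursts of 20 requests, reject the rest immediately
        limit_req zone=perip burst=20 nodelay;
    }
}
```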


I second this. It is a great piece of software, but I’m unsure how well it is supported by YunoHost.

I think it is working for me, at least.

Hi,
Did you manage to install it without any problem?

Yeah, I think so. I just edited the file.

Just came across this fork of nginx bad bots blocker specifically for Fediverse servers

Why this fork?

  • The default configuration for this blocker prevents fedi software such as Mastodon/GoToSocial/IceShrimp/Akkoma from federating correctly.
  • It also blocks a lot of Tor exit nodes as a result of them getting caught up in bad traffic.
  • This semi-hard fork of the project exists to solve this, so it’s suitable for fedi admins and people who wish to have their services available to Tor users. This is achieved by keeping a list of keywords for removal, along with retrieving the list of all Tor exit nodes from TorProject to remove matches.
  • In addition to the above, I’ve made the deny.conf compatible with running Anubis or go-away behind this blocker.
  • And lastly, this is a semi-hard fork which is able to stay working and updated even when upstream is broken. I used to just merge and comment out matches; now I generate the blocklist independently using the lists provided by upstream, plus my own, and, most importantly, retrieve the 10,000 top reported IPs directly from AbuseIPDB’s API. You should still use the instructions from upstream for installation, though.
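The Tor-exit filtering step can be sketched as a simple grep pipeline — every file name here is hypothetical, and the sample data is made up just to show the idea:

```shell
#!/bin/sh
# Sketch: drop deny.conf entries whose IP appears in a saved Tor exit-node
# list, so Tor users are not caught by the bad-bot rules.
set -eu

# Sample inputs. In reality tor-exits.txt would be fetched from TorProject
# and deny.conf would be the list generated from upstream.
printf 'deny 192.0.2.10;\ndeny 198.51.100.7;\n' > deny.conf
printf '192.0.2.10\n' > tor-exits.txt

# Keep only the deny lines that do NOT mention a Tor exit IP
grep -v -F -f tor-exits.txt deny.conf > deny.filtered.conf
cat deny.filtered.conf   # -> deny 198.51.100.7;
```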