Prevent LLM scrapers/trawlers?

eivind · January 28, 2025, 11:56am

LLM trawlers are becoming more aggressive, and can consume substantial bandwidth and CPU power. For that and other reasons, I would guess most Yunohost users don’t want their sites to be trawled. Old approaches like robots.txt are not guaranteed to work, so other ways of stopping and/or poisoning LLMs are developed (iocaine, Nepenthes etc.). It is reasonable to expect that stopping LLMs is a moving target. It is unreasonable to expect individuals to cope with it.

I suggest some Yunohost service or app to make it easier to (at least) block LLM scrapers. I am unsure what would be the best approach, but I guess a firewall-like approach would be possible: To drop packets coming from known botnets, like ai.robots.txt, and to keep the blocklist updated. Other more aggressive ways are also possible, but they in turn consume more computing resources.

Please don’t see this as yet another idea for others to do something about. Hopefully, it’s the start of a discussion that may or may not lead to something.

esist0 · January 28, 2025, 12:45pm

I think it would be cool to have a little ui on the admin panel to edit the robots.txt

Also a way to ‘‘subscribe’’ to other people’s list.

To be fair tho, maybe it’s more needed at application level rather than yuno level. Like Mastodon already allows what i just said…

Probably a yuno-made frontend for fail2ban would be enough? Like being able to check the ips and the agents and just ‘‘banning’’ them 2 clicks away

I hope i’m not hijacking your discussion, it’s something i was also thinking about these days

eivind · January 28, 2025, 1:47pm

You’re not hijacking To be able to subscribe to the abovementioned ai.robots.txt is an alternative. And the Mastodon bot block has been reversed, if I understand it correctly. This means we could need a Yunohost blocking mechanism that keeps an updated list of IPs/domains to block, disregarding what individual apps may or may not do. I think a firewall drop mechanism would be better than fail2ban, as the latter is about blocking logins. We want to block access.

wbk · January 28, 2025, 10:01pm

I still have to read it, but saw a related headline on Ars, AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt - Ars Technica

eivind · January 29, 2025, 6:59am

Yes, this is one of the possibilities I mentioned above. While I don’t object to building tarpits for LLM scrapers, I don’t think it’s practical for the majority of Yunohost users. I, for one, don’t have the bandwidth and CPU power to spare for running the more aggressive solutions. I just want my content and my server to be available for real people and not language models (just as it should be for all websites), and I suggest a solution within Yunohost, preferably turned on by default.

wbk · January 29, 2025, 8:50am

On the other hand though, I’d prefer the next breakthrough in AI to be biased based on my content, rather than on a cesspit like ex-Twitter.

Edit: Parallel to this thread, there is another thread running : Tromper les butineurs IA (fool AI crawlers)

eivind · January 30, 2025, 12:36pm

I don’t agree with the premise. None of my (or anyone else’s) content should be used for “AI” training except by prior written consent. AI scraping is stealing unless agreed to in advance.

wbk · January 30, 2025, 1:29pm

With the risk of starting a religious war, “stealing” means depriving another party of something, while reading ‘for benefit’ may be an infringement

The larger battle I think is how AI is governed; whether it will only be private entities that reap the profits of jobs replaced by AI, and that the media/web will be full of mediocre (at best, nefarious most probable) AI generated content.

With enough push, AI may relieve humans of jobs they don’t enjoy with the profits flowing to the community, for example enabling an accountant that is replaced by AI to receive livable welfare while starting to write the novel he never came across to writing, or indeed let authors receive sufficient income with their own stories while enough revenue is generated by AI generated placeholders.

eivind · January 30, 2025, 4:07pm

You are right, the correct judicial term is probably infringement. My point is that Yunohost users could do well with a simple mechanism to avoid such infringement. Unfortunately, it lies on us to establishish such mechanisms, because the AI scrapers do not honour copyright laws or common courtesy. They have to be fenced out. I am proposing that Yunohost could include that fence.

As for the rest of your post, I think it falls outside the scope of possible technical solutions in Yunohost, which is what this thread started with.

wbk · January 30, 2025, 4:33pm

Unfortunately, you are also right

One way might be inserting links to external tarpits. It could be avoided easily, with some rules or heuristics on the scrapers’ side; we may obfuscate it with many fronting domains on the end of the tarpit, or by creating an entry for a subdomain of our own domain that points to such a trap. It would off-load the processing to such tarpits, but also be relatively easy for bad actors to circumvent.

It also doesn’t help that anyone can run such a scraper, rendering blocklists somewhat unwieldy.

That does remind me, OPNsense has a plugin for Crowdsec. My firewall harvests suspicious activity, which combined with input from your firewall and of others, will drop an IP on a blacklist or take it off.

It relembles @esist0 suggestion, but than automated.

eivind · January 31, 2025, 7:02am

Yes, something like that, maybe? Combined with ai.robots.txt, simply a list of IPs to block on firewall level would be manageable even for small servers. I see the point in redirecting scrapers to tarpits, but that seems rather more to manage and also, as you say, possible to circumvent.

So, in the end, what we might need is for the Yunohost firewall (which is ufw, if I remember correctly) to include a list of IPs to drop, and a script to update that list from time to time.

Alternatively, people might also be interested in an app which redirects the trawlers to tarpits.

okam_rzr · January 31, 2025, 5:33pm

Thanks for starting this discussion, I had also been thinking about this.

I’m sympathetic to the bot tarpit approach, but would be happy just to add them to fail2ban, and not let them waste any additional cpu cycles.

I while back I had installed darkvisitors to my website, and dismayed but not surprised to see that 90% of the traffic was coming from these llm/scraper bots.

I think a project like this would be great to add somehow to yunohost, and from a glance seems compatible:

eivind · February 1, 2025, 9:10am

Yes, this is possibly a good solution. I think fail2ban misses the point a bit, because this is not about stopping login attempts, which fail2ban does well. This is about stopping trawlers from accessing public-facing content, like Mastodon feeds, forums, blogs, webpages etc. that don’t require login. That looks more like a firewall’s task.

okam_rzr · April 18, 2025, 5:20am

Anubis is another new anti scaper software, recently deployed on a unesco subdomain

anijatsu · May 6, 2025, 8:48am

I saw it appear on the wishlist, and there’s even a testing repo already!

However the latter is not released for a reason - it has no instructions, and looking through the source code myself, I think it’ll need to continue development for a while.

Since I understand Anubis usually hooks into the web server’s config - like Apache/nginx, and thus it’d likely have to modify most of YNH applications’ configs. I didn’t see that happen the last time I tried it (with the last commit version).

Josue · May 6, 2025, 4:17pm

Hello,

On my side I just blocked the bot by user agent on my nginx config. It won’t block all bot but this config should block most of then and also some really bad bot.

I just added this following content into: /usr/share/yunohost/hooks/conf_regen/97-nginx_rebots-block:

#!/bin/bash

action=$1
pending_dir=$4
nginx_conf="${pending_dir}/../nginx/etc/nginx/conf.d/security.conf.inc"

[[ "$action" == "pre" ]] || exit 0
[[ -e "$nginx_conf" ]] || exit 0

cat << EOF >> $nginx_conf

# Some really bad bot with legacy user agent
if (\$http_user_agent ~* "(iPod|MSIE|Trident/|Presto/|PPC Mac OS X|Gecko/\\d{4}-|C(?:riOS|hrome)/(?:\\d{1,2}|1[0-1]\\d|12[0-4])\\.|F(?:irefox|xiOS)/(?:[0-9]{1,2}|1[1-2][0-9]|130)\\.|Version/(?:[4-9]|1[0-6]).*Safari/)") {
    return 403;
}

# List from https://github.com/ai-robots-txt/ai.robots.txt/blob/main/nginx-block-ai-bots.conf
if (\$http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|aiHitBot|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
    return 403;
}
EOF

The advantage is that the config is lightweight and easy to setup. But you could have at some point a bot which pass over this.

eivind · May 7, 2025, 6:21am

Thank you for this. I will try it! I haven’t seen any other working alternative so far, so this (and the fact that it is lightweight) is welcome.

arkadi · May 8, 2025, 7:21am

Caddy has a module, Caddy Defender.

might there be some modules like this for NGINX?

SiM · May 16, 2025, 6:14am

Thanks @Josue for sharing.
I think there is a typo on your second “if” :

if ($http_user_agent ~*

Should be

if (\$http_user_agent ~*

Josue · May 16, 2025, 5:32pm

You are true, fixed