LLM trawlers are becoming more aggressive and can consume substantial bandwidth and CPU power. For that and other reasons, I would guess most Yunohost users don’t want their sites to be trawled. Old approaches like robots.txt are not guaranteed to work, so other ways of stopping and/or poisoning LLM scrapers are being developed (iocaine, Nepenthes, etc.). It is reasonable to expect that stopping LLMs is a moving target. It is unreasonable to expect individuals to cope with it alone.
I suggest some Yunohost service or app to make it easier to (at least) block LLM scrapers. I am unsure what would be the best approach, but I guess a firewall-like approach would be possible: to drop packets coming from known scraper networks, based on a maintained list such as ai.robots.txt, and to keep that blocklist updated. Other, more aggressive ways are also possible, but they in turn consume more computing resources.
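To make the firewall idea concrete, here is a minimal, untested sketch. Note that ai.robots.txt itself publishes user-agent names rather than IP ranges, so this assumes a hypothetical plain-text feed of one IP/CIDR per line at `BLOCKLIST_URL`:

```python
#!/usr/bin/env python3
"""Refresh an ipset of scraper IP ranges that an iptables DROP rule references.

Assumes a companion rule like:
    iptables -I INPUT -m set --match-set llm-scrapers src -j DROP
"""
import subprocess
import urllib.request

# Hypothetical feed: one IP or CIDR per line, '#' for comments.
BLOCKLIST_URL = "https://example.org/llm-scraper-ips.txt"
SET_NAME = "llm-scrapers"

def fetch_ranges(url: str) -> list[str]:
    with urllib.request.urlopen(url, timeout=30) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]

def refresh(ranges: list[str]) -> None:
    # Build the new list in a temporary set, then atomically swap it in,
    # so there is never a window with an empty blocklist.
    tmp = SET_NAME + "-tmp"
    subprocess.run(["ipset", "create", "-exist", SET_NAME, "hash:net"], check=True)
    subprocess.run(["ipset", "create", "-exist", tmp, "hash:net"], check=True)
    subprocess.run(["ipset", "flush", tmp], check=True)
    for net in ranges:
        subprocess.run(["ipset", "add", "-exist", tmp, net], check=True)
    subprocess.run(["ipset", "swap", tmp, SET_NAME], check=True)
    subprocess.run(["ipset", "destroy", tmp], check=True)

if __name__ == "__main__":
    refresh(fetch_ranges(BLOCKLIST_URL))
```

Run from a cron job or systemd timer, something like this would keep the drop list current without any per-app configuration.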
Please don’t see this as yet another idea for others to do something about. Hopefully, it’s the start of a discussion that may or may not lead to something.
You’re not hijacking. Being able to subscribe to the abovementioned ai.robots.txt is an alternative. And the Mastodon bot block has been reversed, if I understand it correctly. This means we could need a Yunohost blocking mechanism that keeps an updated list of IPs/domains to block, regardless of what individual apps may or may not do. I think a firewall drop mechanism would be better than fail2ban, as the latter is about blocking failed logins. We want to block access.
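For the subscription part, ai.robots.txt publishes a machine-readable list of crawler user agents. A rough sketch of how a Yunohost-level mechanism could consume it (the robots.json URL and its exact shape are assumptions on my part; verify against the repo):

```python
#!/usr/bin/env python3
"""Generate an nginx snippet that 403s the user agents listed in ai.robots.txt.

Include the generated file in each server block, e.g.:
    include /etc/nginx/snippets/block-ai-agents.conf;
"""
import json
import re
import subprocess
import urllib.request

# Assumed location and shape of the machine-readable agent list; check the repo.
AGENTS_URL = ("https://raw.githubusercontent.com/ai-robots-txt/"
              "ai.robots.txt/main/robots.json")
SNIPPET = "/etc/nginx/snippets/block-ai-agents.conf"

def main() -> None:
    with urllib.request.urlopen(AGENTS_URL, timeout=30) as resp:
        agents = json.load(resp)  # assumed: a dict keyed by user-agent name
    pattern = "|".join(re.escape(name) for name in sorted(agents))
    with open(SNIPPET, "w", encoding="utf-8") as f:
        f.write('if ($http_user_agent ~* "(%s)") {\n    return 403;\n}\n' % pattern)
    subprocess.run(["systemctl", "reload", "nginx"], check=True)

if __name__ == "__main__":
    main()
```

Of course this only stops polite scrapers that send honest user agents; a firewall-level drop list would still be needed for the rest.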
Yes, this is one of the possibilities I mentioned above. While I don’t object to building tarpits for LLM scrapers, I don’t think it’s practical for the majority of Yunohost users. I, for one, don’t have the bandwidth and CPU power to spare for running the more aggressive solutions. I just want my content and my server to be available for real people and not language models (just as it should be for all websites), and I suggest a solution within Yunohost, preferably turned on by default.
I don’t agree with the premise. None of my (or anyone else’s) content should be used for “AI” training without prior written consent. AI scraping is stealing unless agreed to in advance.
At the risk of starting a religious war: “stealing” means depriving another party of something, while reading ‘for benefit’ may be an infringement.
The larger battle, I think, is how AI is governed: whether only private entities will reap the profits of jobs replaced by AI, and whether the media/web will be full of mediocre (at best; more probably nefarious) AI-generated content.
With enough of a push, AI may relieve humans of jobs they don’t enjoy, with the profits flowing to the community: for example, enabling an accountant replaced by AI to receive a livable welfare income while writing the novel he never got around to writing, or indeed letting authors earn sufficient income from their own stories while AI-generated placeholders bring in enough revenue.
You are right, the correct legal term is probably infringement. My point is that Yunohost users would do well with a simple mechanism to avoid such infringement. Unfortunately, it falls to us to establish such mechanisms, because the AI scrapers do not honour copyright law or common courtesy. They have to be fenced out. I am proposing that Yunohost could include that fence.
As for the rest of your post, I think it falls outside the scope of possible technical solutions in Yunohost, which is what this thread started with.
One way might be inserting links to external tarpits. It could be circumvented easily with some rules or heuristics on the scrapers’ side; we could obfuscate it with many fronting domains in front of the tarpit, or by creating an entry for a subdomain of our own domain that points to such a trap. This would offload the processing to the tarpits, but also be relatively easy for bad actors to circumvent.
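For the subdomain-trap variant, the trap itself can be tiny. A toy sketch of the idea behind tools like Nepenthes (not their actual code, just the principle: stream an endless page of self-referencing links, slowly):

```python
#!/usr/bin/env python3
"""Toy tarpit: serves an endless, slowly streamed page of random links.

Point a throwaway subdomain at this and link to it from a hidden anchor;
well-behaved visitors never see it, while scrapers get stuck chewing on it.
"""
import random
import string
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def junk_word() -> str:
    return "".join(random.choices(string.ascii_lowercase, k=8))

class Tarpit(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        try:
            self.wfile.write(b"<html><body>")
            while True:
                # Every link leads back into the tarpit under a fresh path.
                link = f'<p><a href="/{junk_word()}">{junk_word()}</a></p>'
                self.wfile.write(link.encode())
                self.wfile.flush()
                time.sleep(2)  # drip-feed to tie up the scraper cheaply
        except BrokenPipeError:
            pass  # client gave up

    def log_message(self, *args):
        pass  # keep the log quiet

if __name__ == "__main__":
    ThreadingHTTPServer(("127.0.0.1", 8088), Tarpit).serve_forever()
```

Note that even this cheap version holds one connection and one thread open per trapped scraper on our side, which is exactly the resource trade-off mentioned above.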
It also doesn’t help that anyone can run such a scraper, rendering blocklists somewhat unwieldy.
That does remind me: OPNsense has a plugin for CrowdSec. My firewall harvests suspicious activity which, combined with input from your firewall and others’, will put an IP on a blocklist or take it off.
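For the curious, CrowdSec’s local decisions can be read with its cscli tool and fed into whatever blocking layer Yunohost ends up using. A sketch (assuming cscli’s JSON output carries the banned address in a `value` field per decision, which is worth double-checking):

```python
#!/usr/bin/env python3
"""Pull current CrowdSec ban decisions and print the banned IPs."""
import json
import subprocess

def banned_ips() -> set[str]:
    out = subprocess.run(
        ["cscli", "decisions", "list", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    decisions = json.loads(out) or []  # output may be null when nothing is banned
    # Assumed shape: each decision carries the offending IP in "value".
    return {d["value"] for d in decisions if d.get("type") == "ban"}

if __name__ == "__main__":
    for ip in sorted(banned_ips()):
        print(ip)
```

In practice CrowdSec’s own firewall bouncer already does this job; the sketch just shows how little glue would be involved.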
It resembles @esist0’s suggestion, but automated.
Yes, something like that, maybe? Combined with ai.robots.txt, a simple list of IPs to block at the firewall level would be manageable even for small servers. I see the point of redirecting scrapers to tarpits, but that seems like rather more to manage and is also, as you say, possible to circumvent.
So, in the end, what we might need is for the Yunohost firewall (which is ufw, if I remember correctly) to include a list of IPs to drop, and a script to update that list from time to time.
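If it really is ufw underneath, the glue script is equally small. A rough sketch, again assuming a hypothetical plain-text feed of one address or CIDR per line:

```python
#!/usr/bin/env python3
"""Sync a downloaded drop list into ufw: add new deny rules, remove stale ones."""
import os
import subprocess
import urllib.request

BLOCKLIST_URL = "https://example.org/llm-scraper-ips.txt"  # hypothetical feed
STATE_FILE = "/var/lib/llm-blocklist/current.txt"

def fetch() -> set[str]:
    with urllib.request.urlopen(BLOCKLIST_URL, timeout=30) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    return {ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")}

def previous() -> set[str]:
    try:
        with open(STATE_FILE, encoding="utf-8") as f:
            return {ln.strip() for ln in f if ln.strip()}
    except FileNotFoundError:
        return set()

def main() -> None:
    new, old = fetch(), previous()
    for addr in new - old:
        subprocess.run(["ufw", "deny", "from", addr], check=True)
    for addr in old - new:
        subprocess.run(["ufw", "delete", "deny", "from", addr], check=True)
    os.makedirs(os.path.dirname(STATE_FILE), exist_ok=True)
    with open(STATE_FILE, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(new)) + "\n")

if __name__ == "__main__":
    main()
```

Run daily from cron, that covers the “update that list from time to time” part; the ipset variant sketched earlier would scale better if the list grows to thousands of entries.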
Alternatively, people might also be interested in an app which redirects the trawlers to tarpits.
Thanks for starting this discussion, I had also been thinking about this.
I’m sympathetic to the bot tarpit approach, but would be happy just to add them to fail2ban and not let them waste any additional CPU cycles.
A while back I installed darkvisitors on my website, and was dismayed but not surprised to see that 90% of the traffic was coming from these LLM/scraper bots.
I think a project like this would be great to add to Yunohost somehow, and at a glance it seems compatible:
Yes, this is possibly a good solution. I think fail2ban misses the point a bit, because this is not about stopping login attempts, which fail2ban does well. This is about stopping trawlers from accessing public-facing content, like Mastodon feeds, forums, blogs, webpages etc. that don’t require login. That looks more like a firewall’s task.