[Proposal] AWS block app

g1smo · January 10, 2025, 2:04am

Hello yunohost community.
I have an app proposal with a somewhat radical idea behind it, though its purpose is very practical. I am wondering whether there is interest in the community for such an app, and I will explain the need I see for it.

What the app does is block entire IP ranges belonging to Amazon at the firewall level. Why? To block AI scrapers that put an unbearable load on your server.
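
A minimal sketch of how such a block could be built, assuming the server has `ipset` and `iptables` available. Amazon publishes its address ranges as JSON at https://ip-ranges.amazonaws.com/ip-ranges.json; the function names, the set names and the choice of ipset are illustrative assumptions, not an existing YunoHost app.

```python
import ipaddress
import json
import urllib.request

# Public list of AWS address ranges, published by Amazon.
AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"


def fetch_ranges(url=AWS_RANGES_URL):
    """Download and parse the AWS ranges document (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def aws_networks(doc):
    """Return (ipv4, ipv6) network lists, with adjacent ranges merged."""
    v4 = [ipaddress.ip_network(p["ip_prefix"]) for p in doc.get("prefixes", [])]
    v6 = [ipaddress.ip_network(p["ipv6_prefix"]) for p in doc.get("ipv6_prefixes", [])]
    return (list(ipaddress.collapse_addresses(v4)),
            list(ipaddress.collapse_addresses(v6)))


def ipset_commands(networks, set_name, family):
    """Build shell commands that fill an ipset; family is 'inet' or 'inet6'."""
    commands = [f"ipset create {set_name} hash:net family {family} -exist"]
    commands += [f"ipset add {set_name} {net} -exist" for net in networks]
    return commands


def drop_rule(set_name, ipv6=False):
    """Build the firewall rule that drops traffic coming from the set."""
    binary = "ip6tables" if ipv6 else "iptables"
    return f"{binary} -I INPUT -m set --match-set {set_name} src -j DROP"
```

Printing the result of `ipset_commands(...)` and `drop_rule(...)` gives a script that can be reviewed before it is run as root, which seems safer than applying rules blindly given how many services depend on AWS.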

I have been maintaining a server for a dozen dozens of users for a couple of years now. It’s not a bad server. Usually the load oscillates between 2 and 3, but recently it has often gone over 20. After investigating why gitea was taking up so much memory and CPU, I noticed a giant flood of requests coming from Amazon-owned IP addresses.

A lot of these seemed to be bots from AI companies scraping our forge. I am aware that writing a proper robots.txt file can opt you out. But we still want our forge to be indexed by search engines, and I do not want to add exceptions to robots.txt for every new private service that is melting our planet. I think such damage should be opt-in, not opt-out. We maintain our infrastructure for people to use, not for machines to help capital profit.

So, the point of this app would be to offer a very easy way to improve the performance of smaller servers. One barrier is the fact that letsencrypt uses AWS EC2 instances, but that can be fixed via renewal hooks. The other problem is that docker, which many people might be using, is also hosted on AWS. But a warning in the app description can let people know that some functionality may degrade after installing this app.

After AWS was blocked, there was a smaller flood of requests coming in, most of them from Meta. After blocking facebook, a much smaller flood consisted of requests from micro$oft (I’m not kidding). After blocking some M$ blocks, there was just a tiny trickle of requests coming from various other parts of the internet. And all of this came from looking at our gitea access log.

The app could have options to also block microsoft, meta and any other malignant actors. That would involve the maintenance work of keeping moderation lists, and maybe categorizing networks by why and how they are malignant.

So, dear community, what do you think?

Kind regards, Jurij (from kompot)

aris · January 10, 2025, 10:41am

What a nice idea

loowiz · February 9, 2025, 12:53pm

Really interesting idea. I like the idea of blocking these crawlers, but I’d imagine blocking all Amazon-owned IP addresses would be damaging for federated apps, like Nextcloud and Mastodon. It would mean anyone who hosts one of these apps on Amazon servers won’t be able to interact with servers running this proposed app, right?

Also, here’s a related post: Prevent LLM scrapers/trawlers? - #12 by okam_rzr

g1smo · February 10, 2025, 11:36pm

Blocking crawlers was why I got the idea.
Of course the app would come with a warning that connectivity may be hindered.

Federating with infrastructure on Amazon can be damaging by itself. I for one do not want it. There are loads of better options to choose from.

While blocking Amazon (and various other nasty providers; Google, for example, keeps violating users’ rights and is complicit in the genocide in Gaza) is one app idea, blocking AI crawlers is another.

We should find a way to share a community-curated blocklist of known AI crawlers. Crowdsec looks like an interesting solution that may work.

It would be important to think of a good policy on how to curate the list.

Looking for too many requests in the access log of gitea is a good starting step.
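
A rough sketch of that starting step, assuming each access log line begins with the client IP address (the actual gitea log format depends on its configuration, so the parsing may need adjusting). The function name and the /24 and /48 grouping sizes are illustrative choices.

```python
import ipaddress
from collections import Counter


def top_networks(lines, prefix_v4=24, prefix_v6=48, limit=10):
    """Count requests per network and return the busiest ones first.

    Assumes the first whitespace-separated field of each line is the
    client IP; lines that do not start with a valid address are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue
        prefix = prefix_v4 if addr.version == 4 else prefix_v6
        # strict=False masks the host bits, so 192.0.2.17 maps to 192.0.2.0/24.
        network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts.most_common(limit)
```

Passing an open log file works directly, for example `top_networks(open(path))` with `path` pointing at the access log. Grouping by network rather than by single address helps because scrapers tend to spread their requests over many addresses in the same range.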

Edit: whoops, looks like crowdsec is also running on Amazon.
I’m a bit surprised how many services run on amzn infrastructure. Should we not care where we connect to on the internet?