Well, that's it. I've officially started using #Anubis to protect my self-hosted #Forgejo instance.
I didn't want to do it at first, but my nginx and fail2ban configurations just weren't effective enough.
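Roughly what the setup looks like now (a minimal sketch, not my exact config; hostnames and ports are placeholders): nginx terminates TLS and hands everything to Anubis, and Anubis only forwards the clients that pass its challenge on to Forgejo.

```nginx
server {
    listen 443 ssl;
    server_name git.example.com;   # placeholder hostname

    location / {
        # Anubis sits in front of the backend; the port is an
        # assumption, match it to whatever Anubis binds to
        proxy_pass http://127.0.0.1:8923;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

Anubis itself then gets pointed at Forgejo's own port (3000 by default).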
Down with LLMs!
Just released version 12.0.8 of the #TYPO3 #Crawler, with a number of fixes.
Thanks for all the contributions. #HappyCrawling
How do you deal with web crawlers and bots on your personal websites?
I can detect a lot of automated traffic on my own site. Some of your sites use those short-lived captchas that don't even need to be filled out and disappear within a second.
When I started digging into this topic, I found as much information about anti-crawler tools as about anti-anti-crawler tools, and got lost pretty quickly.
What are solutions you have found for this?
@tdp_org unbelievable! I've set up a #nepenthes tarpit on my personal blog and logged over 1 million requests from a single #amazon #crawler in less than 3 months! Other bots typically gave up after crawling tens of thousands of pages of bullshit. Naturally, my robots.txt says not to crawl the tarpit ...
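For the record, the robots.txt part is a one-liner (the path is a placeholder; mount nepenthes wherever you like):

```
User-agent: *
Disallow: /tarpit/
```

Anything that crawls past that line has explicitly opted in to the bullshit.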
Anubis - Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers (https://anubis.techaro.lol)
I'll give that a try. Maybe it can reduce the AI crawler mess on my servers a little.
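The core proof-of-work idea, boiled down to a few lines of Python (a sketch of the general hashcash-style scheme, not Anubis's actual implementation; the difficulty value is illustrative):

```python
import hashlib
import itertools

DIFFICULTY = 4  # required leading zero hex digits (illustrative value)

def solve(challenge: str) -> int:
    """Client side: grind nonces until the hash is 'small enough'."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single cheap hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

nonce = solve("per-session-challenge")         # costs the client real CPU time
assert verify("per-session-challenge", nonce)  # costs the server almost nothing
```

A real browser pays that cost once and moves on; a crawler hammering thousands of pages pays it over and over.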
Is this the future of terminal gaming? 🤯
⚔️ **ratthew** — A 3D dungeon crawler in the terminal.
🦀 Written in Rust!
🏗️ Built with @ratatui_rs + @bevy
⭐ GitHub: https://github.com/cxreiff/ratthew (WIP)
#rustlang #ratatui #tui #terminal #gaming #3d #bevy #commandline #dungeon #crawler
The pressure on public and non-profit infrastructure keeps growing, whether for open-access repositories or, as in this piece, #Wikimedia: https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/
It's good that the content gets included in the training of artificial intelligence, but given the disproportionate load that results from commercial interests, I wonder whether there shouldn't actually be some form of compensation.
Blocking bots in the age of rampant LLM crawlers...
After "moving the wiki back onto the machine at home", it became much easier to see the load on it (since it's currently the only site on that machine); the monitorix weekly and monthly graphs make it obvious. Ever since the move I've been seeing crawler traffic sweeping the site. I didn't pay it much attention at first, but it kept getting worse (almost every bot speeds up once it sees you can handle the load), so I finally looked into ways of blocking bots in Caddy. I went with two approaches, one IP-based and the other User-Agent-based. The IP-based part uses caddy-defender, blocking all the common bot ranges (including cloud and VPS ranges):

```
defender block {
    ranges aws azurepubliccloud…
```
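The User-Agent side is just an ordinary Caddy matcher. Something in this shape works (the bot names here are examples, not my exact list):

```
@aibots header_regexp User-Agent (?i)(gptbot|claudebot|bytespider)
respond @aibots 403
```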
#blocker #bot #caddy #crawler #defender #llm #php #web #wiki
Just for the record, some of the most aggressive ones are from #Microsoft & #Bing:
BingBot : 52.167.144.*
BingBot : 40.77.167.*
I had some intense visits from #OpenAI too:
OpenAI : 52.255.111.84-87
...at least judging by their #useragent
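If you're on nginx, turning those into blocks is three deny lines (a sketch; note that .84-.87 is exactly a /30):

```nginx
deny 52.167.144.0/24;   # BingBot
deny 40.77.167.0/24;    # BingBot
deny 52.255.111.84/30;  # OpenAI, covers .84-.87
```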
Since various #ki #ai #crawler treat the public resources of open source projects with such exemplary respect, I've decided to lock them out. In the past we've had crawls that our #monitoring classified as a #ddos.
Various ASes now get to enjoy a permanent 429; a few bad actors ruining it for everyone…
🔗 RE: "Please stop externalizing your costs directly into my face"
AI training is controversial at best. If you say AI is trained fairly, you're either very blind to the reality of things, or very naive - or both. None of the big AI tools are trained ethically, and this example from SourceHut just shows it.
I wonder if I shouldn't do the same... I'm clearly seeing traffic spikes from HK and SG running searches against rss.gayfr.online... AI bots, without a doubt. And hard to counter, since they come from many IP addresses with deceptive user agents.
The alternative would be to block those countries outright, but I don't like that solution.
#YaCy
YaCy is a decentralized open-source search engine built on the peer-to-peer (P2P) model. It lets users browse and index the web independently, guaranteeing data privacy without any central server. YaCy can be used for personal, corporate, or community search, offering a privacy-respecting alternative to traditional search engines.
https://project4geeks.org/yacy-moteur-de-recherche-decentralise-3/
I understand the pain these projects feel from AI crawlers eating up their bandwidth. Banning just the well-behaved crawlers from my own blog eliminated a sixth of my traffic.
#AI #crawler
https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
🚀 Breaking news: #Developers are now playing whack-a-mole with #AI crawlers, 🤖 but instead of using moles, they're blocking whole countries! 🌍 Because nothing says 'technological advancement' like nuking an entire nation's #internet #access to stop pesky bots. 🙃
https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/ #Crawler #Crisis #TechNews #HackerNews #ngated