#crawler

Camelia :tranarchy_a_nonbinary: 🇵🇸 camelia@tech.lgbt
2025-05-24

Well, that's it. I've officially started using #Anubis to protect my self-hosted #Forgejo instance.

I didn't want to do it at first, but my nginx and fail2ban configurations weren't effective enough.

Down with LLMs!

#LLM #crawler

Tomas Norre :verified: tomasnorre@phpc.social
2025-05-23

Just released version 12.0.8 of the #TYPO3 #Crawler, with a number of fixes.

Thanks for all contributions. #HappyCrawling

github.com/tomasnorre/crawler/

2025-05-21

How do you deal with web crawlers and bots on your personal websites?
I can detect a lot of automated traffic on my own site. Some of your sites have those short-lived captchas that don't even need to be filled out and disappear after a second.
Now that I've started looking into this topic, I find as much information about anti-crawler measures as about anti-anti-crawler measures, and I get lost very quickly.

What are solutions you have found for this?

#website #crawler #bot #defense

2025-05-14

@tdp_org unbelievable! I've set up a #nepenthes tarpit on my personal blog and got over 1 million requests from an #amazon #crawler alone in less than 3 months! Other bots typically gave up after crawling tens of thousands of pages of bullshit. Naturally, my robots.txt tells crawlers not to crawl the tarpit ...
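For context, this is roughly the check a well-behaved crawler would run before requesting a page. It is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `/tarpit/` path, the `ExampleBot` name, and the `blog.example` host are all made up for illustration; the post does not say what the real file looks like.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that fences off a tarpit; "/tarpit/" is an assumed path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tarpit/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt allows `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory lines, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleBot", "https://blog.example/tarpit/page1"))
    print(may_fetch("ExampleBot", "https://blog.example/posts/hello"))
```

A crawler that skips this check, or ignores its result, is the kind that ends up deep inside the tarpit.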

Larvitz :fedora: :redhat: Larvitz@burningboard.net
2025-05-02

Anubis - Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers (anubis.techaro.lol)

I'll give that a try. Maybe it can reduce the AI crawler mess on my servers a little bit.

#ai #crawler #aicrawler #fckai #anibus
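To illustrate the general idea behind a proof-of-work gate, here is a generic hashcash-style sketch in Python. It is not Anubis's actual implementation; the challenge string, the difficulty measure (leading zero hex digits of a SHA-256 digest), and the function names are assumptions chosen for the example.

```python
import hashlib
from itertools import count


def _digest(challenge: str, nonce: int) -> str:
    """SHA-256 hex digest of the challenge concatenated with the nonce."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the smallest nonce whose digest starts with
    `difficulty` zero hex digits. Cost grows roughly 16x per extra digit."""
    prefix = "0" * difficulty
    for nonce in count():
        if _digest(challenge, nonce).startswith(prefix):
            return nonce
    raise AssertionError("unreachable: count() never ends")


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


if __name__ == "__main__":
    nonce = solve("example-challenge", 3)
    print(nonce, verify("example-challenge", nonce, 3))
```

The asymmetry is the point: solving takes many hashes, verifying takes one, so the cost lands on whoever makes lots of requests.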

Orhun Parmaksız 👾 orhun@fosstodon.org
2025-04-29

Is this the future of terminal gaming? 🤯

⚔️ **ratthew** — A 3D dungeon crawler in the terminal.

🦀 Written in Rust!

🏗️ Built with @ratatui_rs + @bevy

⭐ GitHub: github.com/cxreiff/ratthew (WIP)

#rustlang #ratatui #tui #terminal #gaming #3d #bevy #commandline #dungeon #crawler

2025-04-16

The pressure on public and non-profit infrastructure is growing, whether on open-access repositories or, as in this text, on #Wikimedia: diff.wikimedia.org/2025/04/01/

It is good that the content is included in the training of artificial intelligence, but given the disproportionate load that results from commercial interests, I wonder whether there shouldn't actually be some form of compensation.

#Crawler #KI #OpenAccess

Gea-Suan Lin gslin@abpe.org
2025-04-13

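Blocking bots in the era of rampant LLM crawlers...

After "moving the wiki back to a machine at home", it became much easier to see the load on it (because there is currently only one site on that machine). This is the weekly graph from monitorix: This is the monthly graph: Ever since the move I have seen crawler traffic sweeping over it. At first I didn't pay much attention, but then it kept getting worse (almost every bot speeds up once it sees you can handle the load), so I looked into ways of blocking bots in Caddy. I use two approaches here: one IP-based and one User-Agent-based. The IP-based part uses caddy-defender to block all the common bot network ranges (including cloud and VPS ranges): defender block { ranges aws azurepubliccloud…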

blog.gslin.org/archives/2025/0

#blocker #bot #caddy #crawler #defender #llm #php #web #wiki
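The User-Agent-based approach mentioned in the post can be sketched in a few lines of Python. This is only an illustration of the technique, not the Caddy configuration from the linked blog post (which is truncated above). The token list is an assumed example; a real deployment would maintain its own.

```python
from typing import Iterable, Optional

# Illustrative tokens only; which crawlers to block is a site-specific choice.
BLOCKED_UA_TOKENS = ("gptbot", "claudebot", "bytespider")


def is_blocked_user_agent(
    user_agent: Optional[str],
    tokens: Iterable[str] = BLOCKED_UA_TOKENS,
) -> bool:
    """Return True if the User-Agent header contains any blocked token.

    Matching is a case-insensitive substring test. A missing header is not
    blocked here; whether to reject empty User-Agents is a separate decision.
    """
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(token in ua for token in tokens)


if __name__ == "__main__":
    print(is_blocked_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(is_blocked_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```

This only stops crawlers that identify themselves honestly, which is why the post pairs it with IP-based blocking.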

Allanon 🇮🇹 :amiga: allanon@mastodon.uno
2025-04-11

Just for the record, some of the most aggressive ones are those from #Microsoft & #Bing:

BingBot : 52.167.144.*
BingBot : 40.77.167.*

I had some intense visits from #OpenAI too:
OpenAI : 52.255.111.84-87

...at least according to their #useragent

#bot #crawler #scraper
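A quick way to check log entries against the ranges listed above, sketched with Python's standard `ipaddress` module. The wildcard ranges are written as /24 networks, and 52.255.111.84-87 happens to be exactly a /30. The labels simply repeat what the post reports; the `classify` helper is a hypothetical name.

```python
import ipaddress
from typing import Optional

# Ranges as reported in the post, converted to CIDR notation.
REPORTED_RANGES = {
    "BingBot": [
        ipaddress.ip_network("52.167.144.0/24"),  # 52.167.144.*
        ipaddress.ip_network("40.77.167.0/24"),   # 40.77.167.*
    ],
    "OpenAI": [
        ipaddress.ip_network("52.255.111.84/30"),  # 52.255.111.84-87
    ],
}


def classify(ip: str) -> Optional[str]:
    """Return the label of the first reported range containing `ip`, else None."""
    addr = ipaddress.ip_address(ip)
    for label, networks in REPORTED_RANGES.items():
        if any(addr in net for net in networks):
            return label
    return None


if __name__ == "__main__":
    for ip in ("52.167.144.10", "52.255.111.86", "192.0.2.1"):
        print(ip, classify(ip))
```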

Michiel Scholten diginaut
2025-04-07

Why I am sort of afraid to share my projects now, and link to my own Git server dammit.nl/afraid-to-git.html

2025-04-06

Since various #ki #ai #crawler treat the public resources of open source projects with such particular respect, I have decided to lock exactly those out. In the past we had crawls that our #monitoring classified as #ddos.

Various ASes now enjoy a permanent 429, a few that ruin it for everyone…

Sad to see #lore had to check I wasn't a bot to get through to the archive. I guess these are the #crawler ridden times we live in now.

kevin ⁂ (he/him) KevinGimbel@fosstodon.org
2025-03-31

🔗 RE: "Please stop externalizing your costs directly into my face"

AI training is controversial at best. If you say AI is trained fairly you're either very blind to the reality of things, or very naive - or both. None of the big AI tools are trained ethically, and this example from SourceHut just shows it.

👉 kevingimbel.de/link-blog/re-ht

#linkblog #SuggestedRead #AI #Crawler

🌈 BarbaPulpe 😇 ᴹᵃˢᵗᵒᵈᵒⁿ barbapulpe@gayfr.social
2025-03-28

I'm wondering whether I should do that too... I clearly have traffic spikes coming from HK and SG that run searches on rss.gayfr.online... AI robots, no doubt. And they are hard to counter because of multiple IP addresses and a misleading user agent.

The alternative would be to block those countries, but I don't like that solution.

agate.blue/2025/03/27/Pi%C3%A9

#gayfr #crawler

2025-03-27

#YaCy

YaCy is a decentralized open source search engine based on the principle of peer-to-peer (P2P) networks. It lets users crawl and index the web independently, thereby guaranteeing data privacy without a central server. YaCy can be used for personal, enterprise or community search, offering a privacy-respecting alternative to traditional search engines.

project4geeks.org/yacy-moteur-

Kevin Dominik Korte kdkorte@fosstodon.org
2025-03-27

I understand the pain these projects feel from AI crawlers eating up their bandwidth. Banning just well-behaved crawlers from my own blog has eliminated 1/6 of my traffic.
#AI #crawler
arstechnica.com/ai/2025/03/dev

N-gated Hacker News ngate
2025-03-25

🚀 Breaking news: are now playing whack-a-mole with crawlers, 🤖 but instead of using moles, they're blocking whole countries! 🌍 Because nothing says 'technological advancement' like nuking an entire nation's to stop pesky bots. 🙃
arstechnica.com/ai/2025/03/dev
