#Scrapers

Michael Grinder (@mgrinder)
2025-06-11

Not sure why I had to do two CAPTCHAs before I could pay our dentist. I'd entered all the credit card information already. Are OpenAI's scrapers paying dentist bills now?

Andrey DarkCat09 (@darkcat09@dc09.ru)
2025-05-14

Well, it had to happen…

#Anubis is deployed on:

  • lxv.dc09.ru (LiteXiv)
  • wp.dc09.ru (Wikimore)
  • ak.dc09.ru (Akademik)
  • specific routes[^1] on git.dc09.ru (Forgejo)

#AI #scrapers won't overload my server for their dirty purposes anymore and won't get more data than a legitimate search engine following the robots.txt rules.

Note that I'm using the default configuration, which allows requests from user agents like curl (i.e. ones that don't pretend to be browsers) through without a captcha, so your scripts will keep working if you have any.

#dc09ru #bots

[^1] Routes protected by Anubis are matched in nginx with a regex:
^/[^/]+/[^/]+/(src|commits?|blame|compare|issues\?)
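
To make the footnote concrete, here is a minimal sketch (assuming Python and its re module; the example paths are hypothetical, not taken from the post) of which Forgejo-style URLs that regex would send through Anubis:

import re

# The regex from the footnote above: it matches source, commit, blame,
# compare and issue-search views under any /owner/repo/ prefix.
ANUBIS_ROUTES = re.compile(r"^/[^/]+/[^/]+/(src|commits?|blame|compare|issues\?)")

# Hypothetical example paths, purely for illustration.
paths = [
    "/alice/project/src/branch/main/README.md",   # src view: challenged
    "/alice/project/commits/branch/main",         # commit log: challenged
    "/alice/project/issues?q=scraper",            # issue search: challenged
    "/alice/project/releases",                    # not matched: passes through
    "/alice/project",                             # repo home: passes through
]

for path in paths:
    print(path, "->", "Anubis" if ANUBIS_ROUTES.search(path) else "direct")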

Kevin Karhan :verified: (@kkarhan@infosec.space)
2025-05-12

@lukeshu So I guess #Anubis has an explicit exception to handle #Lynx and will instead rely on rate limits and other static means to detect #scrapers and handle #UserAgent #abuse cases, like #fail2ban-style autobanning of violating IPs...

  • This makes sense for a #WAF like Anubis and would've been the only viable option I'm aware of.

I wonder if anyone has tried using Anubis on @torproject / #Tor to protect #OnionService|s, since that would be a reasonable application for it as well.

Wizards Anonymous (@crft)
2025-05-05

Do you have a strategy for handling scrapers pillaging your sites?

Kevin Karhan :verified: (@kkarhan@infosec.space)
2025-05-04

@torproject Q: I wish there was a similar tool to test #Bridges, as bridges.torproject.org/scan/ is not that good and I don't want to hammer it with dozens of addresses; at best that's quite antisocial, and at worst it might trigger responses assuming this is an intelligence-gathering operation.

  • Ideally, a standalone binary that one can just give a list of #TorBridge|s in a text file (similar to the way one can just paste them in at #TorBrowser) would help.

E.g.

bridgetest -v4 obfs4 203.0.113.0:80 …

bridgetest -v6 webtunnel [2001:DB8::1]:443 …

bridgetest -list ./tor.bridges.list.private.tsv

  • But maybe #onionprobe already does that. In that case please tell me to "#RTFM!"

Similarly, there needs to be a more granular way to request #TorBridges from #BridgeDB: it's basically impossible to get #IPv4 #Webtunnel addresses, nor is there an option to filter for #ports like :80 & :443 to deal with restrictive #firewalls (e.g. on public #WiFi)…

  • there are flags like ipv6=yes, but neither ipv4=yes nor ipv6=no yielded me any results other than #IPv6 webtunnel bridges…

And before anyone asks: yes, I do have a "legitimate purpose", as some of my contacts need Bridges to get past a mandatory firewall and/or use #TorBrowser (through an #SSH tunnel) to circumvent Tor & #VPN blocks and maintain privacy (many companies sometimes block entire #Hosters' ASNs due to rampant #scrapers).

Kevin Karhan :verified: (@kkarhan@infosec.space)
2025-05-02

@jherazob @leberschnitzel they already exist...

I think it's bad #TechPopulism to think that #Anubis will fix all the issues.

Just block all the #GAFAM ASNs & #hosters that host #Scrapers, so the industry cracks down harder on them than on #IRC, #Tor #ExitNodes, #CSAM & #BitTorrent combined!

Kevin Karhan :verified: (@kkarhan@infosec.space)
2025-05-02

@fx @julialuna I think that this makes #Anubis really #ableist and bad for #blind people cuz #JavaScript won't work on #LynxBrowser.

Given how #IRC, #Tor and #Mining are a big no-no on most hosters, it stands to reason that it's trivial to force them to ban "#AI" and related #scraping workloads as well!

2025-04-13

Tarpitting AI scrapers stealing your data & bandwidth.
Good idea, but instead of feeding them nonsense garbage Markov chains, find the AI’s own outputs & feed it that, ouroboros-style. #AI #tarpits #scrapers

arstechnica.com/tech-policy/20

@coyets @jernej__s @ben_hr @camwilson I'm fairly certain many #AI companies have enough resources to store all their models and datasets many times over, and even their network costs aren't trivial. I think it's likely that, first, these companies want the latest data to gain an advantage, and second, their #scrapers are so poorly optimized (maybe the scraper code itself was generated by their own models) that each scraper keeps overlapping with its own scraping or with other bots'.

2025-04-02

It has now become crystal clear what is going on with so-called "AI" #scrapers.

"Let's get the #AI to retrieve its own training data!" they said. "That way it can learn and improve itself!"

So they got the #LLM to write some training data scraping #code. It sucks. Because of course it would suck. Rinse and repeat over and over, with no improvement. And now the whole world is drowning in all their wasteful scraping traffic, and people the world over are screaming in unity "Make. This. Stop!"

Your feedback is being heard. LLMs already have it in their training data and can reproduce it on demand - ask "what is the problem with AI scrapers?" and you'll be told "...Excessive scraping can overload web servers, leading to slower performance and potential downtime..." and "...AI scrapers can be difficult to block, as they are designed to bypass common anti-scraping techniques..."

Trouble is, it's an "us" problem rather than a "them" problem, and there is no mechanism by which you can get LLMs or their evil #TechBro overlords to actually give a shit.

"For the better right" meme:
"AI will write its own code, doesn't matter if the first scrapers lay waste to the internet"
"Because the scrapers will evolve to be better, right?"
"..."
"Right?"

2025-03-30

y0 thanks to @seism0saurus's friend @kubikpixel i have a cool project to toss on the pile 😂:

#Anubis: "Anubis weighs the soul of your connection using a sha256 proof-of-work challenge in order to protect upstream resources from scraper bots.

Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug."

respect my robots.txt or pound sand. #AI #mitigation #scrapers #PoW #antiAI
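
For anyone curious what "weighing the soul" boils down to, here is a rough client-side sketch of the general sha256 proof-of-work idea (in Python; the challenge string, difficulty encoding and response handling are placeholders, not Anubis's actual protocol):

import hashlib
import itertools

def solve_pow(challenge: str, difficulty: int) -> int:
    """Find a nonce such that sha256(challenge + nonce) starts with
    `difficulty` hex zeros. Illustrative only."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce

# Hypothetical values; in Anubis the server issues the challenge
# and verifies the returned nonce before letting the request through.
challenge = "example-challenge"
difficulty = 4  # ~16^4 = 65536 hashes on average

nonce = solve_pow(challenge, difficulty)
print("nonce:", nonce)
print("hash: ", hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest())

The cost is negligible for a single human visitor but adds up quickly for a bot hammering thousands of pages.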

Light🐧⁂ (@light@hachyderm.io)
2025-03-28

Thanks to Fijxu's use of Anubis, videos can still be watched on inv.nadeko.net. 🫡

I feel like the aggressive bot scraping that intensified not long ago is going to make it impossible to keep using feed readers, and interacting with websites will end up being restricted to web browsers only.
Already, opening videos in mpv from my RSS-subscribed Invidious feeds doesn't work, and it was my preferred way to watch videos. Just to clarify, I'm aware that RSS itself still works; the only thing that doesn't is opening video links directly with mpv or with any other video player that can do the same. On top of that, I fear that at some point reading full articles inside an RSS reader will stop working, forcing me to open article links in a web browser, even if some feeds can fetch full articles, minimizing the need to do so.

I'm not trying to minimize the impact these scrapers have on free and open source projects and on web admins who have to deal with this onslaught of bot activity; they are the ones who have it worst.

#invidious #anubis #bots #LLMs #scrapers #crawlers #rss #rssreaders

Toby Kurien (@tobykurien)
2025-03-27

Here's one way to deal with scrapers hammering our websites: set up a trap. I've hidden a link on my website that humans wouldn't click on, but scrapers would follow. I added the destination to my robots.txt so that well-behaving bots won't follow it. Now I can grep my web logs for hits to that trap and get a list of IP addresses of badly behaving bots. If we share such a list of IPs (like with Crowdsec), we can collectively ban them.
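
A minimal sketch of the log-scanning step (assuming Python, an nginx/Apache combined-format access log, and a hypothetical trap path; none of these specifics are from the original post):

import re
from collections import Counter

TRAP_PATH = "/secret-trap/"              # hypothetical hidden-link target
LOG_FILE = "/var/log/nginx/access.log"   # assumed combined log format

# In combined log format the client IP is the first field and the request
# line ("GET /path HTTP/1.1") is the first quoted string.
line_re = re.compile(r'^(\S+) .*?"[A-Z]+ (\S+) ')

hits = Counter()
with open(LOG_FILE) as log:
    for line in log:
        m = line_re.match(line)
        if m and m.group(2).startswith(TRAP_PATH):
            hits[m.group(1)] += 1

# IPs that followed the hidden link despite robots.txt disallowing it.
for ip, count in hits.most_common():
    print(f"{ip}\t{count}")

The resulting list can then be fed into a shared blocklist such as a CrowdSec scenario or a plain firewall drop list.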

Ⓜ3️⃣3️⃣ 🌌 (@m33@theprancingpony.in)
2025-03-23

Cloudflare is wrestling with AI scrapers; not that I disagree, but how did Cloudflare come to decide who or what can access a website? They have a nearly monopolistic, man-in-the-middle position (as with their CDN).

Challenging times

#cloudflare #ai #privacy #CDN #scrapers #crawlers

2025-03-22

Our infra and our content are under attack by slimy "AI" companies and their #scrapers. Has anyone looked into who these companies are? Who are their directors? How can we, instead of just trying to defend, take the fight back to them? As long as there is no cost for them (it's externalised onto us), there will be no change in this despicable behavior. We need to start naming and shaming them, harassing them, suing them, raising their costs, so they stay the fuck away.

#ai #plague

you╭👺+300╭🐈x5╭⁂+3╭(Ⓐ+aunspeaker
2025-03-22

is anyone at @fsf / @fsfe / @conservancy looking into using the against ?

2025-03-21

List of AI bots to add to robots.txt (although they may not obey -- you may need to throw them in the bit bucket and 404 or 444 them). In addition to these, you may have to block specific random browser versions for the most aggressive bots that ignore robots.txt.

github.com/ai-robots-txt/ai.ro

#AI #scrapers #LLMs
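
As a sketch of the robots.txt half of that (in Python; the bot names below are placeholders, not entries from the linked list), generating Disallow rules from a plain list of user agents might look like this:

# Hypothetical user-agent names; in practice you'd take them from a
# maintained list like the repository linked above.
AI_BOTS = ["ExampleBot", "AnotherScraper", "YetAnotherCrawler"]

def robots_txt(bots):
    """Emit a robots.txt block disallowing the whole site for each bot."""
    lines = []
    for bot in bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    return "\n".join(lines)

print(robots_txt(AI_BOTS))

Bots that ignore the file are exactly the ones the post suggests answering with a 404 or 444 at the web-server level instead.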

2025-03-20

With all the buzz around #AI #Scrapers and the rise of PoW-based solutions to block them, I wonder if a PoW spec could be the future of bot prevention?

Imagine a system where servers and user agents exchange challenges without relying on JavaScript (using e.g. HTTP headers). This way, non-JS users and text-based browsers stay functional, while bots are left out.

Would this be practical? What are the trade-offs? 🤔
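
Purely as a thought experiment in that direction (the header names, format and difficulty encoding below are invented, not any existing spec), a JavaScript-free client could handle such a challenge like this:

import hashlib
import itertools

# Hypothetical exchange:
#   server -> 401 with  X-PoW-Challenge: token=abc123; bits=16
#   client -> retries with  X-PoW-Response: nonce=<n>
# Since it's just HTTP headers, curl, Lynx or a feed reader could do it.

def parse_challenge(header: str) -> tuple[str, int]:
    parts = dict(p.strip().split("=", 1) for p in header.split(";"))
    return parts["token"], int(parts["bits"])

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(token: str, bits: int) -> dict:
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{token}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return {"X-PoW-Response": f"nonce={nonce}"}

token, bits = parse_challenge("token=abc123; bits=16")
print(solve(token, bits))

The main catch is that a well-funded scraper farm can still buy the CPU time; the cost asymmetry only hurts clients that fetch far more pages than any human would.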
