Not sure why I had to do two CAPTCHAs before I could pay our dentist. I'd entered all the credit card information already. Are OpenAI's scrapers paying dentist bills now? #captcha #scrapers #llm
Using Awk to find out the FBI was paying scrapers to find Torswats
https://blog.freespeechextremist.com/blog/fse-vs-fbi.html
#HackerNews #Awk #FBI #Scrapers #TorSwats #Cybersecurity #FreeSpeech
Well, it had to happen…
#Anubis is deployed on:
#AI #scrapers won't overload my server for their dirty purposes again and won't get more data than a legitimate search engine following robots.txt rules.
Note that I'm using the default configuration, which lets requests from user agents like curl (i.e. ones that don't pretend to be browsers) through without a captcha, so any scripts you have will keep working.
[^1] routes protected by anubis are matched in nginx with a regex: ^/[^/]+/[^/]+/(src|commits?|blame|compare|issues\?)
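For anyone wondering which paths that regex actually catches, here's a quick sanity check in Python (the example paths are made-up owner/repo URLs of my own, not taken from the setup above):

```python
import re

# The nginx regex from the footnote above: it covers src/commit/blame/compare
# views and filtered issue searches, leaving other routes unchallenged.
ANUBIS_ROUTES = re.compile(r"^/[^/]+/[^/]+/(src|commits?|blame|compare|issues\?)")

# Hypothetical example paths for an owner/repo layout (not from the original post).
paths = [
    "/alice/project/src/branch/main/README.md",  # matched -> challenged
    "/alice/project/commits/branch/main",        # matched -> challenged
    "/alice/project/issues?q=crash&type=all",    # matched -> challenged
    "/alice/project/issues/42",                  # not matched -> served directly
    "/alice/project",                            # not matched -> served directly
]

for p in paths:
    print("challenge" if ANUBIS_ROUTES.search(p) else "pass     ", p)
```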
@lukeshu So I guess #Anubis has an explicit exception to handle #Lynx and will instead rely on rate limits and other static means to detect #scrapers and handle #UserAgent #abuse cases, like #fail2ban-style autobanning of violating IPs...
I wonder if anyone has tried using Anubis on @torproject / #Tor to protect #OnionService|s, since that would be a reasonable application for it as well.
@torproject Q: I wish there was a similar tool to test #Bridges, as https://bridges.torproject.org/scan/ is not that good and I don't want to hammer it with dozens of addresses; at best that's quite antisocial, and at worst it might trigger responses that assume this is an intelligence-gathering operation.
I.e.
bridgetest -v4 obfs4 203.0.113.0:80 …
bridgetest -v6 webtunnel [2001:DB8::1]:443 …
bridgetest -list ./tor.bridges.list.private.tsv
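As far as I know no such tool exists, but a very rough sketch of what a minimal bridgetest could do (my own assumption: just a TCP reachability probe, no obfs4/webtunnel handshake, so it only proves the port answers) might look like:

```python
#!/usr/bin/env python3
# Hypothetical "bridgetest" sketch: only checks TCP reachability of each
# bridge's host:port. It does NOT speak obfs4/webtunnel, so an open port is a
# necessary but not sufficient sign that the bridge actually works.
import socket
import sys

def tcp_reachable(host: str, port: int, timeout: float = 10.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Usage: ./bridgetest.py 203.0.113.0:80 "[2001:DB8::1]:443" ...
    for arg in sys.argv[1:]:
        host, _, port = arg.rpartition(":")
        host = host.strip("[]")          # allow [IPv6]:port notation
        ok = tcp_reachable(host, int(port))
        print(f"{'up  ' if ok else 'down'} {arg}")
```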
Similarly, there needs to be a more granular way to request #TorBridges from #BridgeDB: it's basically impossible to get #IPv4 #Webtunnel addresses, nor is there an option to filter for #ports like :80 & :443 to deal with restrictive #firewalls (i.e. on public #WiFi)…
I tried ipv6=yes, but neither ipv4=yes nor ipv6=no yielded me results other than #IPv6 webtunnel bridges…
And before anyone asks: yes, I do have a "legitimate purpose", as some of my contacts need Bridges to get past a mandatory firewall and/or use #TorBrowser (through an #SSH tunnel) to circumvent Tor & #VPN blocks and maintain privacy (many companies block entire #Hosters' ASNs due to rampant #scrapers)…
@jherazob @leberschnitzel they already exist...
I think it's bad #TechPopulism to think that #Anubis will fix all the issues.
Just block all the #GAFAMs ASNs & #hosters that host #Scrapers so the industry cracks down harder on them than on #IRC, #Tor #ExitNodes, #CSAM & #BitTorrent combined!
@fx @julialuna I think that this makes #Anubis really #ableist and bad for #blind people cuz #JavaScript won't work on #LynxBrowser.
Given how #IRC, #Tor and #Mining is a big no-no on most hosters, it stands to reason that it's trivial to force them to ban "#AI" and related #scraping workloads as well!
@coyets @jernej__s @ben_hr @camwilson I'm certain many #AI companies have enough resources to store all their models and datasets several times over, and even the network costs are no joke. I think it's likely that, first, these companies want the latest data to gain an advantage, and second, their #scrapers are so poorly optimized (maybe the scraper code was itself generated by models) that each scraper keeps overlapping its own scraping or that of other bots.
It has now become crystal clear what is going on with so-called "AI" #scrapers.
"Let's get the #AI to retrieve its own training data!" they said. "That way it can learn and improve itself!"
So they got the #LLM to write some training data scraping #code. It sucks. Because of course it would suck. Rinse and repeat over and over, with no improvement. And now the whole world is drowning in all their wasteful scraping traffic, and people the world over are screaming in unity "Make. This. Stop!"
Your feedback is being heard. LLMs already have it in their training data and can reproduce it on demand - ask "what is the problem with AI scrapers?" and you'll be told "...Excessive scraping can overload web servers, leading to slower performance and potential downtime..." and "...AI scrapers can be difficult to block, as they are designed to bypass common anti-scraping techniques..."
Trouble is, it's an "us" problem rather than a "them" problem, and there is no mechanism by which you can get LLMs or their evil #TechBro overlords to actually give a shit.
y0 thanks to @seism0saurus's friend @kubikpixel i have a cool project to toss on the pile 😂:
#Anubis: "Anubis weighs the soul of your connection using a sha256 proof-of-work challenge in order to protect upstream resources from scraper bots.
Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug."
respect my robots.txt or pound sand. #AI #mitigation #scrapers #PoW #antiAI
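For context, the proof-of-work idea itself is tiny; here's a rough generic sketch of a sha256 challenge (my own illustration, not Anubis's actual code): the server hands out a random challenge and a difficulty, the client grinds nonces until the hash has enough leading zero bits, and the server verifies with a single hash.

```python
import hashlib
import os

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def issue_challenge() -> str:
    # Server side: a random challenge the client must include in its hash.
    return os.urandom(16).hex()

def solve(challenge: str, difficulty: int) -> int:
    # Client side: brute-force a nonce; the cost is ~2**difficulty hashes.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    # Server side: verification costs a single hash, which is the whole point.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = issue_challenge()
nonce = solve(challenge, difficulty=16)         # ~65k hashes on average
print(verify(challenge, nonce, difficulty=16))  # True
```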
Thanks to Fijxu's use of Anubis, videos can still be watched on inv.nadeko.net. 🫡
I feel like the aggressive bot scraping that intensified not long ago is going to make it impossible to keep using feed readers, and interacting with websites will end up restricted to web browsers only.
Opening videos in mpv from my RSS-subscribed Invidious feeds already doesn't work, and it was my preferred way to watch videos. Just to clarify: I'm aware that RSS itself still works; the only thing that doesn't is opening video links directly with mpv or any other video player that can do the same. And not only that, but I fear that at some point reading full articles inside an RSS reader will stop working, forcing me to open article links in a web browser, even if some feeds can fetch full articles and minimize the need to do so.
I'm not trying to minimize the impact these scrapers have on free and open source projects and on web admins who have to deal with this onslaught of bot activity; they are the ones who have it worst.
#invidious #anubis #bots #LLMs #scrapers #crawlers #rss #rssreaders
Here's one way to deal with #AI #bots and #scrapers hammering our websites: set up a trap. I've hidden a link on my website that humans wouldn't click on, but scrapers would follow. I added the destination to my robots.txt so that well-behaving bots won't follow it. Now I can grep my web logs for hits to that trap and get a list of IP addresses of badly behaving bots. If we #crowdsource such a list of IPs (like with Crowdsec), we can collectively ban them.
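As a rough illustration (the trap path, log location, and combined log format are all assumptions, not my actual setup), pulling the offending IPs out can be a few lines of Python:

```python
# Minimal sketch: pull client IPs that requested a hidden trap URL out of an
# nginx/Apache combined-format access log. The trap path is hypothetical.
import re
from collections import Counter

TRAP_PATH = "/secret-honeypot/"          # the hidden link's target (assumed name)
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

hits = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        m = LOG_LINE.match(line)
        if m and m.group(2).startswith(TRAP_PATH):
            hits[m.group(1)] += 1        # group(1) is the client IP

# Print a ban list, worst offenders first, e.g. to feed into Crowdsec/fail2ban.
for ip, count in hits.most_common():
    print(f"{ip}\t{count}")
```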
infra and our content are under attack by slimy "AI" companies and their #scrapers. has anyone looked into who these companies are? who are their directors? how can we take the fight back to them instead of just defending? as long as there is no cost for them (it's externalised onto us) there will be no change in this despicable behavior. we need to start naming and shaming them, harassing them, suing them, raising their costs, so they stay the fuck away.
List of AI bots to add to robots.txt (although they may not obey -- may need to throw them in the bitbucket and 404 or 444 them). In addition to these, you may have to block specific random browser versions for the most aggressive bots who ignore robots.txt.
https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.txt
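If you don't want to maintain the user-agent list by hand, a rough sketch like this could pull the names out of that robots.txt and print one blocking regex (the raw URL is derived from the blob link above, and the output format is my assumption, so adapt it to your server):

```python
# Sketch: turn the ai.robots.txt user-agent list into a single blocking regex.
import re
import urllib.request

URL = "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt"

with urllib.request.urlopen(URL) as resp:
    text = resp.read().decode("utf-8", errors="replace")

# Collect every "User-agent:" value except the catch-all "*".
agents = sorted({
    m.group(1).strip()
    for m in re.finditer(r"(?im)^user-agent:\s*(.+)$", text)
    if m.group(1).strip() != "*"
})

# One case-insensitive alternation, escaped so names stay literal.
pattern = "(" + "|".join(re.escape(a) for a in agents) + ")"
print(pattern)   # e.g. match it against the User-Agent header in your server config
```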
With all the buzz around #AI #Scrapers and the rise of PoW-based solutions to block them, I wonder if a PoW spec could be the future of bot prevention?
Imagine a system where servers and user agents exchange challenges without relying on JavaScript (using e.g. HTTP headers). This way, non-JS users and text-based browsers stay functional, while bots are left out.
Would this be practical? What are the trade-offs? 🤔
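No idea how practical it would be, but as a thought experiment the exchange might look like this (all the X-PoW-* header names are invented for illustration; this is not an existing spec): the server answers the first request with a challenge in plain headers, any client, JS or not, grinds a nonce, and retries with the solution in a request header.

```python
# Thought-experiment sketch of a JS-free, header-based PoW handshake.
# All header names (X-PoW-*) are invented for illustration; no such spec exists.
import hashlib
import os

DIFFICULTY = 16  # required number of leading zero bits

def server_challenge() -> dict:
    # First response: e.g. a 401 with the challenge carried in plain HTTP headers.
    return {
        "X-PoW-Challenge": os.urandom(16).hex(),
        "X-PoW-Difficulty": str(DIFFICULTY),
    }

def client_solve(headers: dict) -> dict:
    # Any client (curl plus a tiny script, a lynx wrapper, even a bot) can solve
    # this without JavaScript: hash until enough leading bits are zero.
    challenge = headers["X-PoW-Challenge"]
    difficulty = int(headers["X-PoW-Difficulty"])
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - difficulty) == 0:
            return {"X-PoW-Challenge": challenge, "X-PoW-Nonce": str(nonce)}
        nonce += 1

def server_verify(headers: dict) -> bool:
    # Retry request: one hash to verify, then serve the page.
    digest = hashlib.sha256(
        f"{headers['X-PoW-Challenge']}:{headers['X-PoW-Nonce']}".encode()
    ).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

resp = server_challenge()
retry = client_solve(resp)
print(server_verify(retry))  # True
```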