#Scraper

Felix Palmen :freebsd: :c64: zirias@bsd.cafe
2025-06-07

@fanf Sure, that does make sense. I'll try to verify that jmeter indeed doesn't reuse connections (I already have debug logging in place that should tell me).

If that's really the reason, I guess the sane thing to do is to add a hint to the docs to just disable TLS for very busy sites. The intended use case for #swad is operation behind #nginx to serve its "auth_request". I don't intend to implement HTTP/2 or beyond, and it would be pretty pointless here anyway: nginx defaults to HTTP/1.0 for proxy requests and can be configured to use HTTP/1.1 instead, but *still* doesn't reuse connections by default, and my experiments so far to enable reuse weren't successful; maybe I haven't fully understood it yet. Using TLS behind nginx would make sense from a "defense in depth" point of view, but it's probably impractical once your load exceeds a certain threshold.
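
For reference, connection reuse towards an nginx proxy backend needs more than just `proxy_http_version 1.1`: the `keepalive` directive on an `upstream` block does the actual pooling, and the default `Connection: close` header must be cleared. A minimal sketch (the upstream name, port, and paths are invented for illustration):

```nginx
upstream auth_backend {
    server 127.0.0.1:8080;
    keepalive 16;                       # pool up to 16 idle connections
}

server {
    location /protected/ {
        auth_request /auth;             # subrequest to the auth daemon
    }

    location = /auth {
        internal;
        proxy_pass http://auth_backend;
        proxy_http_version 1.1;         # proxying defaults to HTTP/1.0
        proxy_set_header Connection ""; # clear the default "close"
    }
}
```

Without all three pieces, nginx opens a fresh connection per auth subrequest, which is exactly the pattern that makes TLS handshakes dominate.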

For background on how I arrived there: I observed stupid #AI #scraper #bots clogging my DSL connection by downloading gigabytes of build logs produced by my #poudriere. They're not secret in any way, and having a simple way to share them is great for community bug hunting, but this had to stop. I had a simple C library doing a fully portable reactor event loop on top of select (so, not really scalable), and some very limited HTTP/1.1 server code from experiments with TOR hidden services ... so I put that together to add some web-form + cookies auth to my private nginx to lock out the bots. Later, I added a "guest login" doing the same "proof of work" stuff known from #anubis, and then I suddenly had the idea to make my little service (which already solved the problem perfectly for myself) suitable for large-scale installations. So I added kqueue, epoll etc. support, a "multi-reactor with acceptor-connector" design, and so on ... and now I'm a bit frustrated that enabling TLS spoils all the performance 🙈

Torf und Schneetorf@c.im
2025-05-31

The most disgusting feature of this relatively new plague of #AI #scraper bots is that they are about to defile everything we like about the *good* internet.

Images with relevant #AltText? Perfect training materials for text-to-image generative models.

Static webpages? No #Anubis - no problem to scrape.

#Anubis uses proof-of-work ( #PoW ), which implies either #JavaScript or manual instructions. Not that it is a *good* solution... more like the best of the worst (as if there were any good ones...)
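
The #PoW idea behind these tools is simple to sketch: the server hands out a challenge, and the client must grind hashes until one meets a difficulty target, which is costly to mass-produce but cheap to verify. A toy model in Python (function names and the leading-zero-hex-digits rule are illustrative, not Anubis's actual scheme):

```python
import hashlib

def solve_pow(challenge: str, difficulty: int = 4) -> int:
    """Client side: brute-force a nonce until the hash of
    challenge+nonce starts with `difficulty` zero hex digits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify_pow(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    """Server side: a single hash, cheap regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

Each extra zero digit multiplies the client's expected work by 16 while the server's verification cost stays constant; that asymmetry is what prices out bulk scraping.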

In recent days I learned that (1) #Tor has a #PoW mechanism, and (2) Anubis seems to somehow whitelist the #lynx browser, letting no-JS Lynx users in (a big favour for #accessibility and the #smolweb ). Good (let's hope all of this persists).

Kiwix kiwix
2025-05-30

MWoffliner, the @mediawiki scraper, has been released in version 1.15!

1.15 brings a significant amount of improvements:
* Support for the widely used (outside Wikimedia) "ActionParse" API
* Use of the latest libzim (we were stuck on an older version), which fixes many suggestion problems with non-Latin alphabets
* Move to Node.js 24 + many install fixes
* Better, more sophisticated remote error handling

Full changelog at github.com/openzim/mwoffliner/

Available as a container image and as an npm package!

Jan ☕🎼🎹☁️🏋️‍♂️ jan@kcore.org
2025-05-25

Another filthy bot really hitting my server hard...

"Scrapy/2.11.2 (+scrapy.org)"

URL doesn't work. IP (35.239.86.156) points to 156.86.239.35.bc.googleusercontent.com.

Blocked it is. I'll perhaps unblock it later when I have more time.

#admin #scraper

@reiver ⊼ (Charles) :batman: reiver
2025-05-21

2/

Scraping (as in Web Scraping) is the act of extracting data from HTML web-pages where the data is NOT machine-legible.

If the data, even in an HTML web-page, is in a machine-legible format, then it is NOT scraping.

...

And, getting data in JSON (key-value pairs) is definitely NOT scraping, as JSON's purpose is to communicate data in a machine-legible manner.
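
The distinction is easy to show side by side. A toy Python comparison (the data and markup are invented): consuming JSON just decodes declared structure, while scraping must recover structure from presentation markup.

```python
import json
from html.parser import HTMLParser

# Machine-legible: the API declares the structure; no guessing involved.
api_body = '{"user": "alice", "posts": 42}'
record = json.loads(api_body)

# Scraping: the same facts are buried in markup meant for rendering,
# so structure must be recovered heuristically (here: grab <td> text).
class CellExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells.append(data.strip())

extractor = CellExtractor()
extractor.feed("<table><tr><td>alice</td><td>42</td></tr></table>")
```

The JSON path is stable under redesigns of the page; the scraping path breaks whenever the markup changes, which is exactly why the two deserve different names.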

CC: @404mediaco

@reiver ⊼ (Charles) :batman: reiver
2025-05-21

1/

If these researchers used a typical HTTP-based API that returns JSON, then:

What these researchers did is NOT scraping.

CC: @404mediaco

RE: 404media.co/researchers-scrape

"Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord's public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active."
2025-05-06

Two new things #today: Wanted to take a picture of the skyline because it wasn't there(!) - couldn't see the high rises on 7th from 8th, low fog. And my first ride on a subway car where all the banner ads were actually screens - they all changed to the same ad at one point.
#NYC #NewYorkCity #subway #advertisements #commuter #TimesSquare #train #ride #strap #hanger #Manhattan #low #fog #sky #scraper #high #rise #tower #skyline #mist #rain #financial #district #world #trade #center #WTC

Felix Palmen :freebsd: :c64: zirias@bsd.cafe
2025-05-05

Just released: #swad 0.6 🍻

Looking to add #authentication to your reverse proxy (e.g. #nginx)? Or some protection against #scraper #bots of #AI #companies? Swad might help you!

Swad is the "Simple Web Authentication Daemon", written in pure #C with very few external dependencies (just zlib and, depending on build options, OpenSSL/LibreSSL and libpam). It offers a login form for configurable credential checkers (currently: executing an external tool, checking a password file with bcrypt hashes, or asking #PAM), and a "guest login" requiring the client browser to solve the same crypto puzzle known from e.g. #Anubis. Optional #https support is also included. It currently compiles to a binary of less than 200 KiB. I'm using it myself on a #FreeBSD machine, it's also tested on #Linux, and it *should* work on any "POSIXy" system.

Version 0.6 brings lots of fixes and improvements, but most notably the feature to reload configuration via SIGHUP, which for example enables certificate rollover without any service interruption.
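
The usual shape of such a SIGHUP reload (a generic sketch in Python, not swad's actual C implementation) is: the signal handler only sets a flag, and the event loop picks it up at a safe point to swap in the new configuration and certificates without dropping connections.

```python
import signal

reload_requested = False

def on_sighup(signum, frame):
    # Handlers should do almost nothing: just record that a reload
    # was requested and return immediately.
    global reload_requested
    reload_requested = True

signal.signal(signal.SIGHUP, on_sighup)

def maybe_reload(load_config):
    """Called from the event loop between events: if a SIGHUP arrived,
    re-read the config (and e.g. TLS certificates) and return it."""
    global reload_requested
    if reload_requested:
        reload_requested = False
        return load_config()
    return None
```

Because the swap happens between events rather than inside the handler, in-flight requests keep their old configuration and nothing is interrupted, which is what makes certificate rollover seamless.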

Check out the latest release, grab the .txz tarball, and build/install it! 😎

github.com/Zirias/swad

Patrick Wu :neocat_flag_bi: patrick@o0o.social
2025-04-28

And it's much better

fuck you OpenAI and Amazon

#ai #scraper #cloud #openai #amazon

[Image: graph showing a lower network connection rate]
2025-04-22

Update: I reported the bot. Thanks.

A Mastodon bot account at mastodon.cloud scans the fediverse, scrapes selected web pages shared there, rewrites them with AI, posts them to its own site, and shares the rewritten AI slop on Mastodon as tech news. The bot scraped a post of mine (including the attached image) within minutes of my federated blog publishing it.

Is it worth flagging the bot and reporting it to its instance? Are the mods likely to take action?

#mastodon #moderation #ai #scraper

Tom :damnified: thomas@metalhead.club
2025-04-22

Anubis: self-hostable scraper defense software

anubis.techaro.lol/

#ai #bots #scraper

2025-04-19

Wikimedia is buckling under the weight of AI scrapers.

"Our content is free, our infrastructure is not".

diff.wikimedia.org/2025/04/01/

#wikipedia #ai #scraper

@reiver ⊼ (Charles) :batman: reiver
2025-04-17
Allanon 🇮🇹 :amiga: allanon@mastodon.uno
2025-04-11

Just for the record, some of the most aggressive ones are those from #Microsoft & #Bing :

BingBot : 52.167.144.*
BingBot : 40.77.167.*

I had some intense visits from #OpenAI too:
OpenAI : 52.255.111.84-87

...at least judging by their #useragent

#bot #crawler #scraper

2025-04-03

😅😅😅😅😅
#cats #catsofmastodon #cat #scraper
