#WebScraper

@reiver ⊼ (Charles) :batman: @reiver
2025-05-21

2/

Scraping (as in Web Scraping) is the act of extracting data from HTML web-pages where the data is NOT machine-legible.

If the data, even in an HTML web-page, is in a machine-legible format, then it is NOT scraping.

...

And, getting data in JSON (key-value pairs) is definitely NOT scraping — as JSON's purpose is to communicate data in a machine-legible manner.

CC: @404mediaco
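
A minimal sketch of the distinction drawn in this post, using only the Python standard library. The sample HTML, the JSON body, and the names `MessageScraper`, `scrape`, and `parse_api_response` are hypothetical and only illustrate the idea: in the first case the fields have to be recovered from presentation markup, in the second they arrive already structured.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample data: the same message in two forms.
HTML_PAGE = """
<html><body>
  <div class="message">
    <span class="author">alice</span>: <span class="text">hello</span>
  </div>
</body></html>
"""
JSON_BODY = '{"author": "alice", "text": "hello"}'


class MessageScraper(HTMLParser):
    """Scraping: recover fields from markup by watching for class names."""

    def __init__(self):
        super().__init__()
        self._field = None  # class of the <span> we are currently inside
        self.record = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("author", "text"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            # Text may arrive in several chunks, so accumulate it.
            self.record[self._field] = self.record.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(html):
    """Extract the message fields from an HTML page."""
    parser = MessageScraper()
    parser.feed(html)
    return parser.record


def parse_api_response(body):
    """Not scraping: the JSON body is already machine-legible."""
    return json.loads(body)


print(scrape(HTML_PAGE))
print(parse_api_response(JSON_BODY))
```

The scraper breaks as soon as the page layout or class names change, while the JSON path depends only on the documented keys.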

@reiver ⊼ (Charles) :batman: @reiver
2025-05-21

1/

If these researchers used a typical HTTP-based API that returns JSON, then —

What these researchers did is NOT scraping.

CC: @404mediaco

RE: 404media.co/researchers-scrape

"Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active."

N-gated Hacker News @ngate
2025-05-11

Oh joy, another "game-changing" web scraper 🤖, because apparently the internet was just crying out for one more script kiddie to scrape and bloat their hard drive with HTML they’ll never use. Congrats, user, your contribution to the overload of useless data is truly groundbreaking. 🚀🎉
github.com/jaypyles/Scraperr

@reiver ⊼ (Charles) :batman: @reiver
2025-04-17

Enzyklopädie Roter Kreis @wissen@sozial.roter-kreis.de
2024-03-23

To find out which activities exist in particular fields of work across the federated association, the DRK (German Red Cross) is experimenting with web scraping the websites of its district and state associations.
➡️ drk-wohlfahrt.de/blog/eintrag/ ("How data science can support the DRK in its work with homeless people")

#DRK #RotesKreuz #DataScience #DataScienceHub #Webscraping #Webscraper #DSSG #Wohlfahrt #Wohlfahrtspflege

Inautilo @inautilo
2024-02-15


The text file that runs the internet · Is a basic social contract of the web falling apart? ilo.im/15xzdk

_____

Vedran Mandić @vekzdran@hachyderm.io
2024-01-27

I am so happy with the first web application of my own 🎉 that I have developed: Tris, a simple and free web crawler 🕸️ 🕷️ !

You can try it for free online: tris.fly.dev, limited to 3 parallel crawls and 100 links with a path depth of 3.

The next thing I will add is a text input to set a target domain, hhh; right now I am making it hard! 🙈

#node #nodejs #web #webcrawler #crawler #seo #datatools #webscraper #scraping #seotools #seotool #tris #triswebcrawler #webapp #indie #indiedev
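
A rough sketch of how limits like the ones in this post (3 parallel crawls, 100 links, depth 3) could be enforced. This is not how Tris is implemented: Tris is a Node.js app whose source is not shown here, this sketch is in Python, "path depth" is assumed to mean link-following depth, and `SITE`, `fetch_links`, and `crawl` are hypothetical names, with an in-memory link map standing in for real HTTP requests.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory site standing in for real HTTP fetches.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/b"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
    "/b/1": [],
    "/a/1/x": ["/a/1/x/y"],
    "/a/1/x/y": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return SITE.get(url, [])


def crawl(start, max_workers=3, max_links=100, max_depth=3):
    """Breadth-first crawl with a worker, link-count, and depth limit."""
    seen = [start]      # discovery order, also used for de-duplication
    frontier = [start]  # pages to fetch at the current depth
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_depth):
            next_frontier = []
            # At most max_workers pages are fetched at the same time;
            # pool.map returns results in the order of the frontier.
            for links in pool.map(fetch_links, frontier):
                for link in links:
                    if link not in seen and len(seen) < max_links:
                        seen.append(link)
                        next_frontier.append(link)
            if not next_frontier:
                break
            frontier = next_frontier
    return seen


print(crawl("/"))
```

Crawling level by level keeps the depth limit simple: the page four links away from the start, `/a/1/x/y`, is never discovered.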

Vedran Mandić @vekzdran@hachyderm.io
2024-01-25

I am so happy I got recommendations for fly.io here. I finally managed to deploy my NodeJS web scraper app. World, meet Tris: tris.fly.dev

#webscraper #scraping #nodejs #seotools #seo

Inautilo @inautilo
2024-01-06

@larsborn has moved @larsborn
2023-10-14

Went through a series of side-quests leading to rabbit holes containing kettles of fish to finally get some data from a web scraper into a persistent database. Learned a few things on the way and documented them in a blag post: blag.nullteilerfrei.de/2023/10

I did it again!

So I created #MastoBot, a generic #Python Mastodon bot that allows anyone to create a bot.

I created a few versions, and I use it for @3dprinting. But naturally, knowing how to implement it and developing functions, I need a use case.

So after a discussion this morning, I spent the entire day writing @Python. Yes, I did it again.

However, this one now has a built-in #webscraper to cross-post new posts from https://discuss.python.org/, because why not.

This @Python required a few things, and updates were made to #MastoBot. I had to make it even more generic, implement an overkill datastore with #Redis, and extend the config system.

@Python will behave exactly like @3dprinting with the added feature of crossposts. These posts will, however, be "follower only" posts, so as not to pollute #Python and flood everything initially.

The bot will #boost parent posts, allowing for threads and discussions to be created.

The source code will be out tomorrow, just cleaning up.

2023-04-13

I have found you. Turns out if you access Beautiful Soup elements and don't call decompose on them, it will cause memory leaks.

#python #beautifulsoup #Webscraper
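
A small sketch of the pattern this post describes, assuming the third-party `bs4` package is installed. The sample HTML and the function name `extract_and_release` are hypothetical, and the post does not show the leaking code, so this only illustrates calling `decompose()` once an element is no longer needed and keeping plain `str` copies rather than references into the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document.
SAMPLE = (
    "<html><body>"
    "<div id='a'><p>one</p></div>"
    "<div id='b'><p>two</p></div>"
    "</body></html>"
)


def extract_and_release(html):
    """Collect the text of each <div>, then destroy the element."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for div in soup.find_all("div"):
        # get_text() returns an ordinary str, so nothing kept in `texts`
        # points back into the parse tree.
        texts.append(div.get_text(strip=True))
        # Remove the element from the tree and destroy it and its children.
        div.decompose()
    remaining = len(soup.find_all("div"))
    return texts, remaining


print(extract_and_release(SAMPLE))
```

Whether skipping `decompose()` leaks memory in a given program depends on what else still holds references to the tree, so treat this as one way to rule that cause out.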

2022-11-20

#OpenRefine is free software for cleaning and shaping data. It can also be used as a Swiss Army knife for Wikimedia, to:
- import data into Wikidata
- import data into Commons (new in 2021 🎉)
- import files into Commons (new in 2022 🎉)
- retrieve pages from any Wikimedia project
- retrieve any page on the Internet (#webScraper).
Thanks @belett!
Source: programme.wikiconvention.fr/#_
#WikiConvention
