3/
For more on scraping (as in web-scraping) see here:
https://mastodon.social/@reiver/114353728684249608
CC: @404mediaco
2/
Scraping (as in web scraping) is the act of extracting data from HTML web pages where the data is NOT machine-legible.
If the data, even in an HTML web page, is in a machine-legible format, then it is NOT scraping.
...
And getting data as JSON (key-value pairs) is definitely NOT scraping: JSON's whole purpose is to communicate data in a machine-legible manner.
CC: @404mediaco
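The distinction above can be sketched in a few lines of standard-library Python. Everything here is hypothetical (the HTML snippet, the `price` class, the JSON payload); the point is only the contrast between recovering data from presentation markup versus reading data that is already machine-legible:

```python
import json
from html.parser import HTMLParser

# Scraping: the value is buried in presentation markup, so we must
# walk the HTML structure to recover it (hypothetical snippet).
html_page = '<html><body><span class="price">19.99</span></body></html>'

class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.price = data
            self.in_price = False

parser = PriceParser()
parser.feed(html_page)

# NOT scraping: the same data delivered as JSON is already
# machine-legible; no structural guesswork is needed.
json_payload = '{"price": 19.99}'
record = json.loads(json_payload)

print(parser.price)     # recovered by parsing markup
print(record["price"])  # read directly from the payload
```

The scraped value comes back as a string that happens to sit inside a `<span>`; the JSON value comes back as a typed number, because the format was designed to carry data.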
1/
If these researchers used a typical HTTP-based API that returns JSON, then what they did is NOT scraping.
CC: @404mediaco
RE: https://www.404media.co/researchers-scrape-2-billion-discord-messages-and-publish-them-online/
Oh joy, another "game-changing" #webscraper named #Scraperr 🤖—because apparently, the internet was just crying out for one more script-kiddie #tool to scrape and bloat their hard drives with HTML they’ll never use. Congrats, #GitHub user, your contribution to the overload of useless data is truly groundbreaking. 🚀🎉
https://github.com/jaypyles/Scraperr #dataoverload #scriptkiddie #HackerNews #ngated
Scraperr – A Self Hosted Webscraper
https://github.com/jaypyles/Scraperr
#HackerNews #Scraperr #Webscraper #SelfHosted #TechTools #OpenSource
To find out which activities exist in particular fields of work across the federated association, the DRK is experimenting with web scraping of the websites of its district and state branches.
➡️ https://drk-wohlfahrt.de/blog/eintrag/mit-webscraping-data-science-die-wohnungslosenhilfen-im-drk-verstehen.html ("How Data Science Can Support the DRK in Homeless Services")
#DRK #RotesKreuz #DataScience #DataScienceHub #Webscraping #Webscraper #DSSG #Wohlfahrt #Wohlfahrtspflege
#Development #Analyses
The text file that runs the internet · Is a basic social contract of the web falling apart? https://ilo.im/15xzdk
_____
#AI #AiModel #GenerativeAI #WebBot #WebCrawler #WebScraper #SearchEngine #Website #Blog #RobotsTxt
I am so happy with the first web application of my own that I have developed 🎉: Tris, a simple and free web crawler 🕸️ 🕷️ !
You can try it for free online: https://tris.fly.dev, limited to 3 parallel crawls and 100 links at a path depth of 3.
The next thing I will add is a text input to set a target domain, hhh, now I am making it hard! 🙈
#node #nodejs #web #webcrawler #crawler #seo #datatools #webscraper #scraping #seotools #seotool #tris #triswebcrawler #webapp #indie #indiedev
I am so happy to get recommendations on fly.io here. I managed to finally deploy my NodeJS web scraper app. World meet Tris: https://tris.fly.dev
#Development #Collections
Dark Visitors · A list of known AI agents on the internet https://ilo.im/15xjhu
_____
#WebDev #WebScraper #AI #GenerativeAI #Chatbot #Backend #UserAgent #RobotsTxt
Went through a series of side-quests leading to rabbit holes containing kettles of fish to finally get some data from a #webscraper into a persistent database #arangodb. Learned a few things on the way and documented them in a blag post: https://blag.nullteilerfrei.de/2023/10/13/install-nsq-on-debian-with-init-d-and-nginx/
#cryptography #EllipticCurves #nginx #nsq #debian #pyArango #brainpool
Writing a Web Scraper in Rust using Reqwest: https://www.shuttle.rs/blog/2023/09/13/web-scraping-rust-reqwest
I did it again!
So I created #MastoBot, a generic #Python Mastodon bot that allows anyone to create a bot.
I created a few versions, and I use it for @3dprinting. But naturally, knowing how to implement it and develop functions, I needed a use case.
So, after a discussion this morning, I spent the entire day writing @Python. Yes, I did it again.
However, this one now has a built-in #webscraper to cross-post new posts from https://discuss.python.org/, because why not.
This @Python bot required a few things, and updates were made to #MastoBot. I had to make it even more generic, implement an overkill datastore with #Redis, and extend the config system.
@Python will behave exactly like @3dprinting with the added feature of cross-posts. These posts will, however, be "followers only" posts, so as not to pollute #Python and flood everything initially.
The bot will #boost parent posts, allowing for threads and discussions to be created.
The source code will be out tomorrow, just cleaning up.
I have found you. It turns out that if you access Beautiful Soup elements and don't call decompose() on them, they will cause memory leaks.
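A minimal sketch of that pattern, assuming the bs4 package and a scraper that processes many pages in a loop; `extract_titles` and the `h1` selector are made-up names for illustration:

```python
from bs4 import BeautifulSoup

def extract_titles(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # Copy out plain strings instead of keeping Tag objects:
    # every Tag holds a reference back into the parse tree, so
    # storing Tags keeps each page's whole tree alive.
    titles = [tag.get_text() for tag in soup.find_all("h1")]
    # Explicitly destroy the tree so a long-running scraper
    # does not accumulate soup objects between pages.
    soup.decompose()
    return titles
```

The key points are returning plain `str` copies (via `get_text()`) rather than `Tag` objects, and calling `decompose()` once the page has been processed.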
#OpenRefine is a free-software tool for cleaning and reshaping data. It can also be used as a Swiss Army knife for Wikimedia, to:
- import data into Wikidata
- import data into Commons (new in 2021 🎉)
- import files into Commons (new in 2022 🎉)
- retrieve pages from any Wikimedia project
- retrieve any page on the Internet (#webScraper).
Thank you @belett !
Source: https://programme.wikiconvention.fr/#_session-openrefine
#WikiConvention