3/
For more on scraping (as in web-scraping) see here:
https://mastodon.social/@reiver/114353728684249608
CC: @404mediaco
2/
Scraping (as in web scraping) is the act of extracting data from HTML web pages where the data is NOT machine-legible.
If the data, even in an HTML web page, is in a machine-legible format, then it is NOT scraping.
...
And getting data as JSON (key-value pairs) is definitely NOT scraping: JSON's whole purpose is to communicate data in a machine-legible manner.
CC: @404mediaco
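The distinction above can be sketched in a few lines of standard-library Python. Everything here is hypothetical (the HTML snippet, the `price` class, the JSON payload); the point is only the contrast between recovering data from presentation markup versus reading data that is already machine-legible:

```python
import json
from html.parser import HTMLParser

# Scraping: the value is buried in presentation markup, so we must
# walk the HTML structure to recover it (hypothetical snippet).
html_page = '<html><body><span class="price">19.99</span></body></html>'

class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.price = data
            self.in_price = False

parser = PriceParser()
parser.feed(html_page)

# NOT scraping: the same data delivered as JSON is already
# machine-legible; no structural guesswork is needed.
json_payload = '{"price": 19.99}'
record = json.loads(json_payload)

print(parser.price)     # recovered by parsing markup
print(record["price"])  # read directly from the payload
```

The scraped value comes back as a string that happens to sit inside a `<span>`; the JSON value comes back as a typed number, because the format was designed to carry data.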
1/
If these researchers used a typical HTTP-based API that returns JSON, then what they did is NOT scraping.
CC: @404mediaco
RE: https://www.404media.co/researchers-scrape-2-billion-discord-messages-and-publish-them-online/
Oh joy, another "game-changing" #webscraper named #Scraperr 🤖—because apparently, the internet was just crying out for one more script-kiddie #tool to scrape and bloat their hard drives with HTML they’ll never use. Congrats, #GitHub user, your contribution to the overload of useless data is truly groundbreaking. 🚀🎉
https://github.com/jaypyles/Scraperr #dataoverload #scriptkiddie #HackerNews #ngated
Scraperr – A Self Hosted Webscraper
https://github.com/jaypyles/Scraperr
#HackerNews #Scraperr #Webscraper #SelfHosted #TechTools #OpenSource
To find out which activities exist in particular fields of work across the federated association, the DRK is experimenting with web scraping of the websites of its district and state branches.
➡️ https://drk-wohlfahrt.de/blog/eintrag/mit-webscraping-data-science-die-wohnungslosenhilfen-im-drk-verstehen.html ("How Data Science Can Support the DRK in Homeless Services")
#DRK #RotesKreuz #DataScience #DataScienceHub #Webscraping #Webscraper #DSSG #Wohlfahrt #Wohlfahrtspflege
#Development #Analyses
The text file that runs the internet · Is a basic social contract of the web falling apart? https://ilo.im/15xzdk
_____
#AI #AiModel #GenerativeAI #WebBot #WebCrawler #WebScraper #SearchEngine #Website #Blog #RobotsTxt
I am so happy with the first web application of my own that I have developed 🎉: Tris, a simple and free web crawler 🕸️ 🕷️ !
You can try it for free online: https://tris.fly.dev, limited to 3 parallel crawls and 100 links at a path depth of 3.
The next thing I will add is a text input to set a target domain, hhh, now I am making it hard! 🙈
#node #nodejs #web #webcrawler #crawler #seo #datatools #webscraper #scraping #seotools #seotool #tris #triswebcrawler #webapp #indie #indiedev
I am so happy to get recommendations on fly.io here. I managed to finally deploy my NodeJS web scraper app. World meet Tris: https://tris.fly.dev
#Development #Collections
Dark Visitors · A list of known AI agents on the internet https://ilo.im/15xjhu
_____
#WebDev #WebScraper #AI #GenerativeAI #Chatbot #Backend #UserAgent #RobotsTxt
Went through a series of side-quests leading to rabbit holes containing kettles of fish to finally get some data from a #webscraper into a persistent database #arangodb. Learned a few things on the way and documented them in a blag post: https://blag.nullteilerfrei.de/2023/10/13/install-nsq-on-debian-with-init-d-and-nginx/
#cryptography #EllipticCurves #nginx #nsq #debian #pyArango #brainpool
Writing a Web Scraper in Rust using Reqwest: https://www.shuttle.rs/blog/2023/09/13/web-scraping-rust-reqwest
I did it again!
So I created #MastoBot, a generic #Python Mastodon bot that allows anyone to create a bot.
I created a few versions, and I use it for @3dprinting. But naturally, knowing how to implement it and develop functions, I needed a use case.
So, after a discussion this morning, I spent the entire day writing @Python. Yes, I did it again.
However, this one now has a built-in #webscraper to cross-post new posts from https://discuss.python.org/, because why not.
This @Python bot required a few things, and updates were made to #MastoBot. I had to make it even more generic, implement an overkill datastore with #Redis, and extend the config system.
@Python will behave exactly like @3dprinting with the added feature of cross-posts. These posts will, however, be "followers only" posts, so as not to pollute #Python and flood everything initially.
The bot will #boost parent posts, allowing for threads and discussions to be created.
The source code will be out tomorrow, just cleaning up.
I have found you. It turns out that if you access Beautiful Soup elements and don't call decompose() on them, they will cause memory leaks.
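A minimal sketch of that pattern, assuming the bs4 package and a scraper that processes many pages in a loop; `extract_titles` and the `h1` selector are made-up names for illustration:

```python
from bs4 import BeautifulSoup

def extract_titles(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # Copy out plain strings instead of keeping Tag objects:
    # every Tag holds a reference back into the parse tree, so
    # storing Tags keeps each page's whole tree alive.
    titles = [tag.get_text() for tag in soup.find_all("h1")]
    # Explicitly destroy the tree so a long-running scraper
    # does not accumulate soup objects between pages.
    soup.decompose()
    return titles
```

The key points are returning plain `str` copies (via `get_text()`) rather than `Tag` objects, and calling `decompose()` once the page has been processed.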
#OpenRefine is a free-software tool for cleaning and reshaping data. It can also be used as a Swiss Army knife for Wikimedia, to:
- import data into Wikidata
- import data into Commons (new in 2021 🎉)
- import files into Commons (new in 2022 🎉)
- retrieve pages from any Wikimedia project
- retrieve any page on the Internet (#webScraper).
Thank you @belett !
Source: https://programme.wikiconvention.fr/#_session-openrefine
#WikiConvention