Maxun v0.0.32 launches with AI-native features and real-time recording. It is open source, can be self-hosted, and extracts web data without code. It integrates with LlamaIndex, LangChain, OpenAI SDK, and many other AI frameworks through its SDK. AI Extract mode navigates automatically, with no URL needed. Real-time recording accurately captures actions: typing, clicking, scrolling, navigating. Well suited to building workflows and intelligent agents. #Maxun #WebScraper #AIIntegration #OpenSource #DataExtraction #TríchXuấtDữLiệu #AI #MãN

Cory – Blocking Countries because of scrapers

What the title says: Cory is blocking countries because of misbehaving scrapers. We do a bit of this at work, blocking misbehaving countries when they flood our sites with traffic. There is very little reason for anyone to visit the website of a city unless they live in the city, maybe the city next door, or maybe the country the city is in.

99% of the time when a site gets DDoSed, the traffic is coming from somewhere outside the country. The leading countries are India, China and North Korea. Sure, a single person or a family could be researching a city, but that doesn’t explain the traffic floods.

Many of our customers use Cloudflare, so we just block those countries at the Cloudflare level and call it a day. I go back after a few weeks and remove the block, because it is reasonable to expect some valid traffic.

I had to take a line like that on my own site as well, blocking a bunch of offending scrapers and bots by country. It sucks to stop regular people from visiting my site, but I’ve already dealt with a bill of $5k in a month that should have been $50, and I don’t need another one.

#webCrawler #webScraper

https://curtismchale.ca/2026/01/14/cory-blocking-countries-because-of-scrapers/
#LinksOfInterest #WebCrawler #WebScraper

Bright Data’s new API lets developers weave AI/ML models, LLMs and generative AI directly into web‑scraping workflows while keeping bots at bay. JavaScript‑ready, open‑source friendly, and built for seamless anti‑bot protection. Dive into the benchmarks and see how it powers smarter data pipelines. #BrightDataAPI #AIintegration #AntiBot #WebScraper

🔗 https://aidailypost.com/news/bright-data-api-delivers-seamless-aiml-integration-antibot-protection

🔍 / #software / #automation / #scraping

#WebScraper - The #1 web scraping extension

The most popular web scraping extension. Start scraping in minutes. Automate your tasks with our Cloud Scraper. No software to download, no coding needed.

🐱🔗 https://laravista.altervista.org/CatLink/links/454

#catlink #softwareautomation

Web search without OpenWebUI? Try an MCP server with Jan to scrape ddgs, or quantized models with llama.cpp #TìmKiếmWeb #WebSearch #OpenWebUI #LLaMA #AI #MCP #Jan #QuantizedModels #LlamaCpp #TrợGiúp #HỗTrợ #AItools #WebScraper #TìmKiếmTrựcTuyến #CôngCụTìmKiếm #TrangWeb #TìmKiếmNhanh

https://www.reddit.com/r/LocalLLaMA/comments/1os0xwn/how_to_get_web_search_without_openwebui/

🤖 Getting the list of all images in an HTML page with PHP
Building a web scraper in PHP to get the complete list of all the images at a URL...

👉 https://www.selectallfromdual.com/blog/1639

#html #php #webscraper
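
The linked article works in PHP; the sketch below is not taken from it and swaps in Python's standard library to illustrate the same idea of listing every image on a page. The names `ImageCollector` and `list_images` and the example URLs are hypothetical, chosen only for illustration. It assumes the HTML has already been downloaded and that image locations live in the `src` attribute of `<img>` tags, so lazy-loaded or CSS background images would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the src of every <img> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also arrive here by default.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Turn relative paths like /a.png into absolute URLs.
            self.images.append(urljoin(self.base_url, src))


def list_images(html, base_url):
    """Return the absolute URLs of all <img src=...> found in the given HTML."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.images


if __name__ == "__main__":
    sample = '<body><img src="/a.png"><img src="https://cdn.example.com/b.jpg"/></body>'
    for url in list_images(sample, "https://example.com/page"):
        print(url)
```

Parsing is kept separate from fetching so the function can be tried on a saved page before it is pointed at a live site.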

8 web scraping & crawling tools with n8n integration (workflow template as a free download)

Today we look at how you can do web scraping and crawling. We go through 8 different tools and connect each one directly to n8n, so that you can process the extracted data further in a workflow.

https://www.youtube.com/watch?v=LP571gnIg7A

#n8n #ki #automatisierung #webscraping #webcrawler #webscraper

3/

For more on scraping (as in web-scraping) see here:
https://mastodon.social/@reiver/114353728684249608

CC: @404mediaco

#Scraper #Scraping #WebScraper #WebScraping

2/

Scraping (as in Web Scraping) is the act of extracting data from HTML web-pages where the data is NOT machine-legible.

If the data, even in an HTML web-page, is in a machine-legible format, then it is NOT scraping.

...

And, getting data in JSON (key-value pairs) is definitely NOT scraping — as JSON's purpose is to communicate data in a machine-legible manner.

CC: @404mediaco

#Scraper #Scraping #WebScraper #WebScraping
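
One way to picture the distinction drawn in this thread is to read the same record twice, once from a JSON body and once from HTML markup. This is only an illustrative sketch: the sample payloads, the `author` and `message` class names, and the helper names `read_json` and `scrape_html` are all made up, and nothing here reflects Discord's actual API.

```python
import json
from html.parser import HTMLParser

# Hypothetical payloads carrying the same record in two forms.
JSON_BODY = '{"author": "alice", "message": "hello"}'
HTML_BODY = '<div class="post"><span class="author">alice</span><p class="message">hello</p></div>'


def read_json(body):
    """Machine-legible input: one standard decode call, no guessing."""
    return json.loads(body)


class PostScraper(HTMLParser):
    """Scraping: recovers fields by relying on presentation markup (CSS classes)."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("author", "message"):
            self._field = css_class

    def handle_data(self, data):
        # Store the first text found after a tag we care about.
        if self._field:
            self.record[self._field] = data.strip()
            self._field = None


def scrape_html(body):
    scraper = PostScraper()
    scraper.feed(body)
    return scraper.record


if __name__ == "__main__":
    print(read_json(JSON_BODY))
    print(scrape_html(HTML_BODY))
```

The JSON path depends only on the data format, while the HTML path depends on how the page happens to be laid out and breaks if the markup changes.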

1/

If these researchers used a typical HTTP-based API that returns JSON, then —

What these researchers did is NOT scraping.

CC: @404mediaco

RE: https://www.404media.co/researchers-scrape-2-billion-discord-messages-and-publish-them-online/

#Scraper #Scraping #WebScraper #WebScraping

"Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active."

Oh joy, another "game-changing" #webscraper named #Scraperr 🤖—because apparently, the internet was just crying out for one more script-kiddie #tool to scrape and bloat their hard drives with HTML they’ll never use. Congrats, #GitHub user, your contribution to the overload of useless data is truly groundbreaking. 🚀🎉
https://github.com/jaypyles/Scraperr #dataoverload #scriptkiddie #HackerNews #ngated

Scraperr – A Self Hosted Webscraper

https://github.com/jaypyles/Scraperr

#HackerNews #Scraperr #Webscraper #SelfHosted #TechTools #OpenSource

To find out which activities exist in particular fields of work across its federated association, the DRK is experimenting with web scraping the websites of its district and state associations.
➡️ https://drk-wohlfahrt.de/blog/eintrag/mit-webscraping-data-science-die-wohnungslosenhilfen-im-drk-verstehen.html ("How data science can support the DRK in its homelessness services")

#DRK #RotesKreuz #DataScience #DataScienceHub #Webscraping #Webscraper #DSSG #Wohlfahrt #Wohlfahrtspflege

#Development #Analyses
The text file that runs the internet · Is a basic social contract of the web falling apart? https://ilo.im/15xzdk

_____
#AI #AiModel #GenerativeAI #WebBot #WebCrawler #WebScraper #SearchEngine #Website #Blog #RobotsTxt
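
For readers who have not worked with the text file in question, here is a minimal sketch of how a well-behaved crawler might consult robots.txt rules before fetching a page, using Python's standard `urllib.robotparser`. The rules, the bot names `ExampleBot` and `OtherBot`, and the helper `is_allowed` are invented for illustration and do not come from the linked article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is shut out entirely,
# everyone else is only kept away from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/"),
        ("OtherBot", "https://example.com/blog"),
        ("OtherBot", "https://example.com/private/report"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing enforces these rules; a crawler only honors them if it chooses to run a check like this before each request.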
