
The Website-Crawler tool collects data from websites as JSON or CSV, in a form suited to large language models (LLMs). It can crawl or scrape an entire website quickly and is easy to use. #WebCrawler #DataExtraction #LLM #AI #CôngCụ #WebScraping #MachineLearning

https://www.reddit.com/r/LocalLLaMA/comments/1qt0t3g/github_websitecrawler_extract_data_from_websites/

Cory – Blocking Countries because of scrapers

As the title says, Cory is blocking countries because of misbehaving scrapers. We do a bit of this at work, blocking countries when they flood our sites with traffic. There is very little reason for anyone to visit a city’s website unless they live in that city, maybe the city next door, or maybe elsewhere in the city’s own country.

99% of the time, when a site gets DDoSed, the traffic comes from outside the country. The leading sources are India, China and North Korea. Sure, a single person or a family could be researching a city, but that doesn’t explain the traffic floods.

Many of our customers use Cloudflare, so we just block the offending countries at the Cloudflare level and call it a day. I go back after a few weeks and remove the block, because some of that traffic is legitimate.

I had to take a similar line on my own site as well, blocking a bunch of offending scrapers and bots by country. It sucks to stop regular people from visiting my site, but I’ve already dealt with a $5k bill in a month that should have cost $50, and I don’t need another one.

#webCrawler #webScraper

https://curtismchale.ca/2026/01/14/cory-blocking-countries-because-of-scrapers/
#LinksOfInterest #WebCrawler #WebScraper

Exa-d: How to store the web in S3
https://exa.ai/blog/exa-d
#ycombinator #ai_search_engine #web_search_api #webcrawler #serp_api #web_api #google_search_api #google_serp_api #people_search_engines #perplexity_ai_search_engine_features #ai_search_engine_free #search_engine_ai #free_people_search_engines #best_ai_search_engine #web_api_security #ai_search_engines #search_api #free_ai_search_engine #web_scraping_api #bing_search_api #webcrawler_search_engine #search_engine_rankings_api

🦀 Crab.so: a free, lightweight web crawler for SEO. It was built as a side project and is not yet a Screaming Frog competitor, but it is useful for site audits. All feedback and suggestions for improvement are welcome! #SEO #WebCrawler #CôngCụMiễnPhí #Crawl #SideProject #CôngCụSEO

https://www.reddit.com/r/SideProject/comments/1qc09ox/a_free_lightweight_screaming_frog_alternative/

I've checked on #YaCy from time to time because the project seemed very interesting, but the resources it needs (disk space and memory) are too large for it to run on cheap hardware as a hobby. I don't know of any other #OpenSource (optionally) #distributed #searchEngine with a #webCrawler included (independent of Google and co., unlike metasearch engines).
I thought maybe somebody would rewrite it in Rust or something, but no luck so far. Significant optimisations were announced at one point, but the resources needed still seem to be huge.
Sadly, the focus nowadays seems to be on adding #AI to it. I guess I'll wait until the bubble is gone. 😕

Is there a standard hostname/domain to use in the documentation for a web spider? Ideally the host/domain should exist, have multiple webpages, and be OK with random traffic from people testing the web spider example code.

#webspider #webcrawler

Researchers Hack ChatGPT Memories and Web Search Features

Attackers can set up a new website that is likely to show up in web search results for niche topics. ChatGPT relies on Bing and OpenAI’s crawler for web searches.

#chatgpt #openai #bing #webcrawler #security #cybersecurity #hackers #hacking #hacked

https://www.securityweek.com/researchers-hack-chatgpt-memories-and-web-search-features/

What kind of #Crawler has been roaming my website for the past few days? I don't usually see this many connections from the web server.

Let's see when it is done. According to a check of the IPs: CHINANET, 21ViaNet(China),Inc., Tencent cloud computing (Beijing)
#China #Webcrawler

Site hosts: is my prototype independent news crawler bothering you, or would you, on the contrary, like more details? Feel free to contact me.

#searchengine #WebCrawler #noia #oldschool

Wikipedia records a decline in visitors due to AI and social media
Wikipedia is losing visitors in 2025. The causes are artificial intelligence in search engines and the growing use of social media.

https://www.apfeltalk.de/magazin/news/wikipedia-verzeichnet-besucherrueckgang-durch-ki-und-social-media/
#KI #News #Besucherzahlen #Google #KnstlicheIntelligenz #PewResearch #SocialMedia #Webcrawler #Wikipedia #Wissensplattform

8 web scraping & crawling tools with n8n integration (workflow template available as a free download)

Today we look at how you can do web scraping and crawling. We go through 8 different tools and connect each of them directly to n8n, so that you can process the extracted data further in a workflow.

https://www.youtube.com/watch?v=LP571gnIg7A

#n8n #ki #automatisierung #webscraping #webcrawler #webscraper

https://social.emucafe.org/naferrell/user-agent-godhatesmastodon-08-22-25/

The New Leaf Journal became inaccessible for about 1-2 minutes this morning. Fortunately, I opened the site almost immediately when it happened. I opened my server logs and found what was probably the offending bot/scraper so I could block it. I kept the server logs open to watch for any other questionable activity. I saw an interesting user-agent string.

Aug 22 11:22:46 [IP ADDRESS] - - [22/Aug/2025:15:22:46 +0000] "GET / HTTP/1.1" 200 63425 "-" "GodHatesMastodon"
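
For anyone who wants to do the same kind of log check, here is a minimal sketch that pulls the user-agent out of lines shaped like the one above and counts requests per agent. It assumes the tail of each line follows the common "combined" access-log layout; the names `LOG_PATTERN`, `parse_line` and `count_user_agents` are made up for this example and are not from the original post.

```python
import re
from collections import Counter

# Assumed layout of the tail of each line (combined log format):
# [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the request fields of one access-log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    return match.groupdict() if match else None


def count_user_agents(lines):
    """Tally requests per user-agent string, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields:
            counts[fields["agent"]] += 1
    return counts


# The log line quoted in the post, with the IP address redacted as in the original.
sample = (
    'Aug 22 11:22:46 [IP ADDRESS] - - [22/Aug/2025:15:22:46 +0000] '
    '"GET / HTTP/1.1" 200 63425 "-" "GodHatesMastodon"'
)
```

Calling `count_user_agents(open("access.log"))` on a real log (the file name is an assumption) and then `.most_common(10)` on the result would list the busiest agents, which is usually enough to spot a flood.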

My two sites are often crawled by Mastodon servers and Fediverse-related crawlers because both sites function as ActivityPub servers (you can follow this account on the Fediverse at @naferrell@social.emucafe.org). I had not previously seen the crawler GodHatesMastodon, but I understand through the grapevine that there are some questionable instances out there. Fortunately, there is no reason for anyone to hate The New Leaf Journal. As my friend and colleague Victor V. Gurbo once explained, “The New Leaf Journal is a family website.”

#activitypub #fediverse #mastodon #webCrawler

#Firecrawl, an #opensource #webcrawler for #developers and #AIagents, raised $14.5 million in a Series A round led by Nexus Venture Partners. The company, which is already profitable, plans to use the funds to expand its team and develop tools to help website owners get paid when AI uses their content. https://techcrunch.com/2025/08/19/ai-crawler-firecrawl-raises-14-5m-is-still-looking-to-hire-agents-as-employees/?Pirates.BZ #Pirates #Tech #Startup #News

Actually, it would be cool if there were a standardized place where #webcrawler could pick up a dump of a given website, containing everything that search engines are supposed to index. It could simply live under a https://en.wikipedia.org/wiki/Well-known_URI.
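
A rough sketch of what a crawler-side lookup could look like. The suffix `site-dump` is purely hypothetical (no such well-known URI is registered, as far as this sketch assumes); only the `/.well-known/` prefix at the root of the origin comes from the well-known URI convention.

```python
from urllib.parse import urljoin

# Hypothetical suffix: invented for illustration, not a registered well-known URI.
DUMP_SUFFIX = "site-dump"


def dump_url(page_url, suffix=DUMP_SUFFIX):
    """Build the well-known dump URL for the site that serves page_url."""
    # The leading slash makes urljoin resolve against the root of the origin,
    # which is where the /.well-known/ prefix is expected to live.
    return urljoin(page_url, "/.well-known/" + suffix)
```

A crawler that lands on any page of a site could call `dump_url(page_url)` once per host and fetch that single resource instead of walking every link.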

Cloudflare blocks the Perplexity bot.

Cloudflare accuses Perplexity of “stealth crawling” 🕵️ According to #Cloudflare, the #AI search service #Perplexity deliberately circumvents #blocks against its #Webcrawler by disguising its #identity.

Techniques for evading blocks 🔄 Perplexity is said to disguise bots as Chrome browsers, rotate IP addresses and change network identifiers in order to keep scraping content. (1/2)

Perplexity AI caught red-handed: the company allegedly bypassed website blocks in secret

Cloudflare, a giant in internet security and infrastructure, has published a report accusing the popular AI search engine Perplexity of unethical practices.

According to the investigation, Perplexity allegedly used covert, undeclared crawlers to fetch content from websites that had explicitly blocked the company's bots.

Cloudflare opened its investigation after complaints from customers who noticed that Perplexity was still indexing their sites despite the blocks they had put in place. The mechanism turned out to be simple but effective. When Perplexity's standard bot (PerplexityBot) hit a block, the company allegedly switched to a “stealth mode”. It then used crawlers with a generic browser identifier (e.g. Chrome), which also operated from a pool of undeclared IP addresses and from various networks in order to hide their true identity. Most importantly, these covert bots did not even try to read robots.txt, the file in which site owners set the rules for bots.
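
To make the robots.txt point concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper and the example URL are assumptions made for illustration; they are not taken from Cloudflare's report. It shows that the rules are keyed on the name a bot declares, so a crawler that presents a generic browser identifier is simply not matched by a rule aimed at PerplexityBot.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block PerplexityBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


declared = is_allowed(ROBOTS_TXT, "PerplexityBot", "https://example.com/page")
# A generic browser-style identifier (made up for this example).
generic = is_allowed(ROBOTS_TXT, "Mozilla/5.0 Chrome/120.0", "https://example.com/page")
```

Here `declared` comes out `False` and `generic` comes out `True`: robots.txt only works when the crawler both reads the file and identifies itself honestly.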

The scale of the problem was enormous. Cloudflare observed this behavior on tens of thousands of domains, and the number of requests from Perplexity's hidden bots reached millions per day. The practice stands in contrast to that of other companies, such as OpenAI, which clearly declare their bots and respect the directives in the robots.txt files configured by site owners.

In response to these findings, Cloudflare took decisive steps. First, the company removed Perplexity from its list of “verified bots”, which will make it harder for the bot to interact with sites protected by Cloudflare. Second, it added new heuristic protections to its rules. Instead of blocking specific, known bots, the system will now automatically detect and block suspicious behavior, such as a crawler's attempt to hide its identity. This protection is available to all Cloudflare customers, including users of the free plans and not only paying ones.

No more browsing, time for action: Perplexity challenges Google with its AI browser Comet

#AI #Cloudflare #cyberbezpieczeństwo #news #PerplexityAI #prywatność #robotsTxt #scrapowanieDanych #sztucznaInteligencja #webCrawler

#AIEngineering #aiethics #webcrawler

AI site Perplexity uses “stealth tactics” to flout no-crawl edicts, Cloudflare says

https://arstechnica.com/information-technology/2025/08/ai-site-perplexity-uses-stealth-tactics-to-flout-no-crawl-edicts-cloudflare-says/

#IETF discusses measures against the onslaught of AI #crawlers | heise online https://www.heise.de/news/Technische-Massnahmen-gegen-den-Ansturm-der-KI-Crawler-10497930.html #Webcrawler #ArtificialIntelligence

WebCrawler was the first web search engine to provide full-text search. It went live on the internet on April 20, 1994 and was created by Brian Pinkerton at the University of Washington. It was bought by America Online on June 1, 1995 and sold to Excite on April 1, 1997.
#retrocomputingmx #webcrawler #InternetHistory

#WebCrawler
