#webcrawling

N-gated Hacker News (@ngate)
2026-01-20

Ah, , because using was just too easy before 🙄. Now with extra layers of , just in case you weren't already confused enough by web crawling! 🕸️😵‍💫
github.com/rodricios/wxpath

Hacker News (@h4ckernews)
2026-01-20
2026-01-04

Maxun v0.0.31 has launched: an open-source, self-hosted tool for automated web data collection and discovery. Key features: smart Crawl (sitemap analysis, regex filtering, depth control) and query-based Search (time filtering, full content extraction). Well suited to research and large-scale website mapping. #Maxun #WebCrawling #DataExtraction #OpenSource #CôngCụLậpTrình #MãNguồnMở #ThuThậpDữLiệu

reddit.com/r/selfhosted/commen
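
The regex filtering and depth control mentioned in the post above can be illustrated with a minimal sketch. This is not Maxun's implementation: the `LINKS` graph, the `crawl` function and the `example.com` URLs are hypothetical stand-ins, and an in-memory dictionary replaces real HTTP fetches so the example runs offline.

```python
import re
from collections import deque

# Hypothetical link graph standing in for real page fetches (assumption).
LINKS = {
    "https://example.com/": [
        "https://example.com/blog/",
        "https://example.com/login",
    ],
    "https://example.com/blog/": [
        "https://example.com/blog/post-1",
        "https://example.com/blog/post-2",
    ],
    "https://example.com/blog/post-1": [
        "https://example.com/blog/post-1/comments",
    ],
}


def crawl(start, include, max_depth):
    """Breadth-first crawl that follows only links matching `include`
    and stops expanding pages once `max_depth` is reached."""
    pattern = re.compile(include)
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth control: do not follow links from this page
        for link in LINKS.get(url, []):
            # regex filter: skip links outside the wanted scope
            if link not in seen and pattern.search(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("https://example.com/", r"/blog/", max_depth=2)
```

With these inputs the login page is filtered out by the regex and the comments page is cut off by the depth limit.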

2025-12-24

A new protocol called **SCP Protocol** has been proposed to address the instability of collecting structured data for AI. The protocol aims to optimize web crawling, reducing latency and making information extraction more efficient. It is an open-source project that has drawn interest from the developer community.

#AI #MachineLearning #WebCrawling #CôngNghệAI #MãNguồnMở

reddit.com/r/opensource/commen

Marcus Schuler (@schuler)
2025-12-15

Cloudflare's 2025 data reveals Google's structural advantage in AI training: Googlebot crawled 11.6% of web pages vs OpenAI's 3.6%. Publishers face an impossible choice - they can't block Google's AI crawling without losing search visibility entirely, since the same bot handles both functions.

implicator.ai/googles-quiet-co

2025-12-01

Search Engine Roundtable: OpenAI Scales Up Crawling & Bots For The Holidays. “OpenAI is reportedly scaling up its crawling infrastructure for the holiday shopping season. The folks at Merj noticed OpenAI adding a lot of new IP ranges for its bots and crawlers.”

https://rbfirehose.com/2025/12/01/search-engine-roundtable-openai-scales-up-crawling-bots-for-the-holidays/

PPC Land (@ppcland)
2025-11-20

Google updates crawling infrastructure documentation with new technical details: Google publishes updated crawling infrastructure documentation on November 20, 2025, adding HTTP caching support details and transfer protocol specifications. ppc.land/google-updates-crawli

2025-11-01

Released scrapy-contrib-bigexporter 1.0.0 (codeberg.org/ZuInnoTe/scrapy-c) - additional export formats for the webscraping framework Scrapy.

Migrated parquet export from fastparquet to pyarrow as fastparquet is deprecated (docs.dask.org/en/stable/change)

Migrated orc export from pyorc to pyarrow to reduce the number of dependencies

#scrapy #crawling #python #parquet #orc #pyarrow #webcrawling #scraping

Lenin alevski 🕵️💻 (@alevsk@infosec.exchange)
2025-10-22

Why does Katana stand out as a web crawler? 🤔✨

Katana blends **speed** and **flexibility**, supporting standard and headless crawling. It handles JavaScript, automatic form filling, and advanced scope control features like regex-based filtering. Perfect for modern web exploration. #WebCrawling #OpenSource

🔗 Project link on #GitHub 👉 github.com/projectdiscovery/ka

#Infosec #Cybersecurity #Software #Technology #News #CTF #Cybersecuritycareer #hacking #redteam #blueteam #purpleteam #tips #opensource #cloudsecurity

— ✨
🔐 P.S. Found this helpful? Tap Follow for more cybersecurity tips and insights! I share weekly content for professionals and people who want to get into cyber. Happy hacking 💻🏴‍☠️

Mind Lude (@mindlude)
2025-09-11

Remember when 'robots.txt' was supposed to solve all our crawling problems? Online media brands are trying a new protocol to deter 'unwanted' AI crawlers. Because clearly, we need more digital fences. What's your bet on how long it takes for a savvy AI to find a workaround?

Read more: cnet.com/tech/services-and-sof

2025-08-31

Search Engine Land: Google fixes reduced crawling issue impacting some websites. “Google has confirmed it fixed an issue with its crawlers impacting ‘some sites.’ The issue was ‘reduced / fluctuating crawling’ from Google’s end with Googlebot. It is now resolved and Google said the crawling should pick back up in the near future.”

https://rbfirehose.com/2025/08/31/search-engine-land-google-fixes-reduced-crawling-issue-impacting-some-websites/

Miguel Afonso Caetano (@remixtures@tldr.nettime.org)
2025-08-06

"Perplexity’s accusations aren’t exactly fair, either. One argument that Prince and Cloudflare used for calling out Perplexity’s methods was that OpenAI doesn’t behave in the same way.

“OpenAI is an example of a leading AI company that follows these best practices,” Cloudflare wrote. “They respect robots.txt and do not try to evade either a robots.txt directive or a network level block. And ChatGPT Agent is signing http requests using the newly proposed open standard Web Bot Auth.”

Web Bot Auth is a Cloudflare-supported standard being developed by the Internet Engineering Task Force that hopes to create a cryptographic method for identifying AI agent web requests.

The debate comes as bot activity reshapes the internet. As TechCrunch has previously reported, bots seeking to scrape massive amounts of content to train AI models have become a menace, especially to smaller sites.

For the first time in the internet’s history, bot activity is currently outstripping human activity online, with AI traffic accounting for over 50%, according to Imperva’s Bad Bot report released last month. Most of that activity is coming from LLMs. But the report also found that malicious bots now make up 37% of all internet traffic. That’s activity that includes everything from persistent scraping to unauthorized login attempts."

techcrunch.com/2025/08/05/some

#AI #GenerativeAI #AITraining #Perplexity #Cloudflare #AIAgents #WebCrawling #Chatbots #LLMs

2025-08-06

KIMissbrauch

Cloudflare accuses the AI provider #Perplexity of using undeclared crawlers to gain access to blocked websites.

Despite robots.txt prohibitions and IP blocks, Perplexity is said to be covertly reading content using rotating user agents and IPs.

That would be a violation of established web standards and a disregard for website preferences.

blog.cloudflare.com/perplexity

#WebCrawling #BotTraffic #Cloudflare #WebSecurity #PerplexityAI #Chatbots

PPC Land (@ppcland)
2025-04-03

ICYMI: Google outlines pathway for robots.txt protocol to evolve: How the 30-year-old web crawler control standard could adopt new functionalities while maintaining its simplicity. ppc.land/google-outlines-pathw
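
Since several posts in this feed turn on whether crawlers honor robots.txt, here is a minimal sketch of how a well-behaved crawler might check it, using Python's standard `urllib.robotparser`. The robots.txt content, the bot names `ExampleAIBot` and `SomeOtherBot`, and the `example.com` URLs are hypothetical; the rules are parsed from a string rather than fetched, so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only (assumption).
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named AI bot is disallowed everywhere.
allowed_ai = parser.can_fetch("ExampleAIBot", "https://example.com/articles/1")

# Any other bot falls back to the wildcard rules.
allowed_generic = parser.can_fetch("SomeOtherBot", "https://example.com/articles/1")
blocked_private = parser.can_fetch("SomeOtherBot", "https://example.com/private/x")
```

The check is purely advisory: nothing in the protocol stops a crawler that ignores the answer, which is the gap that network-level blocks and signed requests try to close.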
