
Ah, #wxpath, because using #XPath was just too easy before 🙄. Now with extra layers of #complexity, just in case you weren't already confused enough by web crawling! 🕸️😵‍💫
https://github.com/rodricios/wxpath #webcrawling #technews #developerhumor #HackerNews #ngated

wxpath – Declarative web crawling in XPath

https://github.com/rodricios/wxpath

#HackerNews #wxpath #webcrawling #XPath #technology #open-source #GitHub

Maxun v0.0.31 has been released: an open-source, self-hosted tool for automated web data collection and discovery. Highlights: smart Crawl (Sitemap parsing, Regex filtering, depth control) and query-based Search (time-range filtering, full content extraction). Well suited to research and large-scale website mapping. #Maxun #WebCrawling #DataExtraction #OpenSource #CôngCụLậpTrình #MãNguồnMở #ThuThậpDữLiệu

https://www.reddit.com/r/selfhosted/comments/1q42v4n/maxun_v0031_autonomous_web_discovery_search/
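
For anyone curious what sitemap parsing, regex filtering, and depth control look like in combination, here is a minimal standard-library sketch. It is not Maxun's code or API: the function names, the sample sitemap, and the choice to approximate depth as the number of URL path segments are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by the sitemaps.org schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def path_depth(url):
    """Count non-empty path segments, e.g. https://x/a/b has depth 2."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def select_urls(xml_text, include_pattern, max_depth):
    """Keep URLs that match the regex and are no deeper than max_depth."""
    pattern = re.compile(include_pattern)
    return [
        url
        for url in sitemap_urls(xml_text)
        if pattern.search(url) and path_depth(url) <= max_depth
    ]


# Hypothetical sitemap used only for this example.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/blog/2025/12/post-2</loc></url>
  <url><loc>https://example.com/shop/item-9</loc></url>
</urlset>"""

selected = select_urls(SAMPLE_SITEMAP, r"/blog/", max_depth=2)
print(selected)
```

A real crawler would more likely measure depth in link hops from the seed page; path segments are used here only because a sitemap alone carries no link graph.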

Google Built Its Empire Scraping The Web. Now It’s Suing To Stop Others From Scraping Google

https://fed.brid.gy/r/https://www.techdirt.com/2025/12/24/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google/

A new protocol called **SCP Protocol** has been proposed to address instability when collecting structured data for AI. The protocol helps optimize web crawling, reduce latency, and improve the efficiency of information extraction. It is an open-source project that has drawn interest from the tech developer community.

#AI #MachineLearning #WebCrawling #CôngNghệAI #MãNguồnMở

https://www.reddit.com/r/opensource/comments/1puylx0/specification_addressing_inefficiencies_in/

Cloudflare's 2025 data reveals Google's structural advantage in AI training: Googlebot crawled 11.6% of web pages vs OpenAI's 3.6%. Publishers face an impossible choice - they can't block Google's AI crawling without losing search visibility entirely, since the same bot handles both functions. #AI #WebCrawling #DigitalRights

https://www.implicator.ai/googles-quiet-conquest-what-cloudflares-data-actually-reveals-about-ais-power-grab/

Search Engine Roundtable: OpenAI Scales Up Crawling & Bots For The Holidays. “OpenAI is reportedly scaling up its crawling infrastructure for the holiday shopping season. The folks at Merj noticed OpenAI adding a lot of new IP ranges for its bots and crawlers.”

https://rbfirehose.com/2025/12/01/search-engine-roundtable-openai-scales-up-crawling-bots-for-the-holidays/

Google updates crawling infrastructure documentation with new technical details: Google publishes updated crawling infrastructure documentation on November 20, 2025, adding HTTP caching support details and transfer protocol specifications. https://ppc.land/google-updates-crawling-infrastructure-documentation-with-new-technical-details/ #GoogleUpdates #WebCrawling #HTTPCaching #SEO #TechnicalSEO
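
As background on what HTTP caching means for a crawler, here is a rough sketch of ETag validation with `If-None-Match`, the generic conditional-request mechanism in the HTTP specification. It is not taken from Google's documentation: the `respond` and `make_etag` helpers are hypothetical, header lookup is simplified to exact-case dictionary keys, and weak validators (`W/` prefix) are not handled.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from a hash of the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, request_headers: dict) -> tuple:
    """Return (status, headers, payload) for a GET, honouring If-None-Match."""
    etag = make_etag(body)
    raw = request_headers.get("If-None-Match", "")
    client_tags = [tag.strip() for tag in raw.split(",") if tag.strip()]
    if etag in client_tags or "*" in client_tags:
        # The client's cached copy is still current: send no body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


page = b"<html>hello</html>"

# First fetch: no validator yet, so the full page comes back.
first_status, first_headers, _ = respond(page, {})

# Revisit with the stored ETag: the page is unchanged, so 304 and an empty payload.
second_status, _, second_payload = respond(page, {"If-None-Match": first_headers["ETag"]})

# Revisit after the page changed: the old ETag no longer matches, so 200 again.
changed_status, _, _ = respond(b"<html>updated</html>", {"If-None-Match": first_headers["ETag"]})

print(first_status, second_status, changed_status)
```

The saving is on the revisit: a crawler that stores validators can recheck unchanged pages without downloading them again.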

Released scrapy-contrib-bigexporters 1.0.0 (https://codeberg.org/ZuInnoTe/scrapy-contrib-bigexporters) - additional export formats for the web scraping framework Scrapy.

Migrated parquet export from fastparquet to pyarrow as fastparquet is deprecated (https://docs.dask.org/en/stable/changelog.html#fastparquet-engine-deprecated)

Migrated orc export from pyorc to pyarrow to reduce the number of dependencies

#scrapy #crawling #python #parquet #orc #pyarrow #webcrawling #scraping

Why does Katana stand out as a web crawler? 🤔✨

Katana blends **speed** and **flexibility**, supporting standard and headless crawling. It handles JavaScript, automatic form filling, and advanced scope control features like regex-based filtering. Perfect for modern web exploration. #WebCrawling #OpenSource
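
To give a feel for regex-based scope control in general, here is a small sketch of an allow/deny filter applied to discovered URLs. This is not Katana's implementation or its CLI options: `in_scope`, the patterns, and the candidate URLs are all made up for illustration.

```python
import re


def in_scope(url, allow_patterns, deny_patterns):
    """A URL is in scope if no deny regex matches and at least one allow regex does."""
    if any(re.search(p, url) for p in deny_patterns):
        return False
    return any(re.search(p, url) for p in allow_patterns)


# Allow example.com and its subdomains; anchored so the host itself must match.
ALLOW = [r"^https://([a-z0-9-]+\.)*example\.com(/|$)"]
# Skip logout links and static assets.
DENY = [r"/logout\b", r"\.(png|jpg|css)$"]

candidates = [
    "https://example.com/",
    "https://shop.example.com/cart",
    "https://example.com/logout",
    "https://example.com/static/logo.png",
    "https://evil.test/example.com/",
]

to_visit = [url for url in candidates if in_scope(url, ALLOW, DENY)]
print(to_visit)
```

Anchoring the allow pattern at the start of the URL matters: an unanchored `example\.com` would also match the last candidate, where the domain only appears in the path.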

🔗 Project link on #GitHub 👉 https://github.com/projectdiscovery/katana

#Infosec #Cybersecurity #Software #Technology #News #CTF #Cybersecuritycareer #hacking #redteam #blueteam #purpleteam #tips #opensource #cloudsecurity


How to Recrawl a Website in Ahrefs Console | https://techygeekshome.info/how-to-recrawl-a-website-in-ahrefs-console/?fsp_sid=11609 | #ahrefs #Guide #refresh #SEO #AhrefsConsole
#RecrawlWebsite
#SEOTools
#DigitalMarketing
#WebsiteOptimization
#AhrefsGuide
#MastodonSEO
#WebCrawling
#SearchEngineOptimization
#SEOStrategy

Remember when 'robots.txt' was supposed to solve all our crawling problems? Online media brands are trying a new protocol to deter 'unwanted' AI crawlers. Because clearly, we need more digital fences. What's your bet on how long it takes for a savvy AI to find a workaround?

#AI #TechNews #WebCrawling #DigitalRights #Privacy
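
For context on why robots.txt is a fence only for crawlers that choose to respect it, here is a short sketch using Python's standard `urllib.robotparser`. The bot names and the robots.txt contents are invented for the example; the point is that the check runs inside the crawler, so nothing on the server enforces it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI crawler everywhere,
# and keep /private/ off limits for everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks before every fetch and skips the URL on False.
ai_allowed = parser.can_fetch("ExampleAIBot", "https://example.com/articles/1")
generic_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/articles/1")
generic_private = parser.can_fetch("SomeOtherBot", "https://example.com/private/x")

print(ai_allowed, generic_allowed, generic_private)
```

A crawler that never calls `can_fetch`, or that announces a different user agent, is not stopped by any of this.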

https://www.theverge.com/ai-artificial-intelligence/770646/switzerland-ai-model-llm-open-apertus More like this, please. #ai #webcrawling #opensource

Search Engine Land: Google fixes reduced crawling issue impacting some websites. “Google has confirmed it fixed an issue with its crawlers impacting ‘some sites.’ The issue was ‘reduced / fluctuating crawling’ from Google’s end with Googlebot. It is now resolved and Google said the crawling should pick back up in the near future.”

https://rbfirehose.com/2025/08/31/search-engine-land-google-fixes-reduced-crawling-issue-impacting-some-websites/

Robots.txt Is a Suicide Note

https://wiki.archiveteam.org/index.php/Robots.txt

#HackerNews #RobotsTxt #SuicideNote #WebCrawling #InternetArchive #TechEthics

"Perplexity’s accusations aren’t exactly fair, either. One argument that Prince and Cloudflare used for calling out Perplexity’s methods was that OpenAI doesn’t behave in the same way.

“OpenAI is an example of a leading AI company that follows these best practices,” Cloudflare wrote. “They respect robots.txt and do not try to evade either a robots.txt directive or a network level block. And ChatGPT Agent is signing http requests using the newly proposed open standard Web Bot Auth.”

Web Bot Auth is a Cloudflare-supported standard being developed by the Internet Engineering Task Force that hopes to create a cryptographic method for identifying AI agent web requests.

The debate comes as bot activity reshapes the internet. As TechCrunch has previously reported, bots seeking to scrape massive amounts of content to train AI models have become a menace, especially to smaller sites.

For the first time in the internet’s history, bot activity is currently outstripping human activity online, with AI traffic accounting for over 50%, according to Imperva’s Bad Bot report released last month. Most of that activity is coming from LLMs. But the report also found that malicious bots now make up 37% of all internet traffic. That’s activity that includes everything from persistent scraping to unauthorized login attempts."

https://techcrunch.com/2025/08/05/some-people-are-defending-perplexity-after-cloudflare-named-and-shamed-it/

#AI #GenerativeAI #AITraining #Perplexity #Cloudflare #AIAgents #WebCrawling #Chatbots #LLMs