#WebCrawling

Marcus Schuler @schuler
2025-12-15

Cloudflare's 2025 data reveals Google's structural advantage in AI training: Googlebot crawled 11.6% of web pages vs. OpenAI's 3.6%. Publishers face an impossible choice: they can't block Google's AI crawling without losing search visibility entirely, since the same bot handles both functions.

implicator.ai/googles-quiet-co

2025-12-01

Search Engine Roundtable: OpenAI Scales Up Crawling & Bots For The Holidays. “OpenAI is reportedly scaling up its crawling infrastructure for the holiday shopping season. The folks at Merj noticed OpenAI adding a lot of new IP ranges for its bots and crawlers.”

https://rbfirehose.com/2025/12/01/search-engine-roundtable-openai-scales-up-crawling-bots-for-the-holidays/

PPC Land @ppcland
2025-11-20

Google updates crawling infrastructure documentation with new technical details: Google publishes updated crawling infrastructure documentation on November 20, 2025, adding HTTP caching support details and transfer protocol specifications. ppc.land/google-updates-crawli

2025-11-01

Released scrapy-contrib-bigexporter 1.0.0 (codeberg.org/ZuInnoTe/scrapy-c) - additional export formats for the webscraping framework Scrapy.

Migrated parquet export from fastparquet to pyarrow as fastparquet is deprecated (docs.dask.org/en/stable/change)

Migrated orc export from pyorc to pyarrow to reduce the number of dependencies

#scrapy #crawling #python #parquet #orc #pyarrow #webcrawling #scraping

Lenin Alevski 🕵️💻 @alevsk@infosec.exchange
2025-10-22

Why does Katana stand out as a web crawler? 🤔✨

Katana blends **speed** and **flexibility**, supporting standard and headless crawling. It handles JavaScript, automatic form filling, and advanced scope control features like regex-based filtering. Perfect for modern web exploration. #WebCrawling #OpenSource
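A hypothetical invocation showing those features together; flag names are taken from katana's help output and may differ between versions, so treat this as a sketch and check `katana -h`:

```shell
# -headless : render pages in a headless browser (JavaScript support)
# -d        : maximum crawl depth
# -cs       : regex-based crawl scope, here restricting the crawl to /docs
katana -u https://example.com -headless -d 3 -cs "example\.com/docs/.*"
```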

🔗 Project link on #GitHub 👉 github.com/projectdiscovery/ka

#Infosec #Cybersecurity #Software #Technology #News #CTF #Cybersecuritycareer #hacking #redteam #blueteam #purpleteam #tips #opensource #cloudsecurity

— ✨
🔐 P.S. Found this helpful? Tap Follow for more cybersecurity tips and insights! I share weekly content for professionals and people who want to get into cyber. Happy hacking 💻🏴‍☠️

Mind Lude @mindlude
2025-09-11

Remember when 'robots.txt' was supposed to solve all our crawling problems? Online media brands are trying a new protocol to deter 'unwanted' AI crawlers. Because clearly, we need more digital fences. What's your bet on how long it takes for a savvy AI to find a workaround?

Read more: cnet.com/tech/services-and-sof

2025-08-31

Search Engine Land: Google fixes reduced crawling issue impacting some websites. “Google has confirmed it fixed an issue with its crawlers impacting ‘some sites.’ The issue was ‘reduced / fluctuating crawling’ from Google’s end with Googlebot. It is now resolved and Google said the crawling should pick back up in the near future.”

https://rbfirehose.com/2025/08/31/search-engine-land-google-fixes-reduced-crawling-issue-impacting-some-websites/

Miguel Afonso Caetano @remixtures@tldr.nettime.org
2025-08-06

"Perplexity’s accusations aren’t exactly fair, either. One argument that Prince and Cloudflare used for calling out Perplexity’s methods was that OpenAI doesn’t behave in the same way.

“OpenAI is an example of a leading AI company that follows these best practices,” Cloudflare wrote. “They respect robots.txt and do not try to evade either a robots.txt directive or a network level block. And ChatGPT Agent is signing http requests using the newly proposed open standard Web Bot Auth.”

Web Bot Auth is a Cloudflare-supported standard being developed by the Internet Engineering Task Force that hopes to create a cryptographic method for identifying AI agent web requests.
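The idea is that each bot request carries an asymmetric signature the origin can verify against the operator's published public key. A heavily simplified sketch of that flow using Ed25519 (the header names and signature base below are illustrative, not the draft's actual wire format, and it assumes the `cryptography` package):

```python
# Toy model of signed bot requests: the crawler signs a "signature base"
# derived from the request; the origin verifies it with the public key.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()   # held by the bot operator
public_key = private_key.public_key()        # published for origins to fetch

# Simplified signature base covering the target host and agent identity.
signature_base = b"@authority: example.com\nsignature-agent: crawler.example"
signature = private_key.sign(signature_base)

# The origin recomputes the base and verifies; a forged or altered
# request raises InvalidSignature instead of passing silently.
public_key.verify(signature, signature_base)
```

Unlike user-agent strings or IP ranges, a signature like this can't be spoofed without the operator's private key, which is exactly the property the rotating-UA dispute above turns on.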

The debate comes as bot activity reshapes the internet. As TechCrunch has previously reported, bots seeking to scrape massive amounts of content to train AI models have become a menace, especially to smaller sites.

For the first time in the internet’s history, bot activity is currently outstripping human activity online, with AI traffic accounting for over 50%, according to Imperva’s Bad Bot report released last month. Most of that activity is coming from LLMs. But the report also found that malicious bots now make up 37% of all internet traffic. That’s activity that includes everything from persistent scraping to unauthorized login attempts."

techcrunch.com/2025/08/05/some

#AI #GenerativeAI #AITraining #Perplexity #Cloudflare #AIAgents #WebCrawling #Chatbots #LLMs

2025-08-06

AI misuse

Cloudflare accuses the AI provider #Perplexity of using undeclared crawlers to gain access to blocked websites.

Despite robots.txt prohibitions and IP blocks, Perplexity is said to be covertly reading content using rotating user agents and IP addresses.

That would be a violation of established web standards and a disregard for website owners' stated preferences.

blog.cloudflare.com/perplexity

#WebCrawling #BotTraffic #Cloudflare #WebSecurity #PerplexityAI #Chatbots

PPC Land @ppcland
2025-04-03

ICYMI: Google outlines pathway for robots.txt protocol to evolve: How the 30-year-old web crawler control standard could adopt new functionalities while maintaining its simplicity. ppc.land/google-outlines-pathw

2025-03-26

Ars Technica: Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries. “Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures—adjusting robots.txt, blocking known […]

https://rbfirehose.com/2025/03/26/ars-technica-open-source-devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

2025-02-12

Perishable Press: Ultimate Block List to Stop AI Bots. “The focus of this post is aimed at website owners who want to stop AI bots from crawling their web pages, as much as possible. To help people with this, I’ve been collecting data and researching AI bots for many months now, and have put together a ‘Mega Block List’ to help stop AI bots from devouring your content.”

https://rbfirehose.com/2025/02/12/perishable-press-ultimate-block-list-to-stop-ai-bots/
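Mechanically, such a block list is just a long series of plain robots.txt directives. A shortened sample covering a few widely documented AI crawler user agents (the full list is much longer, and compliance is voluntary):

```text
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that `Google-Extended` is a control token for AI training use only; blocking it does not affect Googlebot's search crawling.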

2025-01-25

- 📊 Optional: Markov-generated nonsense content to distort data
- 💻 Developed by programmer Aaron B. out of frustration with #Webcrawling practices
- ⚠️ Challenges: Server load, scalability, effectiveness questioned

2025-01-25

#Nepenthes: #Tool against #AI webcrawlers 🕷️

Generates self-referencing links, extends load times. Goal: Trap crawlers in endless loop. Developer warns against casual use. #AI #Webcrawling #DataPrivacy

🧵 ↓

heise.de/en/news/Nepenthes-a-t
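The trap's core trick can be sketched in a few lines (a toy Python illustration of the idea, not the actual Lua project): every page inside the trap links only to more randomly named pages inside the trap, so a crawler that blindly follows links never escapes.

```python
# Each trap page deterministically generates links to further trap pages,
# forming an effectively infinite crawl graph with no exit.
import random
import string

def trap_page(path: str, n_links: int = 5) -> str:
    rng = random.Random(path)  # same path always yields the same page
    links = []
    for _ in range(n_links):
        slug = "".join(rng.choices(string.ascii_lowercase, k=8))
        links.append(f'<a href="/trap/{slug}">{slug}</a>')
    return f"<html><body><h1>{path}</h1>{''.join(links)}</body></html>"

page = trap_page("/trap/start")  # 5 outbound links, all back into the trap
```

The real project adds deliberately slow responses and optional Markov-generated filler text, which is where the developer's warning comes in: the trap costs the host server resources too.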

2025-01-24

Hackaday: Trap Naughty Web Crawlers In Digestive Juices With Nepenthes. “More commonly known as ‘pitcher plants’, nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. […]

https://rbfirehose.com/2025/01/24/hackaday-trap-naughty-web-crawlers-in-digestive-juices-with-nepenthes/
