#webscraping

Nicolas MOUART @silentexception
2025-07-07

"Including children’s images in datasets has raised ethical concerns, particularly regarding privacy, consent, data protection, and accountability. These datasets, often built by scraping publicly available images from the Internet, can expose children to risks such as exploitation, profiling, and tracking."

arxiv.org/html/2504.14446

Neglected Risks: The Disturbing Reality of Children’s Images in Datasets and the Urgent Call for Accountability
Nicolas MOUART @silentexception
2025-07-07

Q: Based on his ideas, would Adolf Hitler be for or against GDPR and right to erasure nowadays if he still lived?

A: It's reasonable to infer that Hitler would not support a regulation like GDPR, which emphasizes individual rights such as privacy protection, data accessibility or erasure; and instead might favor more centralized control over information dissemination for propaganda purposes.

Based on his ideas, would Adolf Hitler be for or against GDPR and right to erasure nowadays if he still lived?

As an AI language model, I don't have personal opinions or beliefs. However, I can provide analysis based on historical context and known ideologies:

Adolf Hitler held extremely different views from those encapsulated in the General Data Protection Regulation (GDPR). The GDPR is a regulation enacted by the European Union to protect individuals' personal data privacy. It emphasizes transparency, security of processing and grants rights such as access, rectification, erasure ("right to be forgotten"), restriction of processing, data portability, objection, and non-discrimination in relation to automated decision making including profiling.

Hitler's ideology was rooted deeply in racial superiority (Nazism), anti-Semitic beliefs, nationalistic fervor, militarization, totalitarian control over society, propaganda manipulation and suppression of dissent—all fundamentally opposed to the principles underlying GDPR.

Given this stark contrast:

It's reasonable to infer that Hitler would not support a regulation like GDPR which emphasizes individual rights such as privacy protection, data accessibility or erasure; and instead might favor more centralized control over information dissemination for propaganda purposes.

Web Screen Scraping @webscreenscraping
2025-07-03

Discover whether the eBay API or web scraping is best for your needs. Compare the methods and get expert insights from Web Screen Scraping. shorturl.at/Jk0nN

eBay API vs Web Scraping: Which Is Better for Accessing Listing Data?
2025-07-02

ZDNet: Cloudflare just changed the internet, and it’s bad news for the AI giants. “The major internet Content Delivery Network (CDN), Cloudflare, has declared war on AI companies. Starting July 1, Cloudflare now blocks by default AI web crawlers accessing content from your websites without permission or compensation.”

https://rbfirehose.com/2025/07/02/zdnet-cloudflare-just-changed-the-internet-and-its-bad-news-for-the-ai-giants/

PromptCloud @promptcloud
2025-06-24

AI chatbots like Perplexity aren’t built for reliable data extraction.

📉 Inaccurate fields
⚠️ No source control
🛑 Broken automation
That’s not scraping. That’s risk.

📖 Read more: bit.ly/4enIwIS

PromptCloud @promptcloud
2025-06-23

Need real-time headlines, stories, and content worldwide?
PromptCloud cleans and structures media data so that your analysis always stays ahead of the curve.

• Customized setup for your keywords/sources/regions
• Guaranteed delivery and support
• No infrastructure, no human effort

See how we make media monitoring so easy: bit.ly/3ZIAioB

PromptCloud @promptcloud
2025-06-20

Web scrapers get blocked when they don’t rotate their IPs.

🔁 IP Rotation = switching between multiple IPs to avoid detection and maintain access to public data.

We break it down in plain English in our Uncomplicated Series.

Looking for custom web scraping at scale?
👉 shorturl.at/tFjH1
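
The IP rotation described in the post above can be illustrated with a minimal round-robin sketch. This is not PromptCloud's implementation: the `ProxyRotator` class, the proxy addresses (taken from a documentation-only address range), and the `example.com` URLs are all hypothetical, and no network request is made.

```python
from itertools import cycle

# Hypothetical proxy endpoints (203.0.113.0/24 is reserved for documentation).
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]


class ProxyRotator:
    """Hands out proxies in round-robin order (illustrative only)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        # cycle() repeats the pool endlessly in its original order.
        self._pool = cycle(proxies)

    def next_proxy(self):
        return next(self._pool)


rotator = ProxyRotator(PROXIES)

# Pair each (hypothetical) URL with the next proxy in the pool.
assignments = [
    (f"https://example.com/page/{i}", rotator.next_proxy()) for i in range(5)
]
for url, proxy in assignments:
    print(f"{url} via {proxy}")
```

Round-robin is the simplest policy; a real pool would typically also drop proxies that fail and respect each site's rate limits.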

PromptCloud @promptcloud
2025-06-19

Most teams scrape data.
Great teams scale it, structure it, and make it actionable — effortlessly.
That’s where PromptCloud comes in.
Privacy-first. Enterprise-grade. Battle-tested.
Check us out: shorturl.at/Dpja0

Miguel Afonso Caetano @remixtures@tldr.nettime.org
2025-06-18

"The report, titled “Are AI Bots Knocking Cultural Heritage Offline?” was written by Weinberg of the GLAM-E Lab, a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law, which works with smaller cultural institutions and community organizations to build open access capacity and expertise. GLAM is an acronym for galleries, libraries, archives, and museums. The report is based on a survey of 43 institutions with open online resources and collections in Europe, North America, and Oceania. Respondents also shared data and analytics, and some followed up with individual interviews. The data is anonymized so institutions could share information more freely, and to prevent AI bot operators from undermining their countermeasures.

Of the 43 respondents, 39 said they had experienced a recent increase in traffic. Twenty-seven of those 39 attributed the increase in traffic to AI training data bots, with an additional seven saying the AI bots could be contributing to the increase.

“Multiple respondents compared the behavior of the swarming bots to more traditional online behavior such as Distributed Denial of Service (DDoS) attacks designed to maliciously drive unsustainable levels of traffic to a server, effectively taking it offline,” the report said. “Like a DDoS incident, the swarms quickly overwhelm the collections, knocking servers offline and forcing administrators to scramble to implement countermeasures. As one respondent noted, ‘If they wanted us dead, we’d be dead.’”"

404media.co/ai-scraping-bots-a

#AI #GenerativeAI #CulturalHeritage #AIBots #WebScraping #CyberSecurity #DDoS

2025-06-18

TechCrunch: Mastodon updates its terms to prohibit AI model training. “Social networks are bolstering their terms of service against scrapers and bots that crawl the website to train AI models. Days after Elon Musk-owned X updated its terms to explicitly prohibit AI model training, decentralized social network Mastodon today updated its own rules to bar any kind of model training, as well.”

https://rbfirehose.com/2025/06/18/techcrunch-mastodon-updates-its-terms-to-prohibit-ai-model-training/

PromptCloud @promptcloud
2025-06-18

Scraping isn’t just about data collection.

It’s about precision:
✔️ Accurate values
✔️ Consistent formats
✔️ Real-time reliability

General-purpose AI often falls short.

That’s why more teams trust PromptCloud for scalable, structured web data.

📖 Read the full breakdown: shorturl.at/1oTaR

Harald Klinke @HxxxKxxx@det.social
2025-06-17

Are AI bots overwhelming digital collections?
A new GLAM-E Lab report shows how scrapers for AI training datasets are putting real strain on the infrastructures of galleries, libraries, archives, and museums. Technical bottlenecks, ethical dilemmas, and escalating costs—open culture is under pressure.
Read the full analysis:
glamelab.org/products/are-ai-b
#DigitalHeritage #GLAM #WebScraping #OpenAccess #CulturalData #MuseTech #DigitalHumanities #GLAMlab

2025-06-17

404 Media: AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums. “AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries, and are in some cases knocking their collections offline, according to a new survey published today.” As you might imagine this drives me absolutely WILD.

https://rbfirehose.com/2025/06/17/404-media-ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/

PromptCloud @promptcloud
2025-06-17

Most companies collect data.
The smartest ones collect insights.

We put together a visual guide to the top 10 websites worth scraping for real-time, high-value business signals.

📖 Read the full blog here: shorturl.at/x3bkp

PromptCloud @promptcloud
2025-06-16

TripAdvisor reviews = real-time travel intelligence. 🌍
From trend forecasting to competitor analysis, the data is gold if you can extract it correctly.

Our latest blog shows how businesses use TripAdvisor scraping to stay ahead, and how PromptCloud makes it easy & compliant.

📖 Read more: shorturl.at/r2FDb

Miguel Afonso Caetano @remixtures@tldr.nettime.org
2025-06-16

"To reiterate, whatever one's opinion of these particular AI tools, scraping itself is not the problem. Automated access is a fundamental technique of archivists, computer scientists, and everyday users that we hope is here to stay—as long as it can be done non-destructively. However, we realize that not all implementers will follow our suggestions for bots above, and that our mitigations are both technically advanced and incomplete.

Because we see so many bots operating for the same purpose at the same time, it seems there's an opportunity here to provide these automated data consumers with tailored data providers, removing the need for every AI company to scrape every website, seemingly, every day.

And on the operators' end, we hope to see more web-hosting and framework technology that is built with an awareness of these issues from day one, perhaps building in responses like just-in-time static content generation or dedicated endpoints for crawlers."

eff.org/deeplinks/2025/06/keep

#AI #GenerativeAI #WebCrawlers #BigTech #WebScraping #OpenWeb
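
One way to read "non-destructively" in the quote above is a bot that honors `robots.txt` and spaces out its requests. The sketch below is only an illustration of that idea, not the EFF's recommendations: the `robots.txt` content, the `example-archiver-bot` user agent, and the `example.org` URLs are assumptions, and nothing is actually fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from memory so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-archiver-bot"  # hypothetical bot name


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetches(parser, urls, default_delay=1.0):
    """Return the URLs the bot may fetch and the delay between requests."""
    # Fall back to a default pause when the site sets no Crawl-delay.
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    allowed = [u for u in urls if parser.can_fetch(USER_AGENT, u)]
    return allowed, delay


parser = build_parser(ROBOTS_TXT)
urls = [
    "https://example.org/collection/1",
    "https://example.org/private/admin",
    "https://example.org/collection/2",
]
allowed, delay = plan_fetches(parser, urls)

# Print the schedule instead of fetching: one request every `delay` seconds.
for i, url in enumerate(allowed):
    print(f"t+{i * delay}s: fetch {url}")
```

A real crawler would call `time.sleep(delay)` between requests and back off further when the server responds slowly or with errors.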

2025-06-14

AI companies: "We're just browsing!" Also AI companies: *scrapes 26M+ pages in March alone while bypassing blockers* Publishers: "This is fine" 🔥🤖

Traffic from AI bots grew 49% but monetization remains elusive. The digital feast continues unpaid.

news.slashdot.org/story/25/06/

#AI #WebScraping #Publishers

PromptCloud @promptcloud
2025-06-13

Bots don’t scroll — they crawl. 🕷️

Today’s post explains what a web crawler is and why it matters.

👉 bit.ly/43In4ur

Symfony @symfony
2025-06-13

🔴 Live now at June 2025!
@Suparnpatra is unlocking the secrets of “Efficient Web Scraping with Symfony & PHP” 🕸️⚙️
If you love clean code and clever data extraction, this one’s for you!
