#robotstxt

𝕂𝚞𝚋𝚒𝚔ℙ𝚒𝚡𝚎𝚕 (kubikpixel@chaos.social)
2025-12-12

»RSL 1.0 (Really Simple Licensing) instead of robots.txt: a new standard for internet content.
A new standard for protecting content on the web. RSL is backed by players such as publishers and the advertising industry.«

I'm only hearing about this now. Let's see how useful it turns out to be and whether it protects web content in general or, once again, only commercial data.

👉 heise.de/news/RSL-1-0-Standard

#rsl #copyright #robotstxt #webdev #web #realsimplelicensing #webstandards #standart #internet #verlage #werbung

Le site de Korben (korben.info@web.brid.gy)
2025-12-11

RSL 1.0 - The time has come for AIs to pay up

fed.brid.gy/r/https://korben.i

We live in wonderful times (not). On one side, 5.6 million websites now block OpenAI's GPTBot (https://www.theregister.com/2025/12/08/publishers_say_no_ai_scrapers) and 5.8 million block ClaudeBot (https://www.cloudflare.com/press/press-releases/2025/cloudflare-just-changed-how-ai-crawlers-scrape-the-internet-at-large/), while on the other side 13.26% of AI bots could not care less about robots.txt (https://www.webpronews.com/cloudflares-2025-robots-txt-update-blocks-ai-scraping-adds-pay-per-crawl/). Webmasters are all in shock, planting "Private property - no AI" signs everywhere... and believe me, it achieves strictly nothing!

Still, some very smart people are looking into the subject, and yesterday a new standard came out to put an end to this comedy. It is called Really Simple Licensing (RSL) 1.0 (https://rslstandard.org/press/rsl-1-specification-2025) and it proposes something radical: stop blocking and start charging! Yum!

Concretely, it is a small text file that turns "go away" into a commercial negotiation. Because, as we all know, the problem with robots.txt is that it is like politely asking burglars not to break into your house. That worked in 1994, when the web was […]
2025-12-11

The Register: Publishers say no to AI scrapers, block bots at server level . “Online traffic analysis conducted by BuiltWith, a web metrics biz, indicates that the number of publishers trying to prevent AI bots from scraping content for use in model training has surged since July. About 5.6 million websites presently have added OpenAI’s GPTBot to the disallow list in their robots.txt file, up […]

https://rbfirehose.com/2025/12/11/the-register-publishers-say-no-to-ai-scrapers-block-bots-at-server-level/

The New York Times sues Perplexity for producing ‘verbatim’ copies of its work – The Verge

Credit: The New York Times, gettyimages-2249036304

The NYT alleges Perplexity ‘unlawfully crawls, scrapes, copies, and distributes’ work from its website.

by Emma Roth, Dec 5, 2025, 7:42 AM PST. Emma Roth is a news writer who covers the streaming wars, consumer tech, crypto, social media, and much more. Previously, she was a writer and editor at MUO.

The New York Times has escalated its legal battle against the AI startup Perplexity, as it’s now suing the AI “answer engine” for allegedly producing and profiting from responses that are “verbatim or substantially similar copies” of the publication’s work.

The lawsuit, filed in a New York federal court on Friday, claims Perplexity “unlawfully crawls, scrapes, copies, and distributes” content from the NYT. It comes after the outlet’s repeated demands for Perplexity to stop using content from its website, as the NYT sent cease-and-desist notices to the AI startup last year and most recently in July, according to the lawsuit. The Chicago Tribune also filed a copyright lawsuit against Perplexity on Thursday.

The New York Times sued OpenAI for copyright infringement in December 2023, and later inked a deal with Amazon, bringing its content to products like Alexa.

Perplexity became the subject of several lawsuits after reporting from Forbes and Wired revealed that the startup had been skirting websites’ paywalls to provide AI-generated summaries — and in some cases, copies — of their work. The NYT makes similar accusations in its lawsuit, stating that Perplexity’s crawlers “have intentionally ignored or evaded technical content protection measures,” such as the robots.txt file, which indicates the parts of a website crawlers can access.

Perplexity attempted to smooth things over by launching a program to share ad revenue with publishers last year, which it later expanded to include its Comet web browser in August.

“By copying The Times’s copyrighted content and creating substitutive output derived from its works, obviating the need for users to visit The Times’s website or purchase its newspaper, Perplexity is misappropriating substantial subscription, advertising, licensing, and affiliate revenue opportunities that belong rightfully and exclusively to The Times,” the lawsuit states.

Continue/Read Original Article Here: The New York Times sues Perplexity for producing ‘verbatim’ copies of its work | The Verge

#AI #artificialIntelligence #Copyright #Crawlers #Distribution #Lawsuit #NYTWork #OpenAI #Perplexity #RobotsTxt #Scrapping #Sues #TheNewYorkTimes #TheVerge #VerbatimCopies

2025-12-07

How a web crawler is supposed to work:

1. Reads /robots.txt
2. Parses robots.txt and honors User-Agent | Allow / Disallow designations
3. Returns periodically to retrieve permitted content

How AI/LLM training crawlers work:

1. Crawls entire website
2. Reads /robots.txt
3. Returns 10 minutes later
4. GOTO 1.
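For reference, the "supposed to" flow maps almost directly onto Python's standard-library robots.txt parser. A minimal sketch of a polite crawler (site URL and user-agent string are placeholders):

import time
import urllib.request
import urllib.robotparser

AGENT = "ExampleCrawler"                       # placeholder user-agent
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()                                      # 1. read /robots.txt

url = "https://example.com/private/index.html"
if rp.can_fetch(AGENT, url):                   # 2. honor User-agent / Allow / Disallow
    req = urllib.request.Request(url, headers={"User-Agent": AGENT})
    body = urllib.request.urlopen(req).read()

time.sleep(rp.crawl_delay(AGENT) or 3600)      # 3. come back later, not ten minutes later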

#AI #LLM #webCrawlers #robotsTxt 🔹

Sandstein NM (sandstein_nm)
2025-12-01

robots.txt – small file, big impact

When search engines visit your website, the first thing they look at is robots.txt. It tells them what should be crawled – and what should not.

This helps, for example, to:
• make important pages discoverable more quickly
• exclude unimportant areas
• keep search-engine crawls efficient

We explain the most important rules and show examples from SEO practice.
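A minimal example of the kind of rules meant here (the paths and sitemap URL are purely illustrative):

# example only: paths and sitemap URL are placeholders
User-agent: *
Disallow: /tmp/
Disallow: /private/
Allow: /private/press.html
Sitemap: https://example.com/sitemap.xml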

More on this in the blog:
🔗 t1p.de/btwdi

🦸 Your web app (hello Wiki.js) has no robots.txt? Don't panic! 🚨

When a solution offers no built-in robots.txt handling, NPM comes to the rescue! 📦

Find out how a simple module can serve this essential file directly, saving the day and making sure search engines and other bots respect your rules.

➡️ The NPM lifesaver is here: wiki.blablalinux.be/fr/gestion

#NPM #robotsTxt #WikiJS #OutilCLI #Dépannage

🤖 STOP nosy AI bots on your WordPress site! 🚫

Fed up with AIs coming to sniff around and loot your content? Protect your secret digital garden! 🤫

Find out how a simple robots.txt file lets you politely tell the AI vacuum cleaners "no thanks". It's easy, it's effective, and it's a little scary for Skynet! 😉
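The kind of entries meant here, using two crawlers named elsewhere in this thread (which bots you list, and whether to block them site-wide, is entirely up to you):

# example policy: block these two crawlers everywhere
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /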

➡️ The how-to is here: wiki.blablalinux.be/fr/robots-

#WordPress #AntiIA #robotsTxt #SécuritéWeb #BlablaLinux

Virebent (virebent)
2025-11-28

📝 New article: Why We Reject Google: Our Anti-Surveillance SEO Policy

An in-depth look at why Virebent.art deliberately blocks Google and other surveillance-based crawlers, and our strategy for visibility in a privacy-first web.

🔗 virebent.art/blog/seo-policy.h

Inautilo (inautilo)
2025-11-14


Rate-limiting requests with Nginx · An alternative approach to counter AI crawlers ilo.im/168axr
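A bare-bones illustration of what that can look like in an Nginx config (zone name, rate, and burst values here are arbitrary):

# in the http {} context
limit_req_zone $binary_remote_addr zone=crawlers:10m rate=1r/s;

# in the relevant server {} or location {} block
location / {
    limit_req zone=crawlers burst=10 nodelay;
}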

_____

Claudio Pires (claudiocamposp)
2025-11-09

How to Remove Robots.txt File from WordPress? 🤖📝❌ youtube.com/watch?v=Wv97WVRK3qw 🎬

2025-11-07

Fixing the robots.txt problem for a Docker container behind a reverse proxy: add a location ~* /robots\.txt$ block to the proxy configuration file to block bots and indexers. #docker #reverseproxy #robotstxt #selfhosted #máy_chủ #đổi_proxy #tài_liệu_đại_chung
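One way such a block can look when the proxy is Nginx (the policy served below, blocking everything, is just an example):

location ~* /robots\.txt$ {
    default_type text/plain;
    # example policy only: disallow all crawlers
    return 200 "User-agent: *\nDisallow: /\n";
}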

reddit.com/r/selfhosted/commen

2025-11-05

Breaking AI models with HTML comments: 250 are enough

AI scrapers have a critical weakness: they harvest even the links hidden inside HTML comments. This post covers recent research showing that just 250 manipulated documents are enough to cripple a large language model, along with practical countermeasures.

aisparkup.com/posts/6165

2025-10-22

The web-infrastructure revolt against Google's AI search: why Cloudflare changed the robots.txt of 3.8 million websites

After Google's AI summaries cut website traffic by as much as 50%, Cloudflare struck back by updating the robots.txt of 3.8 million domains. This post looks at the new web standard that separates search from AI summaries, and what it means.
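In rough outline, the idea is extra machine-readable preference lines inside robots.txt. A sketch in the spirit of Cloudflare's content-signals update (treat the field names, values, and placement as assumptions to be checked against Cloudflare's published policy):

# assumed syntax; verify against Cloudflare's content signals documentation
User-agent: *
Content-Signal: search=yes, ai-input=no, ai-train=no
Allow: /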

aisparkup.com/posts/5729

2025-10-21

The robots.txt standard turned 30 last year. But is it still relevant in a world filled with AI bots, site scrapers, and other dubious bots?

plagiarismtoday.com/2025/10/21

#AI #RobotsTxt #Scraping

Inautilo (inautilo)
2025-10-21


Farewell to robots.txt (1994-2025) · “You were too good for this world.” ilo.im/167q2b

_____

2025-10-16

Why has robots.txt's polite notice to the ever-present web crawlers about data collection on your own website outlived its usefulness?

How are the AI players turning the web from a collaborative space into a pure information-extraction zone?

This is told in an entertaining and very vivid way in "Nachruf: Abschied von robots.txt (1994-2025)" by @heiseonline:

🌍 👉 heise.de/hintergrund/Nachruf-A

#robotsTXT #KI #Webentwicklung #Datenhoheit #heise

A symbolic image of a robots.txt file with the following fictional content:

User-agent: Crawler1
Disallow: /

User-agent: Crawler2
Disallow: /

User-agent: *
Disallow: /default.html
Disallow: /tmp
Disallow: /private/index.html
