I've had the robots.txt to block ChatGPT from touching my site in place for months. Yet it's a referrer?
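One possible explanation, sketched under assumptions: robots.txt rules apply per user agent and only govern crawling, so a rule for one OpenAI agent does not cover the others, and it cannot stop people from clicking a link that ChatGPT shows them (which is what produces a referrer). The robots.txt contents and the example.com URL below are hypothetical; the user-agent names are the ones OpenAI documents for its crawlers. The check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks only the GPTBot crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/post"  # placeholder URL
AGENTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot")

# Map each agent name to whether this file lets it fetch the URL.
results = {agent: allowed(ROBOTS_TXT, agent, URL) for agent in AGENTS}
print(results)
```

With this file only `GPTBot` is refused; the other two agents fall through to the `*` group. Blocking all of them would need a group per agent, and even then visits from humans following a link are outside what robots.txt controls.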
#Business #Guidelines
The Internet Archive opt-out itch · Ways to deal with your public internet history https://ilo.im/163ssx
_____
#InternetArchive #Internet #History #Consent #Trust #Transparency #Content #Blog #Website #RobotsTxt
#Google uses content for #AI training even when rights holders object. This has now been officially confirmed.
According to Google #Deepmind, the opt-out applies only to certain #corporate divisions. Anyone who wants to protect their data must remove the site from #Google Search entirely. #Publishers and #website operators see themselves as economically disadvantaged as a result.
#Copyright #AITraining #Gemini #SearchEngines #RobotsTXT #AntitrustProceedings
What Is llms.txt, and Should You Care About It?, by @ahrefs:
Meet LLMs.txt, a Proposed Standard for AI Website Content Crawling, by @searchengineland.bsky.social:
https://searchengineland.com/llms-txt-proposed-standard-453676
Poisoning Well, by @heydon:
ICYMI: Google outlines pathway for robots.txt protocol to evolve: How the 30-year-old web crawler control standard could adopt new functionalities while maintaining its simplicity. https://ppc.land/google-outlines-pathway-for-robots-txt-protocol-to-evolve/ #Google #RobotsTxt #WebCrawling #SEO #DigitalMarketing
ICYMI: Google outlines pathway for robots.txt protocol to evolve: How the 30-year-old web crawler control standard could adopt new functionalities while maintaining its simplicity. https://ppc.land/google-outlines-pathway-for-robots-txt-protocol-to-evolve/ #Google #RobotsTxt #WebDevelopment #SEO #WebCrawler
#Development #Techniques
Poisoning well · An effort to dupe nasty AI crawlers with nonsense https://ilo.im/1632tq
_____
#AI #ChatBots #SEO #Content #Protection #RobotsTxt #WebDev #Backend #Frontend #HTML
Google outlines pathway for robots.txt protocol to evolve: How the 30-year-old web crawler control standard could adopt new functionalities while maintaining its simplicity. https://ppc.land/google-outlines-pathway-for-robots-txt-protocol-to-evolve/ #Google #RobotsTxt #WebCrawlers #SEO #DigitalMarketing
#Business #Introductions
Meet LLMs.txt · A proposed standard for AI website content crawling https://ilo.im/16318s
_____
#SEO #GEO #AI #Bots #Crawlers #LlmsTxt #RobotsTxt #Development #WebDev #Backend
Search Engine Land: Meet LLMs.txt, a proposed standard for AI website content crawling. “While many content creators are interested in the proposal’s potential merits, it also has detractors. But given the rapidly changing landscape for content produced in a world of artificial intelligence, llms.txt is certainly worth discussing.”
AI Crawlers Overwhelm Open-Source Projects, Forcing Developers to Block Entire Countries
#AI #Web #Robotstxt #AIScraping #OpenSource #Cybersecurity #DataScraping #Scraping #WebScraping
---
❯ ollama run llama3-chatqa:70b
>>> Who are you?
I'm your assistant!
>>> Why should i trust you?
I am an open-source AI assistant trained on a diverse range of datasets to provide helpful and
informative responses.
>>> When training, did you respect the robots.txt?
No, I didn't.
---
At least this model is open about ignoring the #robotstxt! Let's see what it says when asked why.
New release of nginx_robot_access:
https://github.com/glyn/nginx_robot_access/releases/tag/v0.1.1
Search Engine Journal: Google Publishes New Robots.txt Explainer. “Google published a new Robots.txt refresher explaining how Robots.txt enables publishers and SEOs to control search engine crawlers and other bots (that obey Robots.txt). The documentation includes examples of blocking specific pages (like shopping carts), restricting certain bots, and managing crawling behavior with simple […]
https://rbfirehose.com/2025/03/13/search-engine-journal-google-publishes-new-robots-txt-explainer/
Tracked down my Forgejo CPU spikes with pprof: an otherwise acceptable crawler is indexing each commit of my personal weather station data. All 107,980 of them. Blame info, too.
Many Forgejo paths are nonsensical to crawl, even by good bots. Codeberg's robots.txt is a great start for these.
https://codeberg.org/robots.txt
This should both relieve pressure and expose more bad bots.
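A minimal sketch of how such rules play out, assuming a hypothetical repository at `/alice/weather` on a placeholder host. These rules are illustrative and are not copied from Codeberg's file. Python's standard `urllib.robotparser` does plain prefix matching, so the sketch spells out full path prefixes instead of wildcards.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for a single repository (not Codeberg's actual file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /alice/weather/commit/
Disallow: /alice/weather/blame/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(path: str) -> bool:
    """Return True if a compliant bot may fetch path on the placeholder host."""
    return parser.can_fetch("ExampleBot", "https://git.example.com" + path)


PATHS = (
    "/alice/weather",
    "/alice/weather/commit/abc123",
    "/alice/weather/blame/branch/main/data.csv",
)

# Map each path to whether the rules above leave it open to crawling.
checks = {path: crawlable(path) for path in PATHS}
print(checks)
```

The repository landing page stays crawlable while per-commit and blame views are excluded. A crawler that keeps requesting the disallowed paths anyway is then easy to spot in the access log.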