Lmst

You are making a mess of things! Many of our access logs are now more than ten times the size they were a week ago!

I have upgraded our #abuse detection system accordingly and placed #GPTBot in the penalty box.

This results in better abuse detection in general. For that I thank you. It has also already resulted in #IPblocks for a dozen of your abusive IPs.

I can see the load diminishing on the server now..

2/2

#badbot #openai

#GPTBot still accounts for about 20% of the requests to this Mastodon instance, but the crawler now only gets nonsense generated by #Iocaine. That drastically reduces the amount of data we serve to it and completely ruins the quality of our dataset for it.

So it helps us save costs, makes the LLMs worse, and is wickedly fun on top of that! Win-win-win! :KritischerTreffer:

#MastoAdmin #OpenAI

A screenshot of an access.log analysis by goaccess. Lots of text; the most important line shows 51739 requests, 20.60% of them from 2 visitors that report the user agent GPTBot/1.2.

Screenshot of the access.log grepped for "GPTBot". One of the requests goes to a very long URL made up of nonsense terms. The very long log line follows:

20.171.207.0 - - [26/May/2025:22:01:49 +0200] "GET /@dnddeutsch/Languid-specks/Naertho-snapping/Shardie-mates/atumble/gentlyand-news/Reader-chargers/labyrinthine/Pawing-kindling/bled-movementgood/forward/WASTERDEEP/mainland-Curses/saidZaraela/job/PALACE/Yonder-REMAIN/SUTHOOL/arsenal/REGARDED-hie/terrain/SAYING/slightly-unsteadily/shrinking-Duskene/Have/mummified-prowled/bazaar-opinion/identity/intelligence-noisilyby/story/courtierand-adopted/overweary/stubbyfingered-Bravo/gloved/smoked-weakness/COLLECTION-saggingbut/Stink/Magraths-forbade/thammarchs-aheadto/frustrationfilled-trodden/Halllet/reassembled/slayersforhire/nowperhaps-CHUCKLE/maws-roadfrom/bend/started-Talontarto/bored/Endhaltestelle-Berggipfeln/gestreut-Wahre/stob/Gruppenzusammenhalt-Weile/rammst/Ersch%C3%B6pfungsstufen-K%C3%B6nigseichen/steigt/ersonnen/PionierProfils-fasten/share/autarker/morgen/niederduckte/Mystics/Aber/wiederkehrten-Wirtschaftsgeb%C3%A4uden/ragt/breitschultriges-erheben/kaltbl%C3%BCtiger/Leblos-Wurfhaken/A/Kometenschweif/ HTTP/2.0" 200 979 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"

Screenshot of the output of curl -H "User-agent: GPTbot" on the URL from the access log. It is HTML-formatted text; one sentence is highlighted:

"Und geschwätzig bei denselben Gelegenheiten, ob sie wirklich den Tod erlitten hatte, da schläferte man das Kind so heftig auf mich zu ihm gelangen. Jetzt setzte sich zu befreien, der sie wahrscheinlich auffressen trotz allen."

This is -ing unbelievable:
In the 17 hours my "Discworld Ólyfjan" Iocaine has been running, GPTBot has downloaded the same 84 pages over 10000 times. They don't even change!

And Google has it on the search index: "Ólyfjan" [name of any discworld character]
has results.

HEX, the Bursar, even the troll Brick would be more intelligent than that...

#iocaine #aipoisoning #gptbot #chatgpt #discworld

Garbage results delivered by Google search for "Ólyfjan" Samuel Vimes

One of the things that annoys me the most is that the scraper that went furthest into the tarpit (83 links deep) is also the one who comes back reading the same pages again and again:

{host="olyfjan.blomi.is",user_agent="Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)",user_agent_group="GPTBot"} has sent 6991 GET requests, for the same 84 pages, downloading 22779416 bytes.

#gptbot #aipoisoning #iocaine
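
Figures like the ones in the post above (request count, distinct pages, bytes served) can also be pulled straight from a web server access log. What follows is only a rough sketch: it assumes the log uses the Combined Log Format seen in the GPTBot log line quoted earlier in this feed, and the `summarize` helper and the `SAMPLE` lines are made up for illustration, not taken from anyone's server.

```python
import re

# Combined Log Format (assumed):
# ip - - [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

def summarize(lines, needle="GPTBot"):
    """Return (requests, distinct_paths, bytes) for user agents containing `needle`."""
    requests = 0
    paths = set()
    total_bytes = 0
    for line in lines:
        match = LOG_RE.match(line)
        if not match or needle not in match.group("ua"):
            continue  # unparseable line or a different client
        requests += 1
        paths.add(match.group("path"))
        if match.group("size") != "-":
            total_bytes += int(match.group("size"))
    return requests, len(paths), total_bytes

# Hypothetical sample data, shaped like the log line quoted in this feed.
GPTBOT_UA = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
             "compatible; GPTBot/1.2; +https://openai.com/gptbot)")
SAMPLE = [
    f'20.171.207.0 - - [26/May/2025:22:01:49 +0200] "GET /a/ HTTP/2.0" 200 979 "-" "{GPTBOT_UA}"',
    f'20.171.207.0 - - [26/May/2025:22:01:50 +0200] "GET /a/ HTTP/2.0" 200 979 "-" "{GPTBOT_UA}"',
    f'20.171.207.0 - - [26/May/2025:22:01:51 +0200] "GET /b/ HTTP/2.0" 200 100 "-" "{GPTBOT_UA}"',
    '203.0.113.5 - - [26/May/2025:22:01:52 +0200] "GET /a/ HTTP/1.1" 200 500 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

requests, distinct, served = summarize(SAMPLE)
print(f"{requests} requests for {distinct} distinct pages, {served} bytes")
```

A large gap between the request count and the number of distinct pages is the "same pages again and again" pattern described in the post.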

“Since launching my GPT bot & Carrd site with AIFlowServices, I’ve tripled my leads.”
– Jasmine R., Marketing Coach
#aiflowservices #aiautomation #automatioworkflow #gptbot #carrdsite #tripledleads

Markov Tarpits: An Evolving Strategy Against #AI Crawlers

AI web crawlers like #GPTBot, #ClaudeBot, #Amazonbot, and others have become frequent visitors across the web. While gathering web content to power #LLMs, they now represent a significant portion of website traffic—in one case, reaching nearly 70% of total web requests.

As a direct response from the community, some developers have recently revived the tarpit #technology against AI web crawlers.

⚒️ https://oxylabs.io/blog/markov-tarpits-vs-ai-crawlers
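
For readers wondering what the "Markov" part of such a tarpit means, here is a minimal, hedged sketch of a first-order word-level Markov chain that babbles plausible-looking nonsense from a seed text. It is not how Iocaine or any other specific tarpit is implemented; `build_chain`, `babble`, and the tiny corpus are hypothetical names and data chosen for illustration.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in `text`."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)

def babble(chain, length, seed=None):
    """Generate `length` words (length >= 1) by walking the chain at random."""
    rng = random.Random(seed)
    starts = sorted(chain)  # sorted so a fixed seed gives a fixed result
    word = rng.choice(starts)
    out = [word]
    while len(out) < length:
        followers = chain.get(word)
        # Dead end (word only appeared at the very end of the corpus): restart anywhere.
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)

# Hypothetical seed corpus; a real tarpit would use a much larger text.
CORPUS = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(CORPUS)
print(babble(chain, 20, seed=1))
```

Because every word is picked only from words that really followed the previous one, the output looks locally grammatical while meaning nothing, which is the property these tarpits rely on.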

analyzing logs from yesterday, it looks like #GPTBot ALONE (among other #crawlers) has been making several requests per second to my server FOR THE WHOLE DAY (during which I wasn't able to access my server, unfortunately). It constantly sent requests to the same page, over and over again, until I was able to block it.

#openia #boots

#GPTBot is used to make our generative #AI foundation models more useful and safer. It is used to crawl content that may be used to train our generative AI foundation models. Disallowing GPTBot indicates that a site's content should not be used.

Well, on my end, it's blocked 😉
Sh*tty robots 😎

52.230.152.0/24
52.233.106.0/24
20.171.206.0/24
20.171.207.0/24
4.227.36.0/25
172.182.193.160/28

https://platform.openai.com/docs/bots/
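
A small sketch of how a list like this could be checked in code, using Python's standard `ipaddress` module. The ranges are copied from the post above and may well be out of date (the linked OpenAI page is the place to check); `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# CIDR ranges as listed in the post above; treat them as a snapshot, not as current.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "52.230.152.0/24",
        "52.233.106.0/24",
        "20.171.206.0/24",
        "20.171.207.0/24",
        "4.227.36.0/25",
        "172.182.193.160/28",
    )
]

def is_blocked(addr):
    """Return True if `addr` falls inside any of the listed ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in BLOCKED_NETWORKS)

# 20.171.207.0 is the client address from the log line quoted earlier in this feed.
print(is_blocked("20.171.207.0"))
print(is_blocked("192.0.2.1"))
```

In practice the blocking would usually happen in the firewall or web server rather than in application code; the check above is mainly useful for log analysis.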

@baldur nods in agreement at my current employer we had to block #OpenAI's entire IP ranges as they literally #DDoS'd a #customer with spoofed #UserAgent(s) [instead of using #GPTbot]…

It's really fucking annoying!

@khobochka guess why I maintain a #Scraper #blocklist?

In fact I know multiple people and organizations that decide to basically redirect #ValueRemoving #Scrapers like #GPTbot, #ByteSpider (which literally #DDoS'd #MattKC because #ClownFlare are a criminally incompetent #RogueISP!) to #Hetzner's 10GB Speedtest file which can be found at http://hil-speed.hetzner.com/10GB.bin as an extra middlefinger!

#Cloudflare #hetznered #ByteDance #ChatGPT

Now #OpenAI's rabid scraper bot #GPTBot is getting stuck in an endless URL concatenation loop again, this time it's on the principia-web forums. It's been going ever since last night.

I have no idea how you can mess up a crawler bot this badly, but I guess nobody cares if it wreaks havoc. Onto the shitlist it goes.

Excerpt from access logs showing GPTBot getting stuck trying to request increasingly longer URLs.

Apart from everything else GPTBot is brutal on servers. Block that bad baby and block it good.

Info here (IP addresses and full user-agent string):
https://platform.openai.com/docs/bots

#gptbot

Just noticed, 18758 and counting requests from GPTBot (https://platform.openai.com/docs/gptbot) in the last two days on https://mirrors.sahilister.in/

#mirrors #openai #gptbot

Screenshot of terminal showing response of `grep -r "GPTBot" * | wc -l` command which is 18758.

Lol. #Amazon is (as far as I know) very fond of using #AI-based text generators such as #ChatGPT, but according to Amazon's robots.txt ( https://amazon.com/robots.txt ) the #ChatGPT bot named #GPTBot is not supposed to crawl Amazon's website (Disallow: /), and so is not to be trained on Amazon's data!
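
To illustrate what such a `Disallow: /` rule means for a crawler that honors robots.txt, here is a hedged sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` content below is a made-up minimal example, not a copy of Amazon's actual file, and `example.com` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() takes the crawler's product token, not the full User-Agent header.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/some/page")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/some/page")
print(gptbot_allowed, other_allowed)
```

Note that robots.txt is only a request: it works for crawlers that choose to obey it, which is why several posts in this feed fall back on IP blocks instead.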

When #creatives want to exclude their intellectual property from the #TrainingData of various AI models, the undertaking quickly turns into a tedious lifelong task. In my first blog post on this topic you'll learn how to lock #OpenAI's #GPTBot (DALL-E) out of your #website, or parts of it, and how to use OpenAI's opt-out form to "remove" works from the training data.
#KI #AI #kuenstler #designer #fotografen #kreative #kunst
https://teufelswerk.net/kuenstler-designer-autoren-und-urheber-aufgepasst-teil-1-so-kannst-du-deine-werke-aus-den-ki-trainingsdaten-von-open-ai-dall-e-entfernen/

OpenAI, known for ChatGPT, is developing a web search engine powered by Microsoft's Bing called GPTBot. While it may not challenge Google immediately, it offers potential for organic traffic and brand awareness. Despite skepticism due to past failures of rivals like Bing, OpenAI's venture into search is noteworthy. Time will tell its impact in the competitive search landscape.

#OpenAI #GPTBot #WebSearch #Google #MicrosoftBing #AI

Major #news #publishers block the #bots as #ChatGPT starts taking #LiveNews – Independent Publishers Alliance urges members to block #GPTBot and #GoogleBard #crawler ASAP
via @Techmeme
https://pressgazette.co.uk/platforms/chatgpt-publishers-news-bing-google/

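🌗 26% of the top 100 websites are now blocking GPTBot
➤ 12 popular websites are now blocking GPTBot, while Foursquare made a major reversal.
✤ https://searchengineland.com/more-popular-websites-blocking-gptbot-432531
A recent analysis shows that at least 26 of the top 100 most popular websites and 242 of the top 1,000 are now blocking GPTBot, a web crawler launched by OpenAI, a 250% increase since last month. More sites may be blocking GPTBot because they do not want OpenAI crawling their data to train its models, at least not without some form of compensation.
+ This is an interesting development: as more and more websites block GPTBot, OpenAI's model training may be affected.
+ Whether blocking GPTBot is a good idea is worth discussing, since it may affect SEO and site traffic.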
#SEO #網站封鎖 #GPTBot

New York Times, CNN and Australia's ABC #block #OpenAI's #GPTBot #web #crawler from accessing #content – Chicago Tribune and Australian newspapers the Canberra Times and Newcastle Herald also appear to have disallowed web crawler from maker of #ChatGPT
https://www.theguardian.com/technology/2023/aug/25/new-york-times-cnn-and-abc-block-openais-gptbot-web-crawler-from-scraping-content

Is it legitimate to harvest web pages to train AI?

https://www.bortzmeyer.org/collecte-pour-l-ia.html

#IA #ChatGPT #GPTBot #LLM #Copilot #AddOtherHashtags

#GPTBot
