#LLaVA

2025-11-10

Sorting through my photo archive

I built search over my personal photo archive using three neural networks, the vector extension for PostgreSQL, and Django.

habr.com/ru/articles/963874/

#python #django #torch #pgvector #transformers #gigaembeddings #mistral #llava
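The linked article is not reproduced here, but as a rough idea of the kind of setup it describes (embeddings stored in PostgreSQL via pgvector and queried from Django), here is a minimal sketch. The model layout, field names, and vector size are assumptions for illustration, not taken from the article.

```python
# Minimal sketch: semantic photo search with Django + pgvector.
# Assumes the pgvector extension is installed in PostgreSQL and the
# "pgvector" Python package is available. Names and dimensions are
# illustrative only, not taken from the linked article.
from django.db import models
from pgvector.django import VectorField, CosineDistance


class Photo(models.Model):
    path = models.CharField(max_length=512)
    caption = models.TextField(blank=True)    # e.g. drafted by a vision model
    embedding = VectorField(dimensions=768)   # e.g. a text/image embedding


def search(query_embedding, limit=20):
    """Return the photos whose embeddings are closest to the query vector."""
    return (
        Photo.objects
        .annotate(distance=CosineDistance("embedding", query_embedding))
        .order_by("distance")[:limit]
    )
```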

2025-10-31
@Ralf S. Mind you, the background here is not image analysis or image interpretation.

The background is rather accessibility as it is demanded on Mastodon. I'm not on Mastodon myself, as you have surely noticed by now. But when my image posts reach Mastodon, and they do, they have to be accessible if only because otherwise it would cost me even more reach than it already does.

Now, I'm not someone who aims for only the bare minimum. Instead, I have looked into the topic of image descriptions and alt-text in depth. There are plenty of publications about it online; I have summarised quite a few of them in the still-growing wiki on the topic on my Hubzilla channel.

However, those publications don't address the actual conditions in the Fediverse: neither Mastodon's very particular culture, which it tries to impose on the rest of the Fediverse, nor the specific wishes of at least some Mastodon users, nor the technical possibilities in the Fediverse outside Mastodon, e.g. posts with practically no character limit.

So I also had to keep a watchful eye on what is happening regarding alt-texts and image descriptions, particularly on Mastodon. I would like to discuss the topic on a larger scale with as many members of different user groups as possible at the same time. But all the people with whom such a discussion would make sense are only on Mastodon, and Mastodon is technically completely unsuitable for this kind of discussion. And in the Fediverse outside Mastodon, where the technical prerequisites for such discussions exist (Friendica, Hubzilla, (streams), Forte, Lemmy, Mbin, PieFed, NodeBB etc.), the topic is practically unknown.

Even if I simply shout "into the void", as is customary on Mastodon because there is no other way there, nothing comes of it. Curiously, as a non-Mastodon user with around 700+ follower connections, I have far less reach than many a Mastodon user with 300 followers. Polls don't help either; often fewer people vote than I have given options.

So when describing my images, I have to work from six assumptions, which I have already laid out in this so far completely ignored post:

  1. My audience doesn't consist only of those who follow me, but of everyone who could theoretically see my posts.
  2. If I mention that something is in one of my images, I also have to describe what it looks like.
  3. Image descriptions must immediately provide all the information that anyone out there might possibly need. Having to ask about a detail in an image or for an explanation of an image is just as bad as having to ask for an alt-text in the first place.
  4. Someone out there may be interested in even the tiniest details in my images. And that person may be blind or visually impaired.
  5. Any text within the borders of an image must always be transcribed 100% verbatim. Even if the text is illegible or so small that it is invisible. If I know what it says, I have to transcribe it.
  6. All images need an accurate and sufficiently detailed actual alt-text. Even if I describe an image in 60,000 characters in the post itself, I can be sanctioned for the image itself not having an accurate and sufficiently detailed alt-text. So I need that in addition. I have to describe each of my own images twice.

Besides, an LLM can't do anything close to what I do. And I know that from my own practical experience: twice I have had LLaVA describe an image that I had already described.

It starts with the fact that no AI can see details in the image itself that I can see when I'm on location. The AI would describe the image by looking at the image of that place. I describe my images by looking at the place itself, on location, i.e. precisely not at the image with its heavily reduced resolution. An AI can't do that.

Then, accurately describing and above all explaining these images requires extremely obscure niche knowledge. No AI could, from visually analysing one of my images, recognise and explain what kind of place this is, what the sim is called, which grid it is in, that the whole thing is based on OpenSim, and so on. Much less could all AIs do that. This information is simply too obscure, and it also changes quickly.

An extreme case is probably the description in this image post: at that point, the sim was only a few days or maybe a couple of weeks old. Within the image description, there is a very detailed description of a picture within this picture that is only a few hundred pixels in size. I not only identified the sim correctly, but also drew the pop-culture arc from this sim via Edgar Wallace to the Frühstyxradio on ffn and the cinema films derived from it. The object towards the right side alone I described in about 1,000 characters and explained in depth in another 4,000 characters.

I gave the same image to LLaVA to describe and then analysed LLaVA's description in depth. It is a far cry from my description and from being accurate and detailed. The aforementioned object, to which I devoted over 5,000 characters, LLaVA ignored entirely.

Nobody can tell me that another LLM could do this substantially better, let alone even better, more detailed, more informative, more competent and more accurate than I can.

CC: @wolf

#Long #LongPost #CWLong #CWLongPost #LangerPost #CWLangerPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #Hubzilla #Streams #(streams) #AltText #AltTextMeta #CWAltTextMeta #Bildbeschreibung #Bildbeschreibungen #BildbeschreibungenMeta #CWBildbeschreibungenMeta #KI #LLM #KIGegenMensch #MenschGegenKI #LLaVA
amir2000.nl 🎗🇮🇱🎗🇮🇱 (amir2000@mstdn.social)
2025-08-19

Major update to my photography workflow. From folder to gallery in one flow.
• Multi Set lets me queue many subjects in one go
• Review slider makes approvals fast and consistent
• LLaVA prefill drafts captions and keywords locally
The result is fewer restarts, faster reviews and cleaner metadata.
Full post with screenshots → amir2000.nl/blog/From-folder-t
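The post links to the full write-up rather than code, but a minimal sketch of what the local "LLaVA prefill" step could look like, assuming a local Ollama server with the llava model pulled and using its /api/generate endpoint (prompt wording and file handling are illustrative only, not taken from the linked post):

```python
# Sketch: draft a caption for one photo with a local LLaVA model via Ollama.
# Assumes Ollama is running on localhost with the "llava" model available;
# the prompt and output handling are illustrative assumptions.
import base64
import json
import urllib.request


def draft_caption(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "model": "llava",
        "prompt": "Draft a short caption and five keywords for this photo.",
        "images": [image_b64],
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


print(draft_caption("DSC_0001.jpg"))
```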

#CanonR5MarkII #AMIR2000NLPhotography #Photography #Automation #Workflow #Ollama #LLaVA #Python

2025-07-29

@mast4sc Thanks, I don't do that much coding (when I do, it's in #Cursor), and I didn't know #Llava yet 🤗

2025-07-16

New Apple study: AI that understands app interfaces like a human

Researchers at Apple, in collaboration with Finland's Aalto University, have presented a new artificial intelligence model called ILuvUI.

It is a vision-language model (VLM) that was specifically trained to understand and reason about the user interfaces (UI) of mobile apps based on screenshots and natural-language conversations. In benchmark comparisons, the new model outperformed the open-source model it was based on.

Most current vision-language models are trained on so-called natural images, such as photos of dogs or road signs. As a result, they perform considerably worse when dealing with structured environments like app interfaces. As the researchers explain, analysing only the text in a UI is not enough, because it misses the rich visual information; it is precisely the combination of both layers that is key to fully understanding the context, just as it is for humans.

To solve this problem, the research team took an existing open VLM called LLaVA and fine-tuned it specifically for analysing user interfaces. The key was training it on a synthetically generated dataset containing pairs of images (screenshots) and associated texts. This dataset included, among other things, question-and-answer interactions, detailed screen descriptions, predicted outcomes of actions, and even multi-step plans (e.g. "how to listen to the latest podcast episode" or "how to change the brightness settings"). Importantly, ILuvUI can analyse an entire screen from a simple text command, without the user having to point out a specific region of interest.

According to the Apple researchers, their approach could prove particularly useful in two main areas: accessibility (for people with disabilities) and automated testing of app interfaces. Future work may include larger image encoders, support for higher resolutions, and output in formats (e.g. JSON) that can integrate smoothly with existing UI frameworks.
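The paper itself is not quoted here; purely to illustrate the kind of screenshot-plus-text training pairs described above, a hypothetical sample might look roughly like the following. All field names and values are invented for illustration and are not taken from the ILuvUI dataset.

```python
# Hypothetical example of a synthetic UI training sample of the kind
# described above: a screenshot paired with a screen description, Q&A,
# and a multi-step action plan. Field names and values are invented.
sample = {
    "screenshot": "podcast_app_home.png",
    "screen_description": "Home screen of a podcast app with a list of "
                          "episodes and a playback bar at the bottom.",
    "qa": [
        {"question": "How do I listen to the latest episode?",
         "answer": "Tap the first episode in the list, then tap Play."},
    ],
    "action_plan": [
        "Open the Podcasts tab",
        "Select the newest episode at the top of the list",
        "Tap the Play button",
    ],
}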

#AI #Apple #badaniaNaukowe #dostępność #ILuvUI #interfejsUżytkownika #LLaVA #news #sztucznaInteligencja #uczenieMaszynowe #UI

Apple AI interfaces
Peter B. (p3ter)
2025-06-16

For my presentation & workshop in Bologna end of the week, I've put together a clean "" setup:

: with , and in and files on another USB3

Fast like a local install. Given enough RAM and CPU/GPU.

Scary how easy it is to do that.

Bonus: Creates self-replicating boot images of itself (in current configuration) in less than 5 minutes. 🤯 😎 .

Wow.

2025-03-17
@*_jayrope I think nothing of AI for describing my images. My images are about an extreme niche topic, and describing and explaining them takes extreme niche knowledge.

To demonstrate, I tried it twice with LLaVA. This image I described myself in eight hours and over 25,000 characters. LLaVA churned out a little over 550 characters in maybe half a minute. Here is the direct comparison.

Conclusion:
  • I could explain in minute detail where the image was made, in a way that even a layperson understands: the place, the sim, the grid, the software underneath, what that software is about, what a grid is, what a sim is and so on. That is what people need to know. LLaVA could only speculate roughly.
  • I described the avatar in quite some detail; LLaVA didn't at all and called it a "character".
  • LLaVA hallucinated about where the avatar is looking. I knew, even though it isn't visible from behind at all. I could also say why the avatar is seen from behind.
  • LLaVA didn't even find the text in the image. I transcribed verbatim almost all text that was in any way legible to me. I think I only forgot a number plate above a door in a picture within this picture.
  • I was able to draw the pop-culture arc from the place to Edgar Wallace and the Frühstyxradio. LLaVA wasn't.
  • LLaVA didn't even write that the image is black-and-white. I wrote, correctly, that in truth everything in-world, from the scenery to the avatar, is black-and-white.
  • That mysterious structure on the right of the image? LLaVA didn't even notice it. I described it in 3,000 characters and explained it in another 2,500.
  • LLaVA got the time of day completely wrong, because it knew neither which direction the camera is pointing nor how tall the trees casting the shadows actually are.
  • LLaVA also couldn't identify any of the trees as mountain pines.

The Altbot won't be able to do this significantly better.

#Long #LongPost #CWLong #CWLongPost #LangerPost #CWLangerPost #Bildbeschreibung #Bildbeschreibungen #BildbeschreibungenMeta #CWBildbeschreibungenMeta #AI #KI #LLaVA #Altbot
2025-02-22
@Anna Maier I don't know what constitutes a "good" example in your opinion, but I've got two examples of how bad AI is at describing images with extremely obscure niche content, much less explaining them.

In both cases, I had the Large Language and Vision Assistant describe one of my images, always a rendering from within a 3-D virtual world. And then I compared it with a description of the same image of my own.

That said, I didn't compare the AI description with my short description in the alt-text. I went all the way and compared it with my long description in the post, tens of thousands of characters long, which includes extensive explanations of things that the average viewer is unlikely to be familiar with. This is what I consider the benchmark.

Also, I fed the image at the resolution at which I posted it, 800x533 pixels, to the AI. But I myself didn't describe the image by looking at the image. I described it by looking around in-world. If an AI can't zoom in indefinitely and look around obstacles, and it can't, it's actually a disadvantage on the side of the AI and not an unfair advantage on my side.

So without further ado, exhibit A:

This post contains
  • an image with an alt-text that I've written myself (1,064 characters, including only 382 characters of description and 681 characters of explanation where the long description can be found),
  • the image description that I had LLaVA generate for me (558 characters)
  • my own long and detailed description (25,271 characters)
The immediate follow-up comment dissects and reviews LLaVA's description and reveals where LLaVA was too vague, where LLaVA was outright wrong and what LLaVA didn't mention although it should have.

If you've got some more time, exhibit B:

Technically, all this is in one thread. But for your convenience, I'll link to the individual messages.

Here is the start post with
  • an image with precisely 1,500 characters of alt-text, including 1,402 characters of visual description and 997 characters mentioning the long description in the post, all written by myself
  • my own long and detailed image description (60,553 characters)

Here is the comment with the AI description (1,120 characters; I've asked for a detailed description).

Here is the immediate follow-up comment with my review of the AI description.

#Long #LongPost #CWLong #CWLongPost #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA #AIVsHuman #HumanVsAI
Yann Büchau :nixos: (nobodyinperson@fosstodon.org)
2024-12-23

@Triffen Okay I might try it out 😅 But I don't know how good #llava is at detecting whether something was made with #generativeAI...

Yann Büchau :nixos: (nobodyinperson@fosstodon.org)
2024-12-23

Being constantly annoyed by all the generated images for blog posts and articles (seriously, STOP!) I thought about making something with #ollama #llava that identifies, then hides them from me - when I realized that the solution to less #generativeAI can't be *more* #LLM's 😓

2024-11-08

Does Vision Llama understand the Impressionists?

Hi everyone, my name is Arseniy, I'm a Data Scientist at Raft, and today I'm going to tell you about Visual Language Models (VLMs). Large language models have already become part of our lives; we use them to simplify everyday routine as well as to solve business problems. A new generation of vision transformer models has recently been released that noticeably simplifies image analysis, whatever domain the images come from. The September release of Llama-3.2-11b was especially notable, not so much because it is the first vision model in the Llama family, but because a whole family of models was released alongside it, including small ones with 1B and 3B parameters. And as you know, smaller means more usable.

habr.com/ru/companies/raft/art

#Vision_Transformers #Vision_Language_Models #multimodal_llm #Llama32 #qwen2vl #llava #art #art_history

2024-11-07

🔍 Major breakthrough in multimodal AI research:

#InfinityMM dataset launches with 43.4M entries across 4 categories: 10M image descriptions, 24.4M visual instructions, 6M high-quality instructions & 3M #AI generated data

🧠 Technical highlights:

New #AquilaVL2B model uses #LLaVA architecture with #Qwen25 language model & #SigLIP for image processing
Despite only 2B parameters, achieves state-of-the-art results in multiple benchmarks
Exceptional performance: #MMStar (54.9%), #MathVista (59%), #MMBench (75.2%)

🚀 Training innovation:

4-stage training process with increasing complexity
Combines image recognition, instruction classification & response generation
Uses #opensource models like RAM++ for data generation

💡 Industry impact:

Model trained on both #Nvidia A100 GPUs & Chinese chips
Complete dataset & model available to research community
Shows promising results compared to commercial systems like #GPT4V

arxiv.org/abs/2410.18558
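For reference, the LLaVA-style architecture mentioned above (a vision encoder feeding a language model through a projector) can be exercised with Hugging Face transformers roughly as follows. This is a minimal sketch using the stock llava-1.5 checkpoint purely for illustration; it is not the Aquila-VL-2B model from the post, and the prompt format and generation settings are assumptions.

```python
# Minimal sketch of running a LLaVA-style model (vision encoder + projector
# + language model) with Hugging Face transformers. Uses the public
# llava-1.5 checkpoint for illustration only, not Aquila-VL-2B.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("screenshot.png")
prompt = "USER: <image>\nDescribe this picture in detail. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```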

2024-10-31

VLMs as art experts

Hi everyone, my name is Arseniy, I'm a DS at Raft, and today I'm going to tell you about VLMs. Large language models have already become part of our lives; we use them to simplify everyday routine as well as to solve business problems. A new generation of vision transformer models has recently been released that noticeably simplifies image analysis, whatever domain the images come from. The September release of Llama-3.2-11b was especially notable, not so much because it is the first vision model in the Llama family, but because a whole family of models was released alongside it, including small ones with 1B and 3B parameters. And as you know, smaller means more usable.

habr.com/ru/articles/854864/

#transformers #VLM #Vision_Transformer #llama_32 #llava #qwen2vl #art #art_history #искусственный_интеллект

Dr. Fortyseven 🥃 █▓▒░ (fortyseven@defcon.social)
2024-09-23

Moving to #Llava drastically improved the quality of the generated commentary. 😅

Llava sees:
> The image features a large sign with information about the game "Crab Champions." The sign is placed on a sandy surface, possibly a beach. The sign provides details about the game, including updates, new features, and other relevant information. The sign is likely used to inform players about the game's progress and upcoming changes.

Generated, spoken commentary:
> _"Holy crap, look at that sign, it's like a beachside billboard from hell, littered with so much text I'm pretty sure it's gonna get its own zip code, updates on Crab Champions are apparently gonna change the game forever."_

#ai #llm

2024-09-04

#Groq Introduces LLaVA V1.5 7B on #GroqCloud 🚀🖼️

#LLaVA: Large Language and #Vision Assistant 🗣️👁️
- Combines #OpenAI's #CLIP and #Meta's #Llama2
- Supports #image, #audio, and #text modalities

Key Features:
- Visual #Question Answering 🤔
- Caption Generation 📝
- Optical Character Recognition 🔍
- Multimodal #Dialogue 💬

Available now on #GroqCloud #Developer Console for #multimodal #AI innovation 💻🔧

groq.com/introducing-llava-v1-
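Since the announcement doesn't show a request, here is a minimal sketch of calling a LLaVA model on GroqCloud through Groq's OpenAI-compatible endpoint. The model id shown is an assumption from around the launch and may since have been retired; the image URL is a placeholder.

```python
# Sketch: query a LLaVA model on GroqCloud via the OpenAI-compatible API.
# The model id is an assumption from the launch period; set GROQ_API_KEY
# in the environment before running.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="llava-v1.5-7b-4096-preview",   # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is written on this sign?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sign.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```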

Judith van Stegeren (jd7h@fosstodon.org)
2024-09-04

Really cool to encounter "our" LLaVA (Llama 2 + vision), which Yorick van Pelt and I deployed in the week it was released, in the official Replicate docs. 😍

#replicate #llava #llama #genai

Screenshot of the Replicate.com docs for running a model with Python on their machine learning platform. The docs mention model-id yorickvp/llava-13b.
2024-08-12
@Michal Bryxí 🌱 And while I'm at it, here's a quote-post of my comment in which I review the second AI description.

Jupiter Rowland wrote the following post Sat, 18 May 2024 00:24:46 +0200:

It's almost hilarious how clueless the AI was again. And how wrong.

First of all, the roof isn't curved in the traditional sense. The end piece kind of is, but the roof behind it is more complex. Granted, unlike me, the AI can't look behind the roof end, so it doesn't know.

Next, the roof end isn't reflective. It isn't even glossy. And brushed stainless steel shouldn't really reflect anything.

The AI fails to count the columns that hold the roof end, and it claims they're evenly spaced. They're anything but.

There are three letters "M" on the emblem, but none of them is stand-alone. There is visible text on the logo that does provide additional context: "Universal Campus", "patefacio radix" and "MMXI". Maybe LLaVA would have been able to decipher at least the first of these, had I fed it the image at its original resolution of 2100x1400 pixels instead of the one I've uploaded with a resolution of 800x533 pixels. Decide for yourself which was or would have been cheating.

"Well-maintained lawn". Ha. The lawn is painted on, and the ground is so bumpy that I wouldn't call it well-maintained.

The entrance of the building is visible. In fact, three of the five entrances are. Four if you count the one that can be seen through the glass on the front. And the main entrance is marked with that huge structure around it.

The "few scattered clouds" are mostly one large cloud.

At least LLaVA is still capable of recognising a digital rendering and tells us how. Just you wait until PBR is out, LLaVA.

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA

#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImagDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI
2024-08-12
@Michal Bryxí 🌱 And since you obviously haven't actually read anything I've linked to, here's a quote-post of my comment in which I dissect the first AI description.

Jupiter Rowland wrote the following post Tue, 05 Mar 2024 20:28:12 +0100:

(This is actually a comment. Find another post further up in this thread.)

Now let's pry LLaVA's image description apart, shall we?

The image appears to be a 3D rendering or a screenshot from a video game or a virtual environment.

Typical for an AI: It starts vague. That's because it isn't really sure what it's looking at.

This is not a video game. It's a 3-D virtual world.

At least, LLaVA didn't take this for a real-life photograph.

It shows a character

It's an avatar, not a character.

standing on a paved path with a brick-like texture.

This is the first time that the AI is accurate without being vague. However, there could be more details to this.

The character is facing away from the viewer,

And I can and do tell the audience in my own image description why my avatar is facing away from the viewer. Oh, and that it's the avatar of the creator of this picture, namely myself.

looking towards a sign or information board on the right side of the image.

Nope. Like the AI could see the eyeballs of my avatar from behind. The avatar is actually looking at the cliff in the background.

Also, it's clearly an advertising board.

The environment is forested with tall trees and a dense canopy, suggesting a natural, possibly park-like setting.

If I'm generous, I can let this pass as not exactly wrong. Only that there is no dense canopy, and this is not a park.

The lighting is subdued, with shadows cast by the trees, indicating either early morning or late afternoon.

Nope again. It's actually late morning. The AI doesn't know because it can't tell that the Sun is in the southeast, and because it has got no idea how tall the trees actually are, what with almost all treetops and half the shadow cast by the avatar being out of frame.

The overall atmosphere is calm and serene.

In a setting inspired by thrillers from the 1950s and 1960s. You're adorable, LLaVA. Then again, it was quiet because there was no other avatar present.

There's a whole lot in this image that LLaVA didn't mention at all. First of all, the most blatant shortcomings.

First of all, the colours. Or the lack of them. LLaVA doesn't say with a single word that everything is monochrome. What it's even less aware of is that the motif itself is monochrome, i.e. this whole virtual place is actually monochrome, and the avatar is monochrome, too.

Next, what does my avatar look like? Gender? Skin? Hair? Clothes?

Then there's that thing on the right. LLaVA doesn't even mention that this thing is there.

It doesn't mention the sign to the left, it doesn't mention the cliff at the end of the path, it doesn't mention the mountains in the background, and it's unaware of both the bit of sky near the top edge and the large building hidden behind the trees.

And it does not transcribe even one single bit of text in this image.

And now for what I think should really be in the description, but what no AI will ever be able to describe from looking at an image like this one.

A good image description should mention where an image was taken. AIs can currently only tell that when they're fed famous landmarks. AI won't be able to tell from looking at this image that it was taken at the central crossroads at Black White Castle, a sim in the OpenSim-based Pangea Grid anytime soon. And I'm not even talking about explaining OpenSim, grids and all that to people who don't know what it is.

Speaking of which, the object to the right. LLaVA completely ignores it. However, it should be able to not only correctly identify it as an OpenSimWorld beacon, but also describe what it looks like and explain to the reader what an OpenSimWorld beacon is, what OpenSimWorld is etc. because it should know that this can not be expected to be common knowledge. My own description does that in round about 5,000 characters.

And LLaVA should transcribe what's written on the touch screen which it should correctly identify as a touch screen. It should also mention the sign on the left and transcribe what's written on it.

In fact, all text anywhere within the borders of the picture should be transcribed 100% verbatim. Since there's no rule against transcribing text that's so small that it's illegible or that's so tiny that it's practically invisible or that's partially obscured or partially out of frame, a good AI should be capable of transcribing such text 100% verbatim in its entirety as well. Unless text is too small for me to read in-world, I can and do that.

And how about not only knowing that the advertising board is an advertising board, but also mentioning and describing what's on it? Technically speaking, there's actually a lot of text on that board, and in order to transcribe it, its context needs to be described. That said, I must admit I was sloppy myself and omitted a whole lot of transcriptions in my own description.

Still, AI has a very very long way to go. And it will never fully get there.

#Long #LongPost #CWLong #CWLongPost #AltText #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA

#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImagDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI
2024-08-12
@Michal Bryxí 🌱
Without any context

The context matters. A whole lot.

A simple real-life cat photograph can be described in a few hundred characters, and everyone knows what it's all about. It doesn't need much visual description because it's mainly only the cat that matters. Just about everyone knows what real-life cats generally look like, apart from the ways they differ from one another. Even people born 100% blind should have a rough enough idea of what a cat is and what it looks like from a) being told if they inquire and b) touching and petting a few cats.

Thus, most elements of a real-life cat photograph can safely be assumed to be common knowledge. They don't require description, and they don't require explanation because everyone should know what a cat is.

Now, let's take the image which LLaVA has described in 558 characters, and which I've previously described in 25,271 characters.

For one, it doesn't focus on anything. It shows an entire scene. If the visual description has to include what's important, it has to include everything in the image because everything in the image is important just the same.

Besides, it's a picture from a 3-D virtual world. Not from the real world. People don't know anything about this kind of 3-D virtual worlds in general, and they don't know anything about this place in particular. In this picture, nothing can safely be assumed to be common knowledge. For blind or visually-impaired users even less.

People may want to know where this image was made. AI won't be able to figure that out. AI can't examine that picture and immediately and with absolute certainty recognise that it was created on a sim called Black-White Castle on an OpenSim grid named Pangea Grid, especially seeing as that place was only a few days old when I was there. LLaVA wasn't even sure if it's a video game or a virtual world. So AI won't be able to tell people.

Nor does the AI know whether any of the location information can be considered common knowledge, and therefore whether it needs to be explained so that humans will understand it.

I, the human describer, on the other hand, can tell people where exactly this image was made. And I can explain it to them in such a way that they'll understand it with zero prior knowledge about the matter.

Next point: text transcripts. LLaVA didn't even notice that there is text in the image, much less transcribe it. Not transcribing every bit of text in an image is sloppy; not transcribing any text in an image is ableist.

However, no other AI would be able to transcribe the text in this image either. That's because no AI can read any of it: it's all too small and, on top of that, too low-contrast for reliable OCR. All the AI has to work with is the image I've posted at a resolution of 800x533 pixels.

I myself can see the scenery at nigh-infinite resolution by going there. No AI can do that, and no LLM AI will ever be able to do that. And so I can read and transcribe all text in the image 100% verbatim with 100% accuracy.

However, text transcripts require some room in the description, also because they additionally require descriptions of where the text is.

I win again. And so does the long, detailed description.

Would you rather have alt text that is:

I'm not sure if this is typical Mastodon behaviour because it's impossible for Mastodon users to imagine that images can be described elsewhere than in the alt-text (they can, and I have), or if it's intentional trolling.

The 25,271 characters did not go into the alt-text! They went into the post.

I can put so many characters into a post. I'm not on Mastodon. I'm on Hubzilla which has never had and still doesn't have any character limits.

In the alt-text, there's a separate, shorter, still self-researched and hand-written image description to satisfy those who absolutely demand there be an image description in the alt-text.

25,271 characters in alt-text would cause Mastodon to cut 23,771 characters off and throw them away.

#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImagDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI
2024-08-12
@Michal Bryxí 🌱
Prediction: Alt text will be generated by AI directly on the consumer's side so that *they* can tell what detail, information density, parts of the picture are important for *them*. And pre-written alt text will be frowned upon.

Won't happen.

Maybe AI sometimes happens to be as good as humans when it comes to describing generic, everyday images that are easy to describe. By the way, I keep seeing AI miserably failing to describe cat photos.

But when it comes to extremely obscure niche content, AI can only produce useless train wrecks. And this will never change. When it comes to extremely obscure niche content, AI not only requires full, super-detailed, up-to-the-minute knowledge of all aspects of the topic, down to niches within niches within the niche, but it must be able to explain it, and it must know whether and to what extent it's necessary to explain it.

I've pitted LLaVA against my own hand-written image descriptions. Twice. Not simply against the short image descriptions in my alt-texts, but against the full, long, detailed, explanatory image descriptions in the posts.

And LLaVA failed so, so miserably. What little it described, it often got it wrong. More importantly, LLaVA's descriptions were nowhere near explanatory enough for a casual audience with no prior knowledge in the topic to really understand the image.

500+ characters generated by LLaVA in five seconds are no match against my own 25,000+ characters that took me eight hours to research and write.

1,100+ characters generated by LLaVA in 30 seconds are no match against my own 60,000+ characters that took me two full days to research and write.

When I describe my images, I put abilities to use that AI will never have. Including, but not limited to the ability to join and navigate 3-D virtual worlds. Not to mention that an AI would have to be able to deduce from a picture where exactly a virtual world image was created, and how to get there.

So no, ChatGPT won't write circles around me by next year. Or ever. Neither will any other AI out there.

#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImagDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI
