#HumanVsAI

N-gated Hacker News (@ngate)
2025-04-30

🤖🎉 "AI-first" is the new buzzword du jour for tech execs who think their human staff are basically glorified paperweights. So long, "return to office"—hello, AI overlords! If only they could find an AI to manage their own brilliant ideas. 🙄💡
anildash.com//2025/04/19/ai-fi

2025-04-20
@-0--1- @David G. Smith If anything, the AI used to describe the image should be selectable, and the available AIs should be configurable, at least by the admin. And above all, AI image description must not be mandatory and hard-coded. There must always be a way to describe an image manually, no matter how many people swear that AI is better at describing any image out there than any human.
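
To make the idea concrete, here is a minimal sketch of what such a setting could look like. Everything in it (names, fields, the describer list) is a hypothetical illustration, not taken from any existing Fediverse project:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ImageDescriptionSettings:
    """Hypothetical per-instance settings for AI image description."""
    # Describers the admin has enabled; "none" disables AI entirely.
    available_describers: list[str] = field(
        default_factory=lambda: ["llava", "none"]
    )
    # The user's chosen describer; None means "always describe manually".
    chosen_describer: Optional[str] = None

def describe_image(settings: ImageDescriptionSettings,
                   manual_text: Optional[str]) -> str:
    """A manually written description always takes precedence;
    AI is an opt-in fallback, never mandatory."""
    if manual_text:
        return manual_text
    describer = settings.chosen_describer
    if describer and describer != "none" \
            and describer in settings.available_describers:
        return run_describer(describer)  # stub; a real backend would go here
    return ""  # no description; the UI should ask the user for one instead

def run_describer(name: str) -> str:
    # Placeholder for whichever model the admin enabled.
    raise NotImplementedError(f"describer backend {name!r} is not wired up")
```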

#Long #LongPost #CWLong #CWLongPost #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #AIVsHuman #HumanVsAI
2025-02-22
@Anna Maier I don't know what constitutes a "good" example in your opinion, but I've got two examples of how bad AI is at describing images with extremely obscure niche content, much less explaining them.

In both cases, I had the Large Language and Vision Assistant describe one of my images, in each case a rendering from within a 3-D virtual world. And then I compared it with my own description of the same image.

That said, I didn't compare the AI description with my short description in the alt-text. I went all the way and compared it with my long description in the post, tens of thousands of characters long, which includes extensive explanations of things that the average viewer is unlikely to be familiar with. This is what I consider the benchmark.

Also, I fed the image to the AI at the resolution at which I posted it, 800x533 pixels. But I myself didn't describe the image by looking at the image; I described it by looking around in-world. If an AI can't zoom in indefinitely and look around obstacles, and it can't, that's a disadvantage on the AI's side, not an unfair advantage on mine.

So without further ado, exhibit A:

This post contains
  • an image with an alt-text that I've written myself (1,064 characters, including only 382 characters of description and 681 characters explaining where the long description can be found),
  • the image description that I had LLaVA generate for me (558 characters),
  • my own long and detailed description (25,271 characters).
The immediate follow-up comment dissects and reviews LLaVA's description and reveals where LLaVA was too vague, where LLaVA was outright wrong and what LLaVA didn't mention although it should have.

If you've got some more time, exhibit B:

Technically, all this is in one thread. But for your convenience, I'll link to the individual messages.

Here is the start post with
  • an image with precisely 1,500 characters of alt-text, including 1,402 characters of visual description and 997 characters mentioning the long description in the post, all written by myself
  • my own long and detailed image description (60,553 characters)

Here is the comment with the AI description (1,120 characters; I've asked for a detailed description).

Here is the immediate follow-up comment with my review of the AI description.

#Long #LongPost #CWLong #CWLongPost #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA #AIVsHuman #HumanVsAI
2025-01-07
@Blue Headline - Tech News I've let AI describe two of my images, after I had described them myself, each one twice even. It was just to see what would happen.

The results were about as pathetic as expected. What the AI whipped up was incomplete and inaccurate. It wasn't even a match for my short descriptions, which I had written for the alt-texts, much less for my long descriptions, which went directly into the posts.

Maybe AI can describe a cat photo (and I've seen it fail even at that task). But AI will never be as good at describing images showing extremely obscure niche content as someone who really knows that particular niche topic inside-out. And I'm not even talking about explaining images in addition to describing them.

#Long #LongPost #CWLong #CWLongPost #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #AIVsHuman #HumanVsAI
Blue Headline - Tech News (@BlueHeadline)
2024-12-07

🤖 Humans trust their own work more than AI, even when AI performs just as well.

Why? It’s all about trust and collaboration. Humans bring creativity, while AI brings precision. Together, they’re unstoppable!

🔗 Learn how we can close the trust gap: blueheadline.com/tech-news/bia

What do you think—do you trust AI outputs? Share your thoughts below!

2024-10-19
@Jeffrey D. Stark But you shouldn't have to.

Besides, I'd like to see an AI accurately transcribe two lines of text that take up six pixels in width and one pixel in height in an image. Or accurately recognise the texture on the clothes of my avatar at this resolution and in this lighting.

Again, I can do both.

#ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #A11y #Accessibility #AI #AIVsHuman #HumanVsAI
2024-10-06
Mastodon is increasingly moving from #NoAltTextNoBoost to #CrappyAltTextNoBoost, and I can see it move further to #NotEnoughAltTextNoBoost.

It is moving from ostracising people merely for not providing image descriptions, past ostracising people for providing useless image descriptions, towards ostracising people for providing AI-generated image descriptions because those are at least partially wrong. The next victims may be people whose image descriptions leave out elements of the image which others deem necessary to describe.

As quality requirements for image descriptions are being raised, I can't possibly lower the quality of my own image descriptions. If anything, I'll continue to upgrade my own image descriptions to stay ahead.

This is also why I'm worried about moving the long descriptions from the post text body into linked external documents. Not having certain descriptions and any explanations anywhere in the post anymore may backfire, and the external documents themselves may not be accessible and inclusive after all.

Interestingly, this is not congruent with what I read from actually non-sighted people. They don't even seem to care about accuracy, which they can't verify anyway, as long as the image description is funny and/or whimsical. Since that seems to be exactly what AI delivers, it's no wonder that many blind people prefer image descriptions from BeMyAI over image descriptions from human experts.

I think I'll keep on writing my monster descriptions, two for each original image. If any of you who aren't sighted don't like them for not being whimsical enough, feel free to ignore the hours or days of work I've put into them, fire up your AI and have your own image description generated.

@accessibility group @a11y group

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #Mastodon #MastodonPolice #FediPolice #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #Blind #VisuallyImpaired #A11y #Accessibility #AI #AIVsHuman #HumanVsAI
2024-10-05
So you think AI is always better at describing and even explaining any image out there than any human, including insiders and experts on the very topic shown in the image? 100% accurately? And at a higher level of detail than said humans?

Well, here's an image.

I'd like to see AI identify the place shown in the image as the central crossing at BlackWhite Castle and identify BlackWhite Castle as a standard-region-sized sim on Pangea Grid, a virtual world or so-called "grid" based on OpenSimulator.

I'd like to see AI explain the above, all the way down to a level that can easily be understood by someone who has only got a rough idea about what virtual worlds are.

I'd like to see AI correctly mention the pop-cultural links from this sim to German Edgar Wallace films and Frühstyxradio that are obvious to me.

I'd like to see AI correctly identify the avatar in the middle. By name. And know that identifying the avatar is appropriate in this context.

I'd like to see AI know and tell the real reason why the avatar is only shown from behind.

I'd like to see AI recognise that the image was not edited into monochrome, but it's actually both the avatar and the entire sim with everything on and around it that's monochrome.

I'd like to see AI transcribe text that's unreadable in the image. 100% accurately verbatim, letter by letter.

I'd like to see AI identify the object to the right and explain its origin, its purpose and its functionality in detail.

I'd like to see AI discover and mention the castle in the background.

I'd like to see AI accurately figure out whether it's necessary to explain any of the above to the expected audience and, if correctly deemed necessary, do so. And explain the explanations if correctly deemed necessary.

I'd like to see AI know and correctly mention which direction the camera is facing.

Finally, I'd like to see AI automatically generate two image descriptions, a full and detailed one with all explanations and a shorter one that can fit into 1,500 characters minus the number of characters necessary to mention the full description and explain its location.
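
For what it's worth, the character budget for that shorter description is simple arithmetic. A minimal Python sketch, assuming the 1,500-character alt-text limit mentioned above and a pointer note that is purely a made-up example:

```python
ALT_TEXT_LIMIT = 1500  # alt-text limit imposed by several Fediverse projects

# Hypothetical note telling readers where the full description lives.
pointer_note = "A much longer, fully explained description is in the post itself."

# Whatever the pointer note doesn't use is left for the short visual description.
short_description_budget = ALT_TEXT_LIMIT - len(pointer_note)
print(short_description_budget)  # 1500 - 65 = 1435 characters remain
```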

When I posted the same image, I did all of the above. And more.

In fact, if AI is supposed to be better than me, I expect it to identify all trees in the image, not only the mountain pines, to give an even more detailed description of the motel advertisement and to give a much more detailed description of the map, including verbatim transcripts of all text on it and accurate information on what place is shown on the map in the first place.

If AI is supposed to be better than me, I expect it to
  • describe, explain and transcribe everything that I describe, explain and transcribe
  • describe, explain and transcribe even more on top of that
  • even more accurately than I do
  • more whimsically
  • and in much fewer characters.

All, by the way, fully automatically with no human intervention except for maybe a simple prompt to describe the image for a certain Fediverse project.

#Long #LongPost #CWLong #CWLongPost #OpenSim #OpenSimulator #Metaverse #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #A11y #Accessibility #AI #AIVsHuman #HumanVsAI
2024-09-06
@Charlotte Joanne I'm among those who want to do what's right myself. But I go to greater lengths.

Describing my images at all doesn't cut it for me. I go into small details for those who would love to go exploring a whole new and unknown world just by looking at my images, but who can't see them. If something about my images isn't common knowledge, I explain it, I explain the explanation and so forth, just because I think having people look such things up is not accessible and not inclusive.

Whenever someone likes something about someone else's image descriptions, I try to implement it myself. That's why I go as far as describing the position, height and angle of the camera if it's out of the ordinary.

I also try hard to follow as many rules of good image descriptions as possible, and to follow them to a tee. I transcribe text that's impossible to read: there's a rule that all text within the borders of an image must be transcribed, and there's none exempting unreadable text. I always prefer to err on the side of too much.

The only rule I break is the rule that alt-text must be as short as possible, preferably no longer than 200 characters. But that rule conflicts with the other rules, with what seems to make an image description good and with what my images in particular need. And besides, there seem to be more people in the Fediverse who like detailed descriptions than people who insist on short alt-text.

And so I fill the alt-text up to the limit of 1,500 characters imposed by several Fediverse projects. And what goes into the alt-text is actually already greatly shortened from the sheer monstrosity of a detailed image description that I put into the post.

I can take several days to describe one image. The resulting full, long image description can be longer than a hundred standard Mastodon toots for one single image, just so that it can provide all information that I think must be provided.

Obviously, this fails to satisfy everyone, or even nearly everyone. I guess that for many Fediverse users, even my short descriptions in the alt-text are too long because they keep exceeding 1,000 characters. And even then, they're lacking. In particular, they're almost always lacking text transcripts because they don't have enough room.

The text transcripts are in the full, long, detailed description in the post. But many people can't even be bothered to open the content warning behind which the post is hidden, much less read tens of thousands of characters of image description or have them read to them.

And then I come across things like this blog post by @Robert Kingett, blind, which says that AI image descriptions are generally vastly superior to human-written ones.

Apparently, AI is fully capable of actually perfectly satisfying absolutely everyone with image descriptions, no matter what kind of image has to be described, and no matter who the audience may be.

Apparently, the minimum requirements for image descriptions in the Fediverse have shifted. Halfway accurate descriptions aren't that much better than nothing anymore. They aren't good enough anymore. No matter what humans produce, it isn't good enough anymore.

Even if I spend two full days, sunrise to sunset, describing one single image in over 60,000 characters, which I have actually done, the description isn't good enough. And I don't just mean good enough in size. I also mean good enough in accuracy, level of detail and informativity.

No matter how niche and how obscure the topic of my images is, any AI out there can describe the same image in fewer characters, but at the same time in more details, with more information and even factually more accurately. And this is apparently the minimum level that counts as good enough.

Basically, my image descriptions only serve to satisfy Mastodon's fully sighted alt-text police and to give me an edge over their quality requirements. At least until they decide that my descriptions aren't useful enough. From that point on, I'll be the only one out there who finds AI descriptions sub-par in comparison with my own.

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #AIVsHuman #HumanVsAI #A11y #Accessibility
2024-08-26
@Ness
And I agree that community is preferable to AI. AI won't be able to interpret cultural nuance or in-jokes, either. Thanks again.

AI will always be limited in comparison with humans. In order for it to be perfect, every AI out there must be absolutely perfectly omniscient in even the most obscure niche topics possible.

To take my own images as an example: AI may be capable of identifying a virtual 3-D scene as such and tell it from a real-life photograph. But AI cannot tell whether it's a virtual world or a video game. And no AI out there can tell right off the bat with 100% certainty from any image thrown at it that the image was made in a world based on OpenSimulator. To be fair, very very very few humans can.

In order to replace me and my manual writing, every last AI out there would have to be able to tell from an image in which place it was rendered, on which sim (one that had been launched only some three days earlier), in which grid.

Oh, and no AI out there will ever be able to transcribe text that's a fraction of a pixel high. I can. For I don't read the text from the image, but from the original.

CC: @Robert Kingett, blind

#Long #LongPost #CWLong #CWLongPost #OpenSim #OpenSimulator #Metaverse #VirtualWorlds #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #AIVsHuman #HumanVsAI
2024-08-12
@Michal Bryxí 🌱 And while I'm at it, here's a quote-post of my comment in which I review the second AI description.

Jupiter Rowland wrote the following post on Sat, 18 May 2024 00:24:46 +0200:

It's almost hilarious how clueless the AI was again. And how wrong.

First of all, the roof isn't curved in the traditional sense. The end piece kind of is, but the roof behind it is more complex. Granted, unlike me, the AI can't look behind the roof end, so it doesn't know.

Next, the roof end isn't reflective. It isn't even glossy. And brushed stainless steel shouldn't really reflect anything.

The AI fails to count the columns that hold the roof end, and it claims they're evenly spaced. They're anything but.

There are three letters "M" on the emblem, but none of them is stand-alone. There is visible text on the logo that does provide additional context: "Universal Campus", "patefacio radix" and "MMXI". Maybe LLaVA would have been able to decipher at least the former, had I fed it the image at its original resolution of 2100x1400 pixels instead of the one I uploaded at a resolution of 800x533 pixels. Decide for yourself which was, or would have been, cheating.

"Well-maintained lawn". Ha. The lawn is painted on, and the ground is so bumpy that I wouldn't call it well-maintained.

The entrance of the building is visible. In fact, three of the five entrances are. Four if you count the one that can be seen through the glass on the front. And the main entrance is marked with that huge structure around it.

The "few scattered clouds" are mostly one large cloud.

At least LLaVA is still capable of recognising a digital rendering and tells us how. Just you wait until PBR is out, LLaVA.

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA

#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI
2024-08-12
@Michal Bryxí 🌱 And since you obviously haven't actually read anything I've linked to, here's a quote-post of my comment in which I dissect the first AI description.

Jupiter Rowland wrote the following post on Tue, 05 Mar 2024 20:28:12 +0100:

(This is actually a comment. Find another post further up in this thread.)

Now let's pry LLaVA's image description apart, shall we?

The image appears to be a 3D rendering or a screenshot from a video game or a virtual environment.

Typical of an AI: it starts out vague. That's because it isn't really sure what it's looking at.

This is not a video game. It's a 3-D virtual world.

At least, LLaVA didn't take this for a real-life photograph.

It shows a character

It's an avatar, not a character.

standing on a paved path with a brick-like texture.

This is the first time that the AI is accurate without being vague. However, there could be more details to this.

The character is facing away from the viewer,

And I can and do tell the audience in my own image description why my avatar is facing away from the viewer. Oh, and that it's the avatar of the creator of this picture, namely myself.

looking towards a sign or information board on the right side of the image.

Nope. As if the AI could see the eyeballs of my avatar from behind. The avatar is actually looking at the cliff in the background.

Also, it's clearly an advertising board.

The environment is forested with tall trees and a dense canopy, suggesting a natural, possibly park-like setting.

If I'm generous, I can let this pass as not exactly wrong. Except that there is no dense canopy, and this is not a park.

The lighting is subdued, with shadows cast by the trees, indicating either early morning or late afternoon.

Nope again. It's actually late morning. The AI doesn't know that because it can't tell that the Sun is in the southeast, and because it has no idea how tall the trees actually are, what with almost all the treetops and half the shadow cast by the avatar being out of frame.

The overall atmosphere is calm and serene.

In a setting inspired by thrillers from the 1950s and 1960s. You're adorable, LLaVA. Then again, it was quiet because there was no other avatar present.

There's a whole lot in this image that LLaVA didn't mention at all. Let's start with the most blatant shortcomings.

First of all, the colours. Or the lack of them. LLaVA doesn't say with a single word that everything is monochrome. What it's even less aware of is that the subject itself is monochrome, i.e. this whole virtual place is actually monochrome, and the avatar is monochrome, too.

Next, what does my avatar look like? Gender? Skin? Hair? Clothes?

Then there's that thing on the right. LLaVA doesn't even mention that this thing is there.

It doesn't mention the sign to the left, it doesn't mention the cliff at the end of the path, it doesn't mention the mountains in the background, and it's unaware of both the bit of sky near the top edge and the large building hidden behind the trees.

And it does not transcribe even one single bit of text in this image.

And now for what I think should really be in the description, but what no AI will ever be able to describe from looking at an image like this one.

A good image description should mention where an image was taken. AIs can currently only tell that when they're fed famous landmarks. No AI will be able to tell anytime soon, just from looking at this image, that it was taken at the central crossroads at Black White Castle, a sim in the OpenSim-based Pangea Grid. And I'm not even talking about explaining OpenSim, grids and all that to people who don't know what they are.

Speaking of which, the object to the right. LLaVA completely ignores it. However, it should be able not only to correctly identify it as an OpenSimWorld beacon, but also to describe what it looks like and explain to the reader what an OpenSimWorld beacon is, what OpenSimWorld is etc., because it should know that this cannot be expected to be common knowledge. My own description does that in roughly 5,000 characters.

And LLaVA should transcribe what's written on the touch screen which it should correctly identify as a touch screen. It should also mention the sign on the left and transcribe what's written on it.

In fact, all text anywhere within the borders of the picture should be transcribed 100% verbatim. Since there's no rule against transcribing text that's so small that it's illegible, or so tiny that it's practically invisible, or that's partially obscured or partially out of frame, a good AI should be capable of transcribing such text 100% verbatim in its entirety as well. Unless text is too small for me to read in-world, I can do that, and I do.

And how about not only knowing that the advertising board is an advertising board, but also mentioning and describing what's on it? Technically speaking, there's actually a lot of text on that board, and in order to transcribe it, its context needs to be described. That said, I must admit I was sloppy myself and omitted a whole lot of transcriptions in my own description.

Still, AI has a very very long way to go. And it will never fully get there.

#Long #LongPost #CWLong #CWLongPost #AltText #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA

#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI
2024-08-12
@Michal Bryxí 🌱
Without any context

The context matters. A whole lot.

A simple real-life cat photograph can be described in a few hundred characters, and everyone knows what it's all about. It doesn't need much visual description because it's mainly only the cat that matters. Just about everyone knows what real-life cats generally look like, apart from the ways they differ from one another. Even people born 100% blind should have a rough enough idea of what a cat is and what it looks like from a) being told if they inquire and b) touching and petting a few cats.

Thus, most elements of a real-life cat photograph can safely be assumed to be common knowledge. They don't require description, and they don't require explanation because everyone should know what a cat is.

Now, let's take the image which LLaVA has described in 558 characters, and which I've previously described in 25,271 characters.

For one, it doesn't focus on anything. It shows an entire scene. If the visual description has to include what's important, it has to include everything in the image, because everything in the image is equally important.

Besides, it's a picture from a 3-D virtual world, not from the real world. People don't know anything about this kind of 3-D virtual world in general, and they don't know anything about this place in particular. In this picture, nothing can safely be assumed to be common knowledge, least of all for blind or visually-impaired users.

People may want to know where this image was made. AI won't be able to figure that out. AI can't examine that picture and immediately and with absolute certainty recognise that it was created on a sim called Black-White Castle on an OpenSim grid named Pangea Grid, especially seeing as that place was only a few days old when I was there. LLaVA wasn't even sure if it's a video game or a virtual world. So AI won't be able to tell people.

AI doesn't know either whether any of the location information can be considered common knowledge, and therefore whether it needs to be explained so humans will understand it.

I, the human describer, on the other hand, can tell people where exactly this image was made. And I can explain it to them in such a way that they'll understand it with zero prior knowledge about the matter.

Next point: text transcripts. LLaVA didn't even notice that there is text in the image, much less transcribe it. Not transcribing every bit of text in an image is sloppy; not transcribing any text in an image is ableist.

No other AI would be able to transcribe the text in this image either. That's because no AI can read any of it: it's all too small and, on top of that, too low-contrast for reliable OCR. All the AI has to go on is the image I've posted at a resolution of 800x533 pixels.

I myself can see the scenery at nigh-infinite resolution by going there. No AI can do that, and no LLM AI will ever be able to do that. And so I can read and transcribe all text in the image 100% verbatim with 100% accuracy.

However, text transcripts require some room in the description, also because they additionally require descriptions of where the text is.

I win again. And so does the long, detailed description.

Would you rather have alt text that is:

I'm not sure if this is typical Mastodon behaviour, because it's impossible for Mastodon users to imagine that images can be described anywhere other than in the alt-text (they can be, and I have), or if it's intentional trolling.

The 25,271 characters did not go into the alt-text! They went into the post.

I can put so many characters into a post. I'm not on Mastodon. I'm on Hubzilla which has never had and still doesn't have any character limits.

In the alt-text, there's a separate, shorter, still self-researched and hand-written image description to satisfy those who absolutely demand there be an image description in the alt-text.

25,271 characters in alt-text would cause Mastodon to cut 23,771 characters off and throw them away.
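
A quick sketch of that arithmetic, assuming Mastodon's 1,500-character cap on media descriptions:

```python
MASTODON_ALT_TEXT_LIMIT = 1500

long_description = "x" * 25271  # stand-in for the 25,271-character description

# Anything past the limit would simply be discarded.
kept = long_description[:MASTODON_ALT_TEXT_LIMIT]
lost = len(long_description) - len(kept)
print(lost)  # 23771 characters cut off and thrown away
```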

#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI
2024-08-12
@Michal Bryxí 🌱
Prediction: Alt text will be generated by AI directly on the consumer's side so that *they* can tell what detail, information density, parts of the picture are important for *them*. And pre-written alt text will be frowned upon.

Won't happen.

Maybe AI sometimes happens to be as good as humans when it comes to describing generic, everyday images that are easy to describe. By the way, I keep seeing AI miserably failing to describe cat photos.

But when it comes to extremely obscure niche content, AI can only produce useless train wrecks. And this will never change. When it comes to extremely obscure niche content, AI not only requires full, super-detailed, up-to-the-minute knowledge of all aspects of the topic, down to niches within niches within the niche, but it must be able to explain it, and it must know whether and to what extent it's necessary to explain it.

I've pitted LLaVA against my own hand-written image descriptions. Twice. Not simply against the short image descriptions in my alt-texts, but against the full, long, detailed, explanatory image descriptions in the posts.

And LLaVA failed so, so miserably. What little it described, it often got wrong. More importantly, LLaVA's descriptions were nowhere near explanatory enough for a casual audience with no prior knowledge of the topic to really understand the image.

500+ characters generated by LLaVA in five seconds are no match against my own 25,000+ characters that took me eight hours to research and write.

1,100+ characters generated by LLaVA in 30 seconds are no match against my own 60,000+ characters that took me two full days to research and write.

When I describe my images, I put abilities to use that AI will never have, including, but not limited to, the ability to join and navigate 3-D virtual worlds. Not to mention that an AI would have to be able to deduce from a picture where exactly a virtual-world image was created, and how to get there.

So no, ChatGPT won't write circles around me by next year. Or ever. Neither will any other AI out there.

#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI
2024-06-28
@Stefan Bohacek Well, I already do, and I guess you know by now.

At least I don't think my image descriptions are "basic". They may be "plain" and not "inspiring", but if "basic" with no drivel in-between already amounts to anything between 25,000 and over 60,000 characters, should I really decorate my image descriptions and inflate them further?

#AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #AIVsHuman #HumanVsAI
2024-06-28
@Jupiter Rowland And here is a quote-post of my review of the second image description by LLaVA.

Jupiter Rowland wrote the following post on Sat, 18 May 2024 00:24:46 +0200:

It's almost hilarious how clueless the AI was again. And how wrong.

First of all, the roof isn't curved in the traditional sense. The end piece kind of is, but the roof behind it is more complex. Granted, unlike me, the AI can't look behind the roof end, so it doesn't know.

Next, the roof end isn't reflective. It isn't even glossy. And brushed stainless steel shouldn't really reflect anything.

The AI fails to count the columns that hold the roof end, and it claims they're evenly spaced. They're anything but.

There are three letters "M" on the emblem, but none of them is stand-alone. There is visible text on the logo that does provide additional context: "Universal Campus", "patefacio radix" and "MMXI". Maybe LLaVA would have been able to decipher at least the former, had I fed it the image at its original resolution of 2100x1400 pixels instead of the one I uploaded at a resolution of 800x533 pixels. Decide for yourself which was, or would have been, cheating.

"Well-maintained lawn". Ha. The lawn is painted on, and the ground is so bumpy that I wouldn't call it well-maintained.

The entrance of the building is visible. In fact, three of the five entrances are. Four if you count the one that can be seen through the glass on the front. And the main entrance is marked with that huge structure around it.

The "few scattered clouds" are mostly one large cloud.

At least LLaVA is still capable of recognising a digital rendering and tells us how. Just you wait until PBR is out, LLaVA.

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA

(3/3)

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA #AIVsHuman #HumanVsAI
2024-06-28
@Robert Kingett, blind For convenience, here is a quote-post of my review of the first image description by LLaVA.

Jupiter Rowland wrote the following post on Tue, 05 Mar 2024 20:28:12 +0100:

(This is actually a comment. Find another post further up in this thread.)

Now let's pry LLaVA's image description apart, shall we?

The image appears to be a 3D rendering or a screenshot from a video game or a virtual environment.

Typical of an AI: it starts out vague. That's because it isn't really sure what it's looking at.

This is not a video game. It's a 3-D virtual world.

At least, LLaVA didn't take this for a real-life photograph.

It shows a character

It's an avatar, not a character.

standing on a paved path with a brick-like texture.

This is the first time that the AI is accurate without being vague. However, there could be more details to this.

The character is facing away from the viewer,

And I can and do tell the audience in my own image description why my avatar is facing away from the viewer. Oh, and that it's the avatar of the creator of this picture, namely myself.

looking towards a sign or information board on the right side of the image.

Nope. As if the AI could see the eyeballs of my avatar from behind. The avatar is actually looking at the cliff in the background.

Also, it's clearly an advertising board.

The environment is forested with tall trees and a dense canopy, suggesting a natural, possibly park-like setting.

If I'm generous, I can let this pass as not exactly wrong. Except that there is no dense canopy, and this is not a park.

The lighting is subdued, with shadows cast by the trees, indicating either early morning or late afternoon.

Nope again. It's actually late morning. The AI doesn't know that because it can't tell that the Sun is in the southeast, and because it has no idea how tall the trees actually are, what with almost all the treetops and half the shadow cast by the avatar being out of frame.

The overall atmosphere is calm and serene.

In a setting inspired by thrillers from the 1950s and 1960s. You're adorable, LLaVA. Then again, it was quiet because there was no other avatar present.

There's a whole lot in this image that LLaVA didn't mention at all. Let's start with the most blatant shortcomings.

First of all, the colours. Or the lack of them. LLaVA doesn't say with a single word that everything is monochrome. What it's even less aware of is that the subject itself is monochrome, i.e. this whole virtual place is actually monochrome, and the avatar is monochrome, too.

Next, what does my avatar look like? Gender? Skin? Hair? Clothes?

Then there's that thing on the right. LLaVA doesn't even mention that this thing is there.

It doesn't mention the sign to the left, it doesn't mention the cliff at the end of the path, it doesn't mention the mountains in the background, and it's unaware of both the bit of sky near the top edge and the large building hidden behind the trees.

And it does not transcribe even one single bit of text in this image.

And now for what I think should really be in the description, but what no AI will ever be able to describe from looking at an image like this one.

A good image description should mention where an image was taken. AIs can currently only tell that when they're fed famous landmarks. No AI will be able to tell anytime soon, just from looking at this image, that it was taken at the central crossroads at Black White Castle, a sim in the OpenSim-based Pangea Grid. And I'm not even talking about explaining OpenSim, grids and all that to people who don't know what they are.

Speaking of which, the object to the right. LLaVA completely ignores it. However, it should be able not only to correctly identify it as an OpenSimWorld beacon, but also to describe what it looks like and explain to the reader what an OpenSimWorld beacon is, what OpenSimWorld is etc., because it should know that this cannot be expected to be common knowledge. My own description does that in roughly 5,000 characters.

And LLaVA should transcribe what's written on the touch screen which it should correctly identify as a touch screen. It should also mention the sign on the left and transcribe what's written on it.

In fact, all text anywhere within the borders of the picture should be transcribed 100% verbatim. Since there's no rule against transcribing text that's so small that it's illegible, or so tiny that it's practically invisible, or that's partially obscured or partially out of frame, a good AI should be capable of transcribing such text 100% verbatim in its entirety as well. Unless text is too small for me to read in-world, I can do that, and I do.

And how about not only knowing that the advertising board is an advertising board, but also mentioning and describing what's on it? Technically speaking, there's actually a lot of text on that board, and in order to transcribe it, its context needs to be described. That said, I must admit I was sloppy myself and omitted a whole lot of transcriptions in my own description.

Still, AI has a very very long way to go. And it will never fully get there.

#Long #LongPost #CWLong #CWLongPost #AltText #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA

(2/3)

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA #AIVsHuman #HumanVsAI
2024-06-28
@Robert Kingett, blind I might be an extremely rare exception, and I probably am. But I seem to be one of the very few human image describers who beat any AI out there not only in accuracy, but also in level of detail and informativity.

Granted, that's easy for me to do. For one, the AI only has the image, whereas I can examine the real deal from up close. Besides, describing and explaining my images accurately requires extreme niche knowledge, sometimes up-to-date by mere days. No AI has this knowledge.

Oh, and by the way, I have actually let LLaVA describe two of my images which I had manually described first. I've posted the AI description separately and then reviewed it.

First image:
  • My original post, incl. short visual description in alt-text (382 characters of description, 920 characters altogether) and full and detailed description in the post (25,271 characters)
  • The image posted again, incl. my own short visual description in alt-text (382 characters of description, 1,064 characters altogether), my own full and detailed description in the post (still 25,271 characters) and a description generated by LLaVA (558 characters).
  • In the same thread as the link above, my detailed review of LLaVA's description, pointing out the mistakes it has made

Second image:

I guess it should be clear that no AI can do in 30 seconds what took me up to two days.

(1/3)

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA #AIVsHuman #HumanVsAI
