- From: Jason J.G. White <jason@jasonjgw.net>
- Date: Tue, 29 Oct 2024 12:47:59 -0400
- To: public-rqtf@w3.org
On 29/10/24 05:14, Scott Hollier wrote:
> Continuing the discussion on AI, here at the Centre we've been
> testing the Android 15 update that now embeds Google Gemini alt text
> assessment into the TalkBack screen reader. I think this may be the
> first time AI alt text has been directly built into a screen reader,
> so it's been of considerable interest for us.

Note that Apple has had image recognition in its screen readers for several years, but it's on-device machine learning, which doesn't yet run the most advanced models. Vispero announced image description capabilities earlier this year, using multiple large language models, which are included in their JAWS screen reader. For NVDA, I think the capability requires an add-on to be installed. Under Linux, if I recall correctly, someone implemented this as a separate tool - I wasn't paying attention to the details.

I think Vispero's approach is interesting in that it lets you access multiple descriptions created by different models. I don't know to what extent this is useful in detecting errors - it's presumably better, as long as the models don't all make the same mistake. A rough sketch of how such cross-checking might work appears at the end of this message.

I also expect on-device recognition to become more popular, both for privacy reasons and as local hardware is upgraded to run larger models. The models themselves may become more efficient over time as well, but that's just my speculation; I don't have any background in the mathematics of neural networks. Whatever we say about this will need to be generic and model-neutral.

On a related topic, there were claims attracting media attention recently to the effect that a speech recognition model can generate erroneous text, including completely fabricated sentences, so one should be careful in using such models for captions or transcripts (as we already well knew).
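To make the cross-checking idea concrete, here is a minimal, hypothetical sketch. The model names and descriptions are placeholders (Vispero has not published how JAWS compares descriptions), and the crude lexical similarity from Python's standard-library difflib stands in for whatever comparison a real system might use, such as sentence embeddings:

```python
# Hedged sketch: cross-checking image descriptions from several models.
# Model names and descriptions below are illustrative placeholders; in
# practice they would come from whichever vision-language models the
# screen reader queries.
from difflib import SequenceMatcher
from itertools import combinations

descriptions = {
    "model_a": "A golden retriever lying on a blue sofa.",
    "model_b": "A golden-colored dog resting on a blue couch.",
    "model_c": "A cat sitting on a windowsill.",
}

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity in [0, 1]; a real system would likely
    compare semantic embeddings rather than raw text."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag pairs of descriptions that disagree strongly. If every model
# independently tells a similar story, confidence is higher -- unless,
# as noted above, they all make the same mistake.
THRESHOLD = 0.5
for (name1, text1), (name2, text2) in combinations(descriptions.items(), 2):
    score = similarity(text1, text2)
    status = "agree" if score >= THRESHOLD else "DISAGREE"
    print(f"{name1} vs {name2}: {score:.2f} ({status})")
```

In this toy example the first two descriptions agree while the third diverges, which is exactly the kind of signal a user (or the screen reader itself) could use as a cue to treat the description with caution.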
Received on Tuesday, 29 October 2024 16:48:04 UTC