RE: AI alt text, Google Gemini and TalkBack

To Jason 

Thanks for that. Could you point me in the direction of documentation on VoiceOver on iPhone using machine-learning-generated alt text? Before posting, my team and I were looking to confirm whether this was available and could not find any documentation, and Google claims this is the first mobile screen reader integration where it's possible to choose between alt text that's already there and generative AI, so it would be good to get a better understanding of the VoiceOver option. I'd also be interested to know whether the other platforms produce different AI-generated alt text each time the same image is checked, as Google Gemini does, as the accuracy and consistency of AI-generated alt text is something I think is important to our Note.

Thank you 

Scott. 

Dr Scott Hollier    
Chief Executive Officer   
  
Centre For Accessibility Australia Ltd. 
Phone: +61 (0)430 351 909
Email: scott.hollier@accessibility.org.au 
Address: Suite 5, Belmont Hub, 213 Wright Street, Cloverdale WA 6105 
accessibility.org.au 
Subscribe to our newsletter

CFA Australia respectfully acknowledges the Traditional Owners of Country across Australia and pays its respects to Elders past and present. 

-----Original Message-----
From: Jason J.G. White <jason@jasonjgw.net> 
Sent: Wednesday, 30 October 2024 12:48 AM
To: public-rqtf@w3.org
Subject: Re: AI alt text, Google Gemini and TalkBack


On 29/10/24 05:14, Scott Hollier wrote:
> Continuing the discussion on AI,  here at the Centre we’ve been 
> testing the Android 15 update that now embeds Google Gemini alt text 
> assessment into the TalkBack screen reader. I think this may be the 
> first time AI alt text has been directly built into a screen reader, 
> so it’s been of considerable interest for us.

Note that Apple has had image recognition in its screen readers for several years, but it's on-device machine learning, which doesn't run the most advanced models yet.

Vispero announced image description capabilities using multiple large language models earlier this year, which are included in its JAWS screen reader. For NVDA, I think the capability requires an add-on to be installed. Under Linux, if I recall correctly, someone implemented this as a separate tool - I wasn't paying attention to the details.

I think Vispero's approach is interesting, in that it lets you access multiple descriptions created by different models. I don't know to what extent this is useful in detecting errors - it's presumably better as long as they don't all make the same mistake.

I also expect on-device recognition to become more popular for privacy reasons and as the local hardware is upgraded to be able to run larger models. The models themselves may become more efficient over time as well, but that's just my speculation; I don't have any background in the mathematics of neural networks.

Whatever we say about this will need to be generic and model-neutral.

On a related topic, there were claims attracting media attention recently to the effect that a speech recognition model could generate erroneous text, including completely fabricated sentences, so one should be careful in using such models for captions or transcripts (as we already well knew).

Received on Wednesday, 30 October 2024 00:59:59 UTC