Comment: WCAG Accessibility Guidelines - Accessing a "virtual" meeting using a teleconferencing application such as Zoom - generating accurate text content from Rod Macdonald on 2021-02-25 (public-agwg-comments@w3.org from March 2021)

From: Rod Macdonald <rjmacdonald@hawaiiantel.net>
Date: Wed, 24 Feb 2021 20:31:06 -1000
To: public-agwg-comments@w3.org
To: Accessibility Guidelines Working Group
Cc: rjmacdonald@hawaiiantel.net
Message-ID: <20210225063106.rjmacdonald@hawaiiantel.net>

From: Rod Macdonald (rjmacdonald@hawaiiantel.net)

Re: W3C Accessibility Guidelines (WCAG) 3.0 - W3C First Public Working Draft 21 January 2021

Issue: Accessing a "virtual" meeting using a teleconferencing application such as Zoom - generating accurate text content

Background: For purposes of this discussion, individuals who are Deaf-Blind can be grouped into four sub-groups:

(1) The Deaf-Blind individual retains sufficient residual hearing, with amplification and/or other enhancements, to access web content in the manner of a hearing person.

(2) The Deaf-Blind individual retains sufficient residual vision, with screen magnification and/or other enhancements, to access web content in the manner of a sighted person.

(3) The Deaf-Blind individual cannot access web content via speech or hearing, but can do so using braille.

(4) The Deaf-Blind individual cannot access web content using vision, hearing or braille, and thus cannot access web content at all. (There may be extremely rare cases when the use of unusual technology may circumvent this.)

This discussion refers exclusively to Deaf-Blind individuals in the third group - braille users.

Problem: At present there are two ways in which a person can access virtual meeting content, absent the ability to hear the discussion or follow the proceedings via visual means such as a sign language interpreter: streamed text or a subsequent text transcript. This issue deals with creation of the text to be streamed or encapsulated in a transcript.

A "live" captioner, using technology similar to that of a court reporter, can and does include additional information to make the text more meaningful. In my experience, however, some practitioners of this art are not up to the task, and standards would be helpful.

Automated speech-to-text applications are a work in progress and need considerable improvement in order to be truly accessible. Generally, these apps focus on generating an accurate text reproduction of the words spoken. Significant progress has been made in this area. The problem lies in capturing the context:

(1) Who is speaking?

A live captioner usually provides speaker identification. Captioned text frequently is displayed on-screen in the vicinity of the speaker, making it clear who is speaking - if you can see the screen. A hearing person can usually identify the speaker from their voice. Automated speech-to-text output, however, usually is streamed as a continuous moving line of text at the bottom of the screen, and knowing who is speaking can be problematic. Observing meeting participants on-screen can help. But for a person relying entirely on a stream of text, it is often impossible to know the identity of the speaker, leading to erroneous assumptions and incorrect understanding of an individual's spoken comments and beliefs. AI should be able to assign speaker identity in some manner - if not the speaker's actual name, then at least a virtual identity, such as a sequence number assigned to each individual speaking, or similar.

(2) When is there a change of speaker?

Along with the issues raised above, it is very important to one's understanding of meeting content to know when there is a change of speaker. A solution would be for the speech-to-text application to at least insert a hard line break when a new speaker emerges.
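
To illustrate points (1) and (2), here is a minimal sketch in Python of how streamed text might be labeled and broken up by speaker. It assumes the speech-to-text application can already tag each recognized fragment with some per-voice identifier; the Segment type, its voice_id and name fields, and the format_transcript function are hypothetical names invented for this example and do not correspond to any real product or API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Segment:
    """One recognized fragment of speech (hypothetical structure)."""
    voice_id: str                # assumed opaque per-voice ID from the recognizer
    text: str                    # the words recognized for this fragment
    name: Optional[str] = None   # the speaker's actual name, if known


def format_transcript(segments: Iterable[Segment]) -> str:
    """Label each speaker and start a new line at every change of speaker."""
    labels: Dict[str, str] = {}       # voice_id -> label shown to the reader
    lines: List[str] = []
    current: Optional[str] = None     # voice_id of whoever spoke last

    for seg in segments:
        if seg.voice_id not in labels:
            # Prefer the actual name; otherwise fall back to a sequence
            # number based on the order in which voices were first heard.
            labels[seg.voice_id] = seg.name or f"Speaker {len(labels) + 1}"

        if seg.voice_id != current:
            # Change of speaker: hard line break plus the speaker's label.
            lines.append(f"{labels[seg.voice_id]}: {seg.text}")
            current = seg.voice_id
        else:
            # Same speaker continues: stay on the same line.
            lines[-1] += " " + seg.text

    return "\n".join(lines)


meeting = [
    Segment("a", "Good morning."),
    Segment("a", "Let us begin."),
    Segment("b", "I have a question.", name="Rod"),
    Segment("a", "Go ahead."),
]
print(format_transcript(meeting))
```

Each line of the resulting text begins with either a name or a sequence label, so a braille reader can tell both who is speaking and where one speaker's remarks end and the next begin.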

(3) What is the meaning of the words spoken?

Consider this: The words "WOMAN WITHOUT HER MAN IS A SAVAGE" are spoken.

Do you mean "Woman, without her man, is a savage."?

Or perhaps "Woman: Without her, man is a savage."?

Or even "Woman without! Her man is a savage!!!"?

The meaning of the words spoken is conveyed via cadence, tone of voice, pitch, pauses in speech, facial expression and body language. Capturing the true meaning of speech translated to text is a major issue.

These issues need to be addressed in order to make speech-to-text applications truly accessible.

Received on Monday, 1 March 2021 14:42:29 UTC