RE: [EXTERNAL] Natural language interfaces and conversational agents from John Paton on 2021-03-04 (public-rqtf@w3.org from March 2021)

From: John Paton <John.Paton@rnib.org.uk>
Date: Thu, 4 Mar 2021 10:05:01 +0000
To: "White, Jason J" <jjwhite@ets.org>, "public-rqtf@w3.org" <public-rqtf@w3.org>
Message-ID: <CWXP265MB3541196A42DF48D13EDA4F54C0979@CWXP265MB3541.GBRP265.PROD.OUTLOOK.COM>
Hi Jason,

I think either Shadi or Michael suggested that the scope could be written out with separate subdomains, so I’ve had a go at doing that with your points below. Once populated, it may help carve up the work into manageable chunks. Setting out the overall scope of the work (and then announcing the segment of the scope the group intends to work on first) could also set the conceptual anchor that Judy suggested, whilst not committing us to working on everything at once. Working on a sub-segment may make it harder to get the crisp scope that Michael petitioned for, but I think it should be possible.

It also highlights that a large number of the points raised so far are cognitive, which suggests that we need input from COGA.

One point from the meeting was that voice agents are already in the wild and need accessibility guidance. I understand that they will fail the general accessibility requirements of the W3C, but if we refuse to work on the specific accessibility requirements of voice interaction then we are not helping advance the accessibility of those devices. Whether it is the first area we work on or one we drill down into later, I think it’s important that at some point we address accessibility issues relating to a solely speech-based interaction. Otherwise I won’t be doing my job as an advocate of blind and partially sighted people. Braille is used by a minority of people with sight loss, so it is a nonsense to say that a purely speech-based interaction is inaccessible. For some of the people I represent it is the accessibility ideal. I guess there is a difference between specific accessibility (does it work for a person with particular needs) and general accessibility (can it be used by everyone).

I’ve mainly just clumped Jason’s words together so a sense check to ensure they are in the right place and still make sense in a new context would be welcome.

Smart agents

  *   Sensory issues: the need to support multiple output modalities for the natural language interface (visual, auditory, braille/tactile), either directly or via assistive technologies. Whether a generic text input/output interface in the style of an IRC client or instant messaging application would suffice to satisfy these requirements, given the availability of assistive technologies.
  *   User identification and authentication: how the system can ascertain who is interacting with it. Speaker identification may be feasible if speech input is used. The authentication features of the underlying platform/operating system (e.g., biometrics other than voice) would presumably need to be supported as well, so that there are multiple mechanisms of authentication available. If the system is accessed via a Web page, then presumably the standard Web-based authentication mechanisms can be used; but there are more issues for stand-alone hardware devices or mobile applications in providing accessible authentication methods.

Voice interaction

  *   For spoken output of the natural language interaction: what the user should be able to control (e.g., speech rate, volume, choice of voices, etc.).
  *   For speech input: accurate recognition of speakers who have different speech characteristics (e.g., due to having a disability). How the system should respond when low confidence in the speech recognition is detected (e.g., by prompting for information to be repeated or asking the user for confirmation).
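
A minimal sketch of the low-confidence behaviour described in the speech input point above, offered only as an illustration. It assumes the recogniser reports a confidence score between 0.0 and 1.0; the `Recognition` type, the `respond` function, and both thresholds are hypothetical and not taken from any particular product or W3C guidance.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
CONFIRM_BELOW = 0.80  # below this, ask the user to confirm
REPEAT_BELOW = 0.50   # below this, ask the user to repeat


@dataclass
class Recognition:
    text: str          # what the recogniser thinks was said
    confidence: float  # assumed to lie between 0.0 and 1.0


def respond(result: Recognition) -> str:
    """Choose a response strategy from the recogniser's confidence score."""
    if result.confidence < REPEAT_BELOW:
        # Too uncertain to guess: prompt for the information to be repeated.
        return "Sorry, I didn't catch that. Could you say it again?"
    if result.confidence < CONFIRM_BELOW:
        # Plausible but uncertain: ask the user for confirmation.
        return f'Did you say "{result.text}"?'
    # Confident enough to act without interrupting the user.
    return f"OK: {result.text}"
```

In practice the thresholds would probably need to be adjustable per user, since a fixed cut-off could force speakers with atypical speech through repeated confirmation prompts.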

Text interaction

  *   For visual output of the natural language interaction: what the user should be able to control (e.g., font size, text spacing, and other style properties of displayed text).
  *   Input issues: support for multiple input modes (keyboard, switch, eye tracking, speech, etc.), either directly or via assistive technologies. Whether an IRC/instant messaging-style interaction is sufficient to satisfy these requirements, given the availability of assistive technologies.
  *   For text input: perhaps some error-handling issues (e.g., spelling errors) should be discussed. What else should be addressed here?
  *   Where grammar and word order follow common but non-standard patterns (such as English text in a British Sign Language word order), should these be options?
  *   Cognitive: For visual output, scroll-back support so that the user can review the entire conversation/interaction would seem useful. Even if speech input and output are used, a textually displayed log of the conversation could still be beneficial (e.g., presented on screen or via a braille device). The log should clearly distinguish the user’s input from the system’s output.
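
To illustrate the conversation log mentioned in the last point above, here is a minimal sketch, assuming a plain-text log in which every turn carries an explicit speaker label. The class names and the "You"/"Assistant" labels are hypothetical. Text labels are used rather than colour or layout so that the distinction between user input and system output survives on a screen reader or braille display.

```python
from dataclasses import dataclass, field

# Hypothetical display labels for the two parties in the conversation.
LABELS = {"user": "You", "system": "Assistant"}


@dataclass
class Turn:
    speaker: str  # "user" or "system"
    text: str


@dataclass
class ConversationLog:
    turns: list = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        """Record one turn, rejecting speakers the log cannot label."""
        if speaker not in LABELS:
            raise ValueError(f"unknown speaker: {speaker}")
        self.turns.append(Turn(speaker, text))

    def render(self) -> str:
        """Return the whole conversation, one labelled line per turn."""
        return "\n".join(f"{LABELS[t.speaker]}: {t.text}" for t in self.turns)


log = ConversationLog()
log.add("user", "What time is my appointment?")
log.add("system", "Your appointment is at 3 pm.")
print(log.render())
```

Because the log is kept independently of the modality used for each turn, the same record could be shown on screen, sent to a braille device, or read back on request.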

Graphical interfaces and other

  *   For graphical output generated by the system that is not part of the natural language interaction (e.g., maps, interactive Web pages, etc., displayed by the application in response to the user’s request) – we should probably refer to existing guidelines and indicate that only the natural language interaction itself is within scope here. This seems on first analysis to be a reasonable scope boundary. Also, if the natural language interface is part of a telephony application or similar service, perhaps RAUR could be referred to as well.
  *   Whether sign language input or AAC symbol input can be supported, given the current state of technology (possibly different answers depending on the circumstances).
  *   For multimodal systems that support digital pen input or other forms of graphical input (e.g., for working with diagrams or for handwriting recognition), support for recognizing input provided by people with motor-related disabilities would be important, and this doesn’t seem to be addressed elsewhere in W3C guidance. Some systems, for example, offer a combination of speech input and pen input. On the other hand, we could argue that since the pen input isn’t strictly part of the natural language processing, it’s out of scope for purposes of the present project.
  *   Whether AAC symbols or sign language could be used for output – possibly infeasible in the short term due to the unsolved research problems involved, at least for sign languages.

Cognitive issues

  *   Cognitive: issues of discoverability – how the user knows what sentences/utterances the system will accept at any point during the interaction. Availability of help information. Inclusion of hints/prompts/suggestions in the system’s output to assist the user in knowing what can be done next. The use of menus of options to guide the user’s decisions during an interactive session.
  *   Cognitive: reminding the user of the context and of previously provided information. We need more analysis of the requirements here. The ability for the user to request that information be repeated would also assist with memory-related issues, especially if speech output is used and the interaction is not displayed visually.
  *   Cognitive: access to glossary definitions and explanations at the user’s request. The option for the user to request spelling of names or other words if speech output is used would also be helpful.
  *   Cognitive: the option for the user to request reminders of upcoming events relevant to the system’s operation (e.g., calendar appointments). Reminders and alerts would need to be multimodal (e.g., auditory, visual, vibratory/haptic) as well.
  *   Cognitive: support for configuring the system to provide simpler language, perhaps an interface with fewer options/capabilities which is restricted to only the features that the particular user needs.
  *   Cognitive: support for a variety of vocabulary and a variety of ways of issuing the same request or providing the same information – that is, flexibility in handling a wide diversity of natural language sentences/utterances that users may give as input to the system. The ability to handle repeated information.
  *   Cognitive: the ability for the user to correct errors, and how this should be supported – more work is obviously needed here.
  *   Cognitive: keeping track of the context of a conversation as a dialogue with the user progresses and of previously supplied information. This is a research problem in natural language processing, and it isn’t clear what the accessibility requirements should be here. Users are likely to expect to be able to depend on or refer to aspects of the context as an interaction progresses, and this may be especially important for those with learning or cognitive disabilities.



Notes
Relationship with hardware capabilities: natural language-based interfaces can occur in a variety of contexts – as stand-alone hardware devices such as “smart speakers” and consumer appliances, as applications running on mobile phones and tablets, in wearable computing devices, on desktop and laptop systems, as components of Web pages/applications, via telephony/RTC-based applications, etc. The same system may be available via multiple means (e.g., in dedicated hardware or via the Web according to the user’s preference). Different modalities and different accessibility features may be available depending on the platform used to interact with the natural language interface. Should we say that the accessibility requirements apply to the software system/natural language interface, but that they will be supported in different ways and to a different extent depending on the platform?

Note that many of the foregoing issues are modality-independent, and that cognitive considerations have a large role. Further, restricting the scope of the work to the natural language interaction itself – citing other sources of guidance concerning the accessibility of other aspects of the overall system – seems reasonable in order to keep the requirements-gathering effort suitably confined.


Thanks,

John

From: White, Jason J <jjwhite@ets.org>
Sent: 03 March 2021 17:21
To: public-rqtf@w3.org
Subject: RE: [EXTERNAL] Natural language interfaces and conversational agents

At the meeting today, it was agreed we should attempt a preliminary classification of the issues that should be addressed within the scope of this topic. Based on the conversations that have taken place so far, and after reflecting on the matter, here is my first approximation.

Sensory issues: the need to support multiple output modalities for the natural language interface (visual, auditory, braille/tactile), either directly or via assistive technologies. Whether a generic text input/output interface in the style of an IRC client or instant messaging application would suffice to satisfy these requirements, given the availability of assistive technologies. Whether AAC symbols or sign language could be used for output – possibly infeasible in the short term due to the unsolved research problems involved, at least for sign languages.

For visual output of the natural language interaction: what the user should be able to control (e.g., font size, text spacing, and other style properties of displayed text).

For spoken output of the natural language interaction: what the user should be able to control (e.g., speech rate, volume, choice of voices, etc.).

For graphical output generated by the system that is not part of the natural language interaction (e.g., maps, interactive Web pages, etc., displayed by the application in response to the user’s request) – we should probably refer to existing guidelines and indicate that only the natural language interaction itself is within scope here. This seems on first analysis to be a reasonable scope boundary. Also, if the natural language interface is part of a telephony application or similar service, perhaps RAUR could be referred to as well.

Input issues: support for multiple input modes (keyboard, switch, eye tracking, speech, etc.), either directly or via assistive technologies. Whether an IRC/instant messaging-style interaction is sufficient to satisfy these requirements, given the availability of assistive technologies. Whether sign language input or AAC symbol input can be supported, given the current state of technology (possibly different answers depending on the circumstances).

For speech input: accurate recognition of speakers who have different speech characteristics (e.g., due to having a disability). How the system should respond when low confidence in the speech recognition is detected (e.g., by prompting for information to be repeated or asking the user for confirmation).

For multimodal systems that support digital pen input or other forms of graphical input (e.g., for working with diagrams or for handwriting recognition), support for recognizing input provided by people with motor-related disabilities would be important, and this doesn’t seem to be addressed elsewhere in W3C guidance. Some systems, for example, offer a combination of speech input and pen input. On the other hand, we could argue that since the pen input isn’t strictly part of the natural language processing, it’s out of scope for purposes of the present project.

For text input: perhaps some error-handling issues (e.g., spelling errors) should be discussed. What else should be addressed here?

Cognitive: issues of discoverability – how the user knows what sentences/utterances the system will accept at any point during the interaction. Availability of help information. Inclusion of hints/prompts/suggestions in the system’s output to assist the user in knowing what can be done next. The use of menus of options to guide the user’s decisions during an interactive session.

Cognitive: reminding the user of the context and of previously provided information. We need more analysis of the requirements here. The ability for the user to request that information be repeated would also assist with memory-related issues, especially if speech output is used and the interaction is not displayed visually. For visual output, scroll-back support so that the user can review the entire conversation/interaction would seem useful. Even if speech input and output are used, a textually displayed log of the conversation could still be beneficial (e.g., presented on screen or via a braille device). The log should clearly distinguish the user’s input from the system’s output.

Cognitive: access to glossary definitions and explanations at the user’s request. The option for the user to request spelling of names or other words if speech output is used would also be helpful.

Cognitive: the option for the user to request reminders of upcoming events relevant to the system’s operation (e.g., calendar appointments). Reminders and alerts would need to be multimodal (e.g., auditory, visual, vibratory/haptic) as well.

Cognitive: support for configuring the system to provide simpler language, perhaps an interface with fewer options/capabilities which is restricted to only the features that the particular user needs.

Cognitive: support for a variety of vocabulary and a variety of ways of issuing the same request or providing the same information – that is, flexibility in handling a wide diversity of natural language sentences/utterances that users may give as input to the system. The ability to handle repeated information.

Cognitive: the ability for the user to correct errors, and how this should be supported – more work is obviously needed here.

Cognitive: keeping track of the context of a conversation as a dialogue with the user progresses and of previously supplied information. This is a research problem in natural language processing, and it isn’t clear what the accessibility requirements should be here. Users are likely to expect to be able to depend on or refer to aspects of the context as an interaction progresses, and this may be especially important for those with learning or cognitive disabilities.

User identification and authentication: how the system can ascertain who is interacting with it. Speaker identification may be feasible if speech input is used. The authentication features of the underlying platform/operating system (e.g., biometrics other than voice) would presumably need to be supported as well, so that there are multiple mechanisms of authentication available. If the system is accessed via a Web page, then presumably the standard Web-based authentication mechanisms can be used; but there are more issues for stand-alone hardware devices or mobile applications in providing accessible authentication methods.

Relationship with hardware capabilities: natural language-based interfaces can occur in a variety of contexts – as stand-alone hardware devices such as “smart speakers” and consumer appliances, as applications running on mobile phones and tablets, in wearable computing devices, on desktop and laptop systems, as components of Web pages/applications, via telephony/RTC-based applications, etc. The same system may be available via multiple means (e.g., in dedicated hardware or via the Web according to the user’s preference). Different modalities and different accessibility features may be available depending on the platform used to interact with the natural language interface. Should we say that the accessibility requirements apply to the software system/natural language interface, but that they will be supported in different ways and to a different extent depending on the platform?

Note that many of the foregoing issues are modality-independent, and that cognitive considerations have a large role. Further, restricting the scope of the work to the natural language interaction itself – citing other sources of guidance concerning the accessibility of other aspects of the overall system – seems reasonable in order to keep the requirements-gathering effort suitably confined.

What issues have I missed?
How reasonable is the scope?
Corrections, refinements, and objections are all welcome.

Regards,

Jason.


Received on Thursday, 4 March 2021 10:05:24 UTC