RE: [EXTERNAL] Natural language interfaces and conversational agents

As an addendum to my analysis from earlier today: the following question has arisen in other work to which I’ve contributed in the past, and I don’t have a well-informed answer to it.
Assuming that a natural language interface supports both speech input/output and text input/output, how important is it for the user to be able to switch between these modes during an interaction, rather than deciding on one or the other at the outset and not having the option to alter this choice until the interactive session has ended? For example, suppose the user can either interact with the system textually via a Web page in a manner similar to an instant messaging system, or activate a button that starts a WebRTC voice session, but cannot switch from one to the other until a new interactive session is started. To what extent would this be an accessibility limitation?

From: White, Jason J <jjwhite@ets.org>
Sent: Wednesday, 3 March 2021 12:21
To: public-rqtf@w3.org
Subject: RE: [EXTERNAL] Natural language interfaces and conversational agents

At the meeting today, it was agreed we should attempt a preliminary classification of the issues that should be addressed within the scope of this topic. Based on the conversations that have taken place so far, and after reflecting on the matter, here is my first approximation.

Sensory issues: the need to support multiple output modalities for the natural language interface (visual, auditory, braille/tactile), either directly or via assistive technologies. Whether a generic text input/output interface in the style of an IRC client or instant messaging application would suffice to satisfy these requirements, given the availability of assistive technologies. Whether AAC symbols or sign language could be used for output – possibly infeasible in the short term due to the unsolved research problems involved, at least for sign languages.

For visual output of the natural language interaction: what the user should be able to control (e.g., font size, text spacing, and other style properties of displayed text).

For spoken output of the natural language interaction: what the user should be able to control (e.g., speech rate, volume, choice of voices, etc.).

For graphical output generated by the system that is not part of the natural language interaction (e.g., maps, interactive Web pages, etc., displayed by the application in response to the user’s request) – we should probably refer to existing guidelines and indicate that only the natural language interaction itself is within scope here. This seems on first analysis to be a reasonable scope boundary. Also, if the natural language interface is part of a telephony application or similar service, perhaps RAUR could be referred to as well.

Input issues: support for multiple input modes (keyboard, switch, eye tracking, speech, etc.), either directly or via assistive technologies. Whether an IRC/instant messaging-style interaction is sufficient to satisfy these requirements, given the availability of assistive technologies. Whether sign language input or AAC symbol input can be supported, given the current state of technology (possibly different answers depending on the circumstances).

For speech input: accurate recognition of speakers who have different speech characteristics (e.g., due to having a disability). How the system should respond when low confidence in the speech recognition is detected (e.g., by prompting for information to be repeated or asking the user for confirmation).
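
As a rough illustration of the repeat-or-confirm behaviour mentioned above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names RecognitionResult, next_action and prompt_for are hypothetical, the recognizer is assumed to report a confidence score between 0 and 1, and the two thresholds are arbitrary values that would need tuning for a real recognizer and for individual users.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds (assumptions, not recommended values).
CONFIRM_THRESHOLD = 0.85  # at or above this, accept the transcript as-is
REPEAT_THRESHOLD = 0.50   # below this, ask the user to repeat


@dataclass
class RecognitionResult:
    transcript: str
    confidence: float  # assumed to lie in the range 0.0 to 1.0


def next_action(result: RecognitionResult) -> str:
    """Return "accept", "confirm" or "repeat" for one recognized utterance."""
    if result.confidence >= CONFIRM_THRESHOLD:
        return "accept"
    if result.confidence >= REPEAT_THRESHOLD:
        return "confirm"
    return "repeat"


def prompt_for(result: RecognitionResult) -> Optional[str]:
    """Return the prompt to present to the user, or None if none is needed."""
    action = next_action(result)
    if action == "confirm":
        return f'Did you say "{result.transcript}"?'
    if action == "repeat":
        return "Sorry, I did not catch that. Could you please repeat it?"
    return None
```

A middle band that asks for confirmation, rather than always asking for repetition, may reduce the effort required of users whose speech is recognized imperfectly but consistently; whether that holds in practice would need to be tested with users.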

For multimodal systems that support digital pen input or other forms of graphical input (e.g., for working with diagrams or for handwriting recognition), support for recognizing input provided by people with motor-related disabilities would be important, and this doesn’t seem to be addressed elsewhere in W3C guidance. Some systems, for example, offer a combination of speech input and pen input. On the other hand, we could argue that since the pen input isn’t strictly part of the natural language processing, it’s out of scope for purposes of the present project.

For text input: perhaps some error-handling issues (e.g., spelling errors) should be discussed. What else should be addressed here?

Cognitive: issues of discoverability – how the user knows what sentences/utterances the system will accept at any point during the interaction. Availability of help information. Inclusion of hints/prompts/suggestions in the system’s output to assist the user in knowing what can be done next. The use of menus of options to guide the user’s decisions during an interactive session.

Cognitive: reminding the user of the context and of previously provided information. We need more analysis of the requirements here. The ability for the user to request that information be repeated would also assist with memory-related issues, especially if speech output is used and the interaction is not displayed visually. For visual output, scroll-back support so that the user can review the entire conversation/interaction would seem useful. Even if speech input and output are used, a textually displayed log of the conversation could still be beneficial (e.g., presented on screen or via a braille device). The log should clearly distinguish the user’s input from the system’s output.
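
The conversation log and the "repeat that" request described above could be modelled along the following lines. This is only a sketch under stated assumptions: the class names Turn and ConversationLog, the "You"/"System" labels, and the method names are all hypothetical and not drawn from any existing system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    speaker: str  # either "user" or "system"
    text: str


@dataclass
class ConversationLog:
    turns: List[Turn] = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        """Append one turn, rejecting unknown speaker roles."""
        if speaker not in ("user", "system"):
            raise ValueError(f"unknown speaker: {speaker!r}")
        self.turns.append(Turn(speaker, text))

    def render(self) -> str:
        """Return the whole conversation as plain text, one labelled turn per line.

        Plain text with explicit labels can be shown on screen, reviewed by
        scrolling back, or passed to a screen reader or braille display.
        """
        labels = {"user": "You", "system": "System"}
        return "\n".join(f"{labels[t.speaker]}: {t.text}" for t in self.turns)

    def last_system_output(self) -> Optional[str]:
        """Return the most recent system turn, to support a repeat request."""
        for turn in reversed(self.turns):
            if turn.speaker == "system":
                return turn.text
        return None
```

For example, after log.add("user", "What time is my appointment?") and log.add("system", "Your appointment is at 3 pm."), log.render() yields two labelled lines and log.last_system_output() returns the system's answer so that it can be spoken or displayed again.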

Cognitive: access to glossary definitions and explanations at the user’s request. The option for the user to request spelling of names or other words if speech output is used would also be helpful.

Cognitive: the option for the user to request reminders of upcoming events relevant to the system’s operation (e.g., calendar appointments). Reminders and alerts would need to be multimodal (e.g., auditory, visual, vibratory/haptic) as well.

Cognitive: support for configuring the system to provide simpler language, and perhaps an interface with fewer options/capabilities, restricted to only the features that the particular user needs.

Cognitive: support for a variety of vocabulary and a variety of ways of issuing the same request or providing the same information – that is, flexibility in handling a wide diversity of natural language sentences/utterances that users may give as input to the system. The ability to handle repeated information.

Cognitive: the ability for the user to correct errors, and how this should be supported – more work is obviously needed here.

Cognitive: keeping track of the context of a conversation as a dialogue with the user progresses and of previously supplied information. This is a research problem in natural language processing, and it isn’t clear what the accessibility requirements should be here. Users are likely to expect to be able to depend on or refer to aspects of the context as an interaction progresses, and this may be especially important for those with learning or cognitive disabilities.

User identification and authentication: how the system can ascertain who is interacting with it. Speaker identification may be feasible if speech input is used. The authentication features of the underlying platform/operating system (e.g., biometrics other than voice) would presumably need to be supported as well, so that there are multiple mechanisms of authentication available. If the system is accessed via a Web page, then presumably the standard Web-based authentication mechanisms can be used; but there are more issues for stand-alone hardware devices or mobile applications in providing accessible authentication methods.

Relationship with hardware capabilities: natural language-based interfaces can occur in a variety of contexts – as stand-alone hardware devices such as “smart speakers” and consumer appliances, as applications running on mobile phones and tablets, in wearable computing devices, on desktop and laptop systems, as components of Web pages/applications, via telephony/RTC-based applications, etc. The same system may be available via multiple means (e.g., in dedicated hardware or via the Web according to the user’s preference). Different modalities and different accessibility features may be available depending on the platform used to interact with the natural language interface. Should we say that the accessibility requirements apply to the software system/natural language interface, but that they will be supported in different ways and to a different extent depending on the platform?

Note that many of the foregoing issues are modality-independent, and that cognitive considerations play a large role. Further, restricting the scope of the work to the natural language interaction itself – citing other sources of guidance concerning the accessibility of other aspects of the overall system – seems reasonable in order to keep the requirements-gathering effort suitably confined.

What issues have I missed?
How reasonable is the scope?
Corrections, refinements, and objections are all welcome.

Regards,

Jason.


________________________________

This e-mail and any files transmitted with it may contain privileged or confidential information. It is solely for use by the individual for whom it is intended, even if addressed incorrectly. If you received this e-mail in error, please notify the sender; do not disclose, copy, distribute, or take any action in reliance on the contents of this information; and delete it from your system. Any other use of this e-mail is prohibited.


Thank you for your compliance.

________________________________

Received on Wednesday, 3 March 2021 20:37:23 UTC