Executive Summary
Introduction
Peril and Potential
The W3C Multi-Modal Interaction Architecture and Interfaces is a platform for Web Application delivery. The accessibility prospects for applications delivered over this platform resemble the construction of the Chinese character for 'crisis': it combines danger and opportunity.
- There
is a problem:
- Without strengthening the flow-down of
access-supportive requirements through the MMI Architecture, many MMI
applications will be developed in needlessly inaccessible form.
- There
is rich promise:
- Separating interaction management from input device handling makes it natural, though not obligatory, to deal in intent-based events.
- A well-crafted platform for integrating
multi-modal interaction can make it easier to integrate assistive
delivery contexts with atypical mode mixes that run standard
applications effectively, using mass-market constituents with
a few small assistive-peculiar constituents.
- Adaptation strategies learned in service to people with disabilities will make the authoring of multi-modal interaction applications significantly more portable across delivery contexts, accelerating the availability of multi-modal interaction advantages in the delivery of applications and the more widespread use of multi-modal interaction technology.
Requirement and response
Despite the advances in server-side adaptation of
the user experience,
there is a continuing requirement for MMI application authors to leave
in public interfaces enough of a record so that the interaction profile
can be
changed after the author has left the scene without loss of function.
This
can be fostered in the architecture by being careful to articulate
requirements
on the public interfaces of the separated constituents handling the
different
aspects of the overall application-delivery system.
- [requirement on capability] aftermarket
adaptation is an accessibility requirement:
The mode-use profiles of disability-adaptive delivery
contexts are too varied and infrequent for the application designer to
design for all or enough of them. Multi-Modal Applications
need to be designed 'universally' or 'for device independence' enough
so that automatic adaptation to un-designed-for delivery contexts
affords a 'functional user experience.' The framework, the
Multimodal Interaction Architecture and Interfaces, needs to do
everything readily achievable to bias the application development
environment toward the achievement of this goal.
- [strategy - product] distributed decision-making,
not just distributed computation.
The architecture currently paints the interaction manager
constituent as a knows-all, decides-all actor. There is advantage
to be gained by late binding to representation decisions, where
possible. This calls for a more in-depth development of the
separation of concerns or aspects of interaction and adaptation.
- [strategy - product] public
views capture the balance point between economical
information hiding and effective mix-and-match interoperation.
Platform architectures integrate gray boxes, not black
boxes. The MMI Architecture needs to allocate functional
requirements for the public views exported by the sub-systems
identified.
- [strategy - process] continual collaboration:
Multi-Modal Interaction prototypes should be alpha-tested for
adaptability before they are beta tested in just one targeted delivery
context. This will drive the continuing re-factorization of
the aspects of interaction, data, and input/output mediation.
Next Steps
- Read this document
- Invite WAI observers to MMI F2F and discuss with them then
- Set up joint team with UWA, IMS, ISO and WAI to address personal preferences in the Delivery Context constituent
- Set up joint team with SYMM and WAI to work on "management of channel diversity."
- Monitor Voice/Forms progress as regards common foundations at the level of data models and event models.
- Monitor
UWA progress on Intent Based Events and prepare to generalize to a
framework for intent-based expression of gesture-catches as expressed
in EMMA.
Use Cases
Threshold: achieve a functional user experience when transcoding commercial applications to support conventional access strategies.
GUI-like applications:
There will be a class of applications targeted to delivery-context classes with large enough graphic displays and some sort of keyset, where the accessibility strategy is the same as it would be for "point and click" oriented GUI applications. Here, anything you can do with any modality gesture can be accomplished with some sequence of key actions, and anything that is displayed to the user can be displayed by a textual representation. Sometimes the keystrokes take you on a slightly different path through the application-state possibilities, and sometimes the text is a separate parallel object to the default presented object, but the reachable task-level end goals are the same and the sense of the media objects is conveyed in the text alternatives.
Dialog-oriented applications:
In addition to applications that are close enough to GUI applications in their organization to be access-enabled in the same ways, there will also be a radically different kind of application. These are applications which closely resemble VoiceXML applications, sometimes termed voice-enabled Interactive Voice Response applications.
These are cases where the application can be completely operated
by listening and speaking; but alternate modalities are also enabled
for episodes in the dialog where the available additional mode
capabilities of the delivery context afford greater speed or lower
error rates.
These applications suggest a different strategy for access: the application is based in an SCXML or other interaction-management document, and the interface to input mode-adaptation is in terms of menus of choices, not strings or key events.
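As a point of reference, a minimal sketch of such a menu-of-choices step in VoiceXML 2.0; the menu wording, choice names and targets are invented for illustration:

  <menu dtmf="true">
    <prompt>Main menu. Say one of: <enumerate/></prompt>
    <choice next="#check_balance">balance</choice>
    <choice next="#transfer_funds">transfer</choice>
    <choice next="#agent">agent</choice>
  </menu>

Because the available choices are declared rather than buried in event handlers, an aftermarket adaptation can enumerate the <choice> elements and re-present them as a scanning list, a Braille menu, or signed prompts, instead of binding them only to speech and DTMF.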
Mild tremor and touch screen: likelihood-conditioned catches replace uniquely targeted events.
Assume the application is touch-screen enabled and the user has a mild tremor: they can perform touch gestures, but they generate enough false hits that the application needs to take account of the false-detect likelihood in reported UI events. Adaptive modifications in processing could be built around a substitute touch-processor that reports a short list of likelihood-rated touch targets rather than a unique target. If the application is running in the MMI Framework and is speech enabled, the alternate touch processor could report in the mode of emulated speech catches, and no further custom processing or components would be required.
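A sketch of what such a likelihood-rated report might look like as an EMMA N-best list; the target element, target names, confidence values and mode labels are invented for illustration:

  <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:one-of id="touch-nbest" emma:medium="tactile" emma:mode="touch">
      <emma:interpretation id="hit1" emma:confidence="0.55">
        <target>submit_button</target>
      </emma:interpretation>
      <emma:interpretation id="hit2" emma:confidence="0.35">
        <target>cancel_button</target>
      </emma:interpretation>
    </emma:one-of>
  </emma:emma>

Since this is the same envelope a speech recognizer uses for its N-best hypotheses, the interaction manager's existing disambiguation and confirmation logic could consume it unchanged.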
Blind Participant scribing a Working Group tele-conference with IRC chat.
Here
you would be surprised at the concurrent audio channels that one may
use. The live voice audio is one stream; the Text-to-Speech
reading of what is coming across the chat channel is another, and what
the individual themselves is typing into the chat channel by way of
notes is a third.
Some users indeed cope with these three streams of
speech going on at once. Others want time-division multiplexing
by time-compression on the live audio and interleaving of the TTS
content, again running above normal speech rate.
Concurrent use of a refreshable Braille display and audio output is common among
those who can use Braille and can hear. This is perhaps a more common example
of
concurrent channels, where the user's ability to dispose content
aspects or streams into the channel of their choice is important.
Accessibility
probably is the pacing application domain for user-directed channel
diversity use. But it won't work unless it can be applied to both
output (SYMM domain) and input (MMI domain). So please work with us and them to see how this can be achieved.
Single-switch user and On-screen Keyboard assistive technology.
Switch users represent the 'corner case' among "high cost per input
symbol" user categories for whom there is an established mode
of assistive technology.
The assistive technology animates focus among action opportunities; a switch action invokes one of: pause | doIt. Such users are best served by a menu-mode transcode of the dialog (from the interaction-management document? SCXML? VoiceXML?), as sketched below.
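A minimal sketch, assuming an SCXML interaction-management document with invented state and event names, of the structure a scanning front end could transcode into a menu; the transitions enabled in the current state are exactly the action opportunities to scan:

  <scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="main">
    <state id="main">
      <!-- each enabled transition is one scannable action opportunity -->
      <transition event="choose.balance" target="balance"/>
      <transition event="choose.transfer" target="transfer"/>
      <transition event="choose.agent" target="agent"/>
    </state>
    <state id="balance"/>
    <state id="transfer"/>
    <state id="agent"/>
  </scxml>

The on-screen keyboard or scanner steps the highlight through the three opportunities; the switch's 'doIt' raises the corresponding event, which the interaction manager handles exactly as it would a spoken or clicked selection.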
Sign Language access to Forms Mode Applications including Voice Browser Applications
The application need not be developed to be multi-modal for access to make use of the MMI Framework. An example here: a culturally Deaf individual, someone for whom sign language is their first language, could find a transcode of a voice-enabled IVR to be their best access path to that application, as opposed to a GUI WebTop version that is verbose at a high reading level in some written language. The integration of sign language gesture recognition would work equally well with forms-mode applications realized in VoiceXML or XForms.
Value-added
techniques in this case could include: binding intents to standard
concepts such as the ETSI vocabulary of commodity action concepts for
voice applications. If an intent in an application matches one of
these terms, binding it to the standard label for this concept would
facilitate auto-localization to sign language, as that translation
table would likely be available from a third party.
The
brute force approach of using text entry as the common point of
reference between the EMMA producing layer and the EMMA consuming layer
could work but would be extremely tedious because it forces the user to
a) work in a foreign, written language, and b) finger-spell everything.
Not cool. Better to use a SYSTRAN or BabelFish level of auto-translation with liberal access to confirmation and correction, as is typical in beginning use of voice command.
cross-network access as an opportunity:
Just as blind consumers frequently prefer printer-friendly or
mobile-friendly versions of websites, there is plausible suspicion that
single-switch users (and high input symbol cost users in
general), if the transcoding were available, would benefit from access
to the IVR or mobile MMI version of tasks that are available as such phone dialogs, even when they are "browsing the web" in HTTP/REST as their major mode of surfing.
Brain-Computer Interface.
Users using direct brain interfaces may have an even higher "cost per input symbol" but
the technology is largely a laboratory phenomenon at present. Not widely available to consumers.
Fuzzy catches, but more than single-switch catch space.
Personally trained speech recognition for atypical speech.
The architecture appears to handle this well as is.
Personally trained spell checker for dyslexic access to, say, search.
The text entry box is re-routed through the trained recognizer, which returns a likelihood-weighted vector of candidate catches; this enters dialog management emulating a speech-recognition catch vector.
Goal: enable more and
better
assistive delivery contexts through economic integration around the MMI
Architecture.
These are some novel system or service-delivery-chain concepts that could leverage the MMI Framework into a better level of available service for people with disabilities.
Relay operator's workstation
One
of the current problems with disability access to IVRs, including
voice-enabled IVRs, is the poor performance of these applications when
text-telephone users access them through a Relay service. We
would presume that the same problems are also encountered by those using Video Relay Service. The most common failure mode here is that the
application times out while the relay operator is in dialog with the
remote customer. It will be possible, and probably a good idea,
to band-aid this problem by affording in-band user access to adjust the
timeout settings. On the other hand, if one looks at the
architecture of these applications, the formalization of the
(gatekeeper, automated attendant) application into a VoiceXML
application could become a time-saver for the relay operator and
consumer as well as for the enterprise guarding their phone lines with
this pre-processing stage.
The putative concept is that the
relay-operator's workstation is a gateway that re-hosts the VoiceXML
application in an MMI Framework. The remote consumer has any subset of a Total Communication terminal. This could be as simple as a text telephone, or a Video Chat application running in a set-top
box attached to their TV and Cable service, or a full text/audio/video
total conversation connection.
If the user has only a text
telephone and the application meets the "text version complete"
requirements (text alternatives in <audio> elements, for example),
the workstation can gateway the preliminary dialog to the text
telephone line without requiring the attention of a human operator.
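VoiceXML already supports the relevant alternative: the content of an <audio> element serves as its text fallback. A minimal sketch, with the prompt wording and file name invented:

  <prompt>
    <audio src="welcome_greeting.wav">
      Welcome to the Example Company automated attendant.
    </audio>
  </prompt>

A gateway serving a text-telephone caller could render that alternative text over the text channel instead of playing the recording.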
Once the preliminaries have been completed and the enterprise
line is calling a human, the call can be enqueued for attention by a
relay operator who will then speak the text from the text telephone
consumer and key what the enterprise-end speaker has to say.
Similarly, running the application on the relay operator's workstation would facilitate video relay services. In this case the relay operator provides expert gesture recognition and dialog management for the information path from the consumer to the enterprise. This is a case where the <form> constructs and their associated grammars would probably best be presented in menu mode, letting the operator guide the consumer through the choices, because the operator will know how to represent the choices in idiomatic sign language and error-correct with added dialog loops in sign. Presuming that the dialog is not designed in sign but in some spoken language such as Chinese, Arabic or English, it is better for the operator, when a VRS operator is available, to provide the error-rejection management.
Controlling your exercise session
Exercise equipment, stationary bicycles, treadmills, and the
like, are proliferating through the rapid growth of the gymnasium or
health-club industry. Typically these delivery contexts are
fitted out for multimedia presentation. Often the user's hands
and feet are busy with the exercise task, but there are available
controls both pre-session and during-session that the user can command
and the exercise session will adjust. This application would
appear to be a fertile market for a user interface that is centered on
voice-in, and adapts to use a variety of display modalities. But
safety regulations are still going to require ready mechanical access
to a 'cancel' or 'abort' action. So there are multiple modalities
to manage. Diverse mixes of exotic modes. Motorcycle-like
twist grips on the hand-holds, brief coaster-brake gestures with the
pedals, etc. Sounds like a good target use case for the
architecture.
Desiderata -- Requirements and Optimization goals
Overview
[requirement] Accessibility requires aftermarket adaptation.
Aftermarket
adaptation is changing the available input modality mix, and the
binding of application actions to the available modality gestures,
after the application author has left the scene, based solely on what
the application exposes through public interfaces.
Accessibility
requires this capability because of the diverse functional requirements
of people with disabilities and the sparse representation of any
particular profile of functional requirements in the general
population. People with disabilities can gain help from Rehabilitation Engineers who understand their situation, and Assistive Technologies that adapt to their needs, but they can't educate all application developers to design for all the delivery context options that are needed to serve this population. Accessible Multi-Modal applications
will support late binding to modality particulars.
[requirement] Enabling input-substitution across modalities is a framework function.
If
input substitution is managed on an ad-hoc basis, there are as many
cases as there are before/after pairs of modality mixes. This grows combinatorially with the number of modality options (with six modality mixes there are already 6 × 5 = 30 directed before/after cases, and each additional mix adds a dozen more); it is utterly unmanageable. So the framework must require from the application side (the data constituent and the interaction management constituent) enough information about what is there and what happens next (system response) so that the modality-management side can carry out effective re-binding of input particulars. Effective means (1: controllable tasks) that there are enabled interaction paths to all task outcomes and
(2: predictable changes) there is enough "what happens next"
information so that the user can be adequately prompted, and the user
feels the system response to user action is predictable.
The
Ubiquitous Web Applications Working Group, formerly known as Device
Independence, is committed by charter to developing technology that
solves this problem for discrete, atomic events such as mouse clicks
and keystrokes. This is a very important connection to leverage, but
voice and ink inputs are more fuzzy than mouse clicks and keystrokes.
The MMI WG will need to work with UWA to ensure that the generalization from Intent-Based Events to EMMA input messages works in full generality.
[opportunity] Dialog-oriented applications could make possible adaptive re-planning with important benefits to some users.
Dialog-oriented
applications are those where, as in VoiceXML and this framework,
interaction with the user is collected in an interaction-management
document, and not distributed as on the traditional Web in a cloud of
hyperlinks and event handlers distributed through a corpus of
document-form screens.
Mixed initiative is the name of the
game. Dialog design dances between the Scylla and Charybdis of user
boredom and users getting lost. People with disabilities have
under-performing display or input channels that change the boundary
conditions for finding a happy medium in initiative sharing. Some users
don't see the rich graphical
display and need more steps, more
'confirm' transactions than users with lots of process cues from the
visual context. Other users have motor problems and need to emit as few
input symbols as they can get away with, so they want more and bigger shortcuts, and fewer 'confirm' transactions. This "tunable asymmetry at the user interface" is an adaptation axis that would yield benefits in both directions of adjustment, and that could be enabled by the interactive re-planning strategy described below.
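Even before any planning machinery is added, the tunability itself can be expressed at the interaction-management level. A minimal sketch in SCXML, with an invented preference item and invented event names:

  <scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
         datamodel="ecmascript" initial="collect">
    <datamodel>
      <!-- invented preference; in practice it would be supplied by the
           Delivery Context constituent rather than hard-coded -->
      <data id="confirmLevel" expr="'high'"/>
    </datamodel>
    <state id="collect">
      <!-- skip confirmation for users who must minimize input symbols -->
      <transition event="item.captured" cond="confirmLevel == 'low'" target="commit"/>
      <transition event="item.captured" target="confirm"/>
    </state>
    <state id="confirm">
      <transition event="user.yes" target="commit"/>
      <transition event="user.no" target="collect"/>
    </state>
    <final id="commit"/>
  </scxml>

Transitions are tried in document order, so the conditional one wins whenever the preference allows the confirm step to be skipped.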
Interactive re-planning
Charles
Rich at Mitsubishi Research Labs in his DiamondHelp system has
demonstrated how user interfaces can be more effective if the prompting
is based on use of planning tools inside the interactive loop. The
planning tools are aware of where the user is in the task flow, and
what paths are available to lead them to various outcomes.
The
Mitsubishi technology, revolving around a canonical boolean
representation of the task model, is undergoing standardization as
CEA-2018. It is quite likely that a dialog specification in SCXML would suffice to provide enough task-model or path-look-ahead so that this sort of adaptive planning would generate major benefits for switch users and others with a high cost per input symbol. The typical assistive technology for switch users is colloquially known as an on-screen keyboard; the open-source GOK tool is an example. In the laboratory there is also progress on brain-computer interfaces. The users of this technology are also input-limited. Their adapted interaction plans would be different from those for switch users, but would likewise presumably benefit significantly from the application of interactive planning techniques to the task model provided by an SCXML interaction-management document.
For
full adaptive potential, one would want to marry the task-model
approach to delivering a reference model of all paths through the
application with a voice-recognition-like statistical representation of
observed user gestures. This would allow adaptation of the dialog
to symbol confidence (confidence as regards the system recognition of
user intent) that reflects the current interactive dialog and this
user, and not just population statistics. Since both voice and
ink input have this statistical nature in the current state of the art,
the MMI Framework provides a ready springboard for extending this to
adaptation to individual differences and exceptional individuals.
User Experience:
Three cardinal questions:
- where am I?
- what is there?
- what can I do?
The aggregate application has to afford orientation to the user to answer all these cardinal questions.
In the MMI Architecture, "where am I" may combine information from the data constituent and the interaction management constituent. Either could predominate, depending on the documents at hand. "What is there?" may fall primarily on the data constituent to answer. "What can I do?" may fall primarily on the interaction management constituent. But it falls to the Architecture to know when the documents in any constituent have something to contribute to answering these questions, and to get that information out of the architecturally-ordained interfaces to feed to the device layer for display so as to inform the user.
Provide alternatives for mode-dependent content. Expose and orient to structure.
Device Independence, Full Access, Personal View
Four Principles: Perceivable, Operable, Comprehensible, Compatible
Common Sense Suggestions show how much of what is needed is good business sense to begin with.
Total Conversation
is a definition of a delivery context that MMI applications should work
in. It combines audio, video, and text chat.
Dialog-oriented applications should port naturally to a
conversation platform such as this. Let's see how they can, and
we can institutionalize the ease of mapping to this delivery context
through the architecture of the platform for applications.
Threshold (Requirements):
Applications expose alternatives (WCAG, UAAG, IMS)
A
framework service manages re-allocating the binding between "what it
does" and "what does it" (CSS Media Queries, XML Handlers, DISelect,
RIF)
Note that the Ubiquitous Web Application work in
Intent-Based Events will provide some of what you need, but the fact
that voice and gesture catches don't necessarily have a unique
target-object or a unique 'caught token' means that MMI needs something
with more capability than simply substituting one GUI event for another.
Authoring tools facilitate the un-bundling of "what it does" from "what does it" (ATAG, DIAL)
As
is the case with Intent-Based Events, the range of possibilities for
"what it does" is broader in the context of Multi-Modal Interaction.
This means exposing intent-based targets for aftermarket-added
input modality events will stress the data and dialog modeling more
than a one-for-one substitution of atomic events.
Data constituent exposes type knowledge (structure and values) (DOM, XSL, SISR)
Why do we need say-as? Generalize from there. Type knowledge is necessary for literate presentation of values of typed data in alternate modalities. For presentation, one can often get by without this level of literacy. For capture, it is essential.
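SSML's say-as is one existing place where such type knowledge surfaces; the interpret-as values are defined outside SSML proper, and the example value here is illustrative:

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    Your appointment is on
    <say-as interpret-as="date" format="mdy">11/02/2008</say-as>.
  </speak>

The same type knowledge, exposed by the data constituent, is what lets an alternate modality capture a date literately rather than as an undifferentiated string.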
One can ask interaction-manager constituent "where is this leading me?" about next-action opportunities. (DOM? SPARQL?)
This
is one area where MMI applications offer promise to perform above
classical WebTop interactions. An application that revolves
around an interaction-management document in State-Chart XML captures a
'task flow' or "ensemble of state traces" view of the application.
Thus resources for formulating "what can I do" coaching can be
gleaned from downstream steps, just as "where am I" coaching in the
classical WebTop can be deduced from labeling on sections of the page
(headers and similar navigation landmark labels).
Meet Accessibility adaptations halfway at access APIs (WCAG2, WAI-ARIA)
Goal (Optimization Goals):
User- or aftermarket- choice to upgrade to "Information-State Update" as input integrator.
Torbjörn Lager claims
that an "information state update" framework has become "sort of a
standard" in the research community for composing inputs from diverse
and novel input modalities. It would seem to cover well cases
such as the touch-screen application used by a person with a mild
tremor, speech by people operating
in a second language, etc.
If one always had access to this level of capability in adapting
the interaction management, it would appear to make many more people
with mild functional atypicalities able to use the existing devices
with only a lowering of the confidence statistics and a moderately
greater use of 'confirm' transactions.
Meet the Hypertext Web halfway in the Backplane work on events and particularly Intent-based events.
Meet
the Synchronized Multi-Media WG halfway (where? new TF?) on the topics
of channelized flows, interactive timelines, and graph-topology flows
mixing sequence and concurrence.
Meet those using minor
languages by reason of disability halfway in working on advances in
automated and automation-accelerated Localisation to minor languages, in particular localisation of the speech and language aspects of VoiceXML
and MMI Applications.
Prior Art
Adaptation and Selection
Metadata:
Access metadata specifications from IMS migrating to ISO.
W3C Workshop on Metadata for Content Adaptation
The Backplane Caucus:
Whitepaper
Technical Plenary Panel (find: Rich Web Application Backplane)
Workshop Charter concept (Member link)
PLS, Interlinguas and the translation industry
Pronunciation Lexicon Specification
allows @role discrimination between different terms; this can be a
reference to a concept node for controlled vocabularies such as the
ETSI vocabulary of common dialog actions.
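A minimal sketch, with an invented namespace prefix and concept name, of how a PLS lexeme's role attribute could point at such a controlled-vocabulary concept:

  <lexicon version="1.0"
           xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
           xmlns:dlg="http://example.org/ns/common-dialog-actions"
           alphabet="ipa" xml:lang="en-US">
    <lexeme role="dlg:help">
      <grapheme>help</grapheme>
      <phoneme>hɛlp</phoneme>
    </lexeme>
  </lexicon>

Binding the application's 'help' action to a shared concept label in this way is what would let a third-party sign-language translation table be applied without per-application work.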
Even where localisation is planned and done by people with computer tools, the best of breed such as Xplanation.com
develop large usage databases, which could be built up in the sign-language translation of one corpus and then utilised in the auto-translation of applications outside the corpus.
Although a year old, this book surveys the background for the application of adaptable multimodal interfaces to medical and home health equipment.
Next Steps
MMI F2F
Please
schedule some time at your meeting to go over these ideas and invite
observers from the WAI to offer real-time explanation as required.
Backplane activities
There
are multiple concepts of shared foundations proposed within the
'backplane' umbrella, but at least three have legs. These are the
topics of
- Shared data model, being advanced jointly by Voice Browser and Forms
- Shared event model, again supported jointly by these two
- Intent-based Events, chartered to Ubiquitous Web Applications
The
accessibility of MMI applications will be greater to the extent that
the MMI Framework can incorporate common foundations developed for
Backplane aspects such as these.
Adapting Prototypes
It is hard
for MMI workers to respond to accessibility concepts without concrete
particulars, but it is equally hard for access experts to frame the
needs in particulars from the domain of their listeners in the MMI
group. It will be much easier for both sides if we can take a few
pilot applications that are exploring the MMI space and look at how
those applications could be well served in the delivery contexts of
people with disabilities. Please be prepared to share your early
prototypes with people with access and adaptation expertise as soon as
they run reliably. This will shed light on what in the
construction of the multimodal applications needs to be captured in the
public interfaces of the architectural constituents of the Framework.
Preferences Metadata
The
only way to alter the presentation properties of a Voice Browser
application served over the Public Switched Telephone Network is
in-band; within the application itself. There is no "browser
chrome" or "backchannel" from the user's settings in the phone to the
engine running the IVR. All this changes with the introduction of
the Multi-Modal Interaction Architecture and Interfaces. Now
there is a Delivery Context constituent; so that the user can hope to
tell their system once some preferences as to display and input
modality utilization, and expect to have these preferences applied to a
wide range of applications without having to tell each application the
preferences all over again. The IMS Global Learning Consortium has
developed a schema for personal preferences; this is being standardized
at this time in JTC 1 as well. This vocabulary gives us a
starting point for the sorts of things that should be captured from the
Delivery Context constituent and applied to input and output modality
utilization in the management of an application by the Framework.
MMI
WG should work with WAI and UWA to demonstrate the use of the DCI to
pass such metadata and as a result to configure the presentation of
content and the acceptance of input.
Some of the preferences
appearing in disability-critical delivery contexts deal with the
management of channel diversity in the user interface.
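A purely illustrative sketch, in no existing schema (the property names are invented), of the kind of channel-diversity preferences such a Delivery Context record might carry:

  <preferences xmlns="http://example.org/ns/access-preferences">
    <!-- invented vocabulary, for illustration only -->
    <output channel="chat-transcript" render-as="braille"/>
    <output channel="live-audio" time-compression="1.8"/>
    <input prefer="switch-scan" confirm-level="high"/>
  </preferences>

The IMS/ISO personal-needs-and-preferences vocabulary is the natural starting point for agreeing on the real property names.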