Notes on the Multi-Modal Interaction Architecture from an Accessibility Angle

based on
draft dated 22 April 2005

Executive Summary


Peril and Potential

The W3C Multi-Modal Interaction Architecture and Interfaces is a platform for Web Application delivery.  The accessibility prospects for applications delivered over this platform resemble the construction of the Chinese word for 'crisis': they combine danger and opportunity.

There is a problem:
Without strengthening the flow-down of access-supportive requirements through the MMI Architecture, many MMI applications will be developed in needlessly inaccessible form.
There is rich promise:
Separating interaction management from input device handling makes it natural (though not necessary) to deal in intent-based events.
A well-crafted platform for integrating multi-modal interaction can make it easier to integrate assistive delivery contexts with atypical mode mixes that run standard applications effectively, using mass-market constituents with a few small assistive-peculiar constituents.
Adaptation strategies learned in service to people with disabilities will make the authoring of multi-modal interaction applications significantly more portable across delivery contexts, accelerating the availability of multi-modal interaction advantages in the delivery of applications and the more widespread use of multi-modal interaction technology.

Requirement and response

Despite the advances in server-side adaptation of the user experience, there is a continuing requirement for MMI application authors to leave in public interfaces enough of a record that the interaction profile can be changed after the author has left the scene without loss of function. This can be fostered in the architecture by being careful to articulate requirements on the public interfaces of the separated constituents handling the different aspects of the overall application-delivery system.

Next Steps

Use Cases

Threshold: achieve a functional user experience by transcoding commercial applications to support conventional access strategies.

GUI-like applications:

There will be a class of applications, targeted to delivery-context classes with large enough graphic displays and some sort of keyset, where the accessibility strategy is the same as it would be for a "point and click" oriented GUI application.  Here, anything you can do with any modality gesture can be accomplished with some sequence of key actions; and anything that is displayed to the user can be displayed by a textual representation.  Sometimes the keystrokes take you on a slightly different path through the application-state possibilities, and sometimes the text is a separate parallel object to the default presented object, but the task-level end goals reachable are the same and the sense of the media objects is conveyed in the text alternatives.

Dialog-oriented applications:

In addition to applications that are close enough to GUI applications in their organization to be access-enabled in the same ways, there will also be a radically different kind of application.  These are applications which closely resemble VoiceXML applications, sometimes termed voice-enabled Interactive Voice Response applications.  These are cases where the application can be completely operated by listening and speaking, but alternate modalities are also enabled for episodes in the dialog where the available additional mode capabilities of the delivery context afford greater speed or lower error rates.

These applications suggest a different strategy for access:  The application is based in an SCXML or other interaction-management document, and the interface to input mode adaptation is in terms of menus of choices, not strings or key events.

Mild tremor and touch screen: 

Likelihood-conditioned catches replace uniquely targeted events.

Assume the application is touch-screen enabled and the user has a mild tremor: they can perform touch gestures, but doing so generates enough false hits that the application needs to take account of the false-detect likelihood in reported UI events.  Adaptive modifications in processing could be built around a substitute touch processor that reports a short list of likelihood-rated touch targets rather than a unique target.  If the application is running in the MMI Framework and is speech enabled, the alternate touch processor could report in the mode of emulated speech catches, and no further custom processing or components are required.
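A minimal sketch of such a substitute touch processor, assuming nothing beyond standard Python; the function name, the Gaussian error model, and the coordinates are illustrative, not part of any MMI specification:

```python
import math

def nbest_touch_targets(touch_xy, targets, sigma=40.0, top_n=3):
    """Rank touch targets by likelihood instead of reporting a unique hit.

    touch_xy : (x, y) of the reported touch
    targets  : mapping of target id -> (x, y) centre of its touch region
    sigma    : spread (pixels) modelling the user's tremor; larger = fuzzier
    Returns a list of (target_id, confidence) pairs, shaped like a speech
    recogniser's N-best catch vector.
    """
    tx, ty = touch_xy
    scores = {}
    for tid, (cx, cy) in targets.items():
        d2 = (tx - cx) ** 2 + (ty - cy) ** 2
        scores[tid] = math.exp(-d2 / (2 * sigma ** 2))  # Gaussian likelihood
    total = sum(scores.values()) or 1.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(tid, s / total) for tid, s in ranked[:top_n]]
```

A dialog manager that already consumes a speech recogniser's N-best vector could consume this list unchanged; only `sigma` needs tuning to the individual user's tremor.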

Blind Participant scribing Working Group tele-conference with IVR chat.

Here you would be surprised at the concurrent audio channels that one may use.  The live voice audio is one stream; the Text-to-Speech reading of what is coming across the chat channel is another, and what the individual themselves is typing into the chat channel by way of notes is a third.

Some users indeed cope with these three streams of speech going on at once.  Others want time-division multiplexing by time-compression on the live audio and interleaving of the TTS content, again running above normal speech rate.

Concurrent use of a refreshable Braille display and audio output is common among
those who can use Braille and can hear.  This is perhaps a more common example
of concurrent channels, where the user's ability to dispose content aspects or streams into the channel of their choice is important.

Accessibility probably is the pacing application domain for user-directed channel diversity use.  But it won't work unless it can be applied to both output (SYMM
domain) and input (MMI domain).  So please work with us and them to see how this can
be achieved.

Single-switch user and On-screen Keyboard assistive technology.

Switch users represent the 'corner case' among "high cost per input
symbol" user categories for whom there is an established mode
of assistive technology.

Assistive tech animates focus among action opportunities.  Switch action invokes (one of: pause | doIt).

Best served by a menu-mode transcode of the dialog (interaction-management document?  SCXML?  VoiceXML?)
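The scanning behaviour sketched in the notes above can be illustrated as follows; the class and method names are hypothetical, and a real on-screen keyboard such as GOK also handles dwell times, grouping of choices, and the pause action:

```python
class SingleSwitchScanner:
    """Toy sketch of on-screen-keyboard style scanning.

    Focus cycles through the current menu of action opportunities on each
    timer tick; the user's single switch action selects whatever holds focus.
    """
    def __init__(self, choices):
        self.choices = list(choices)
        self.index = 0  # which action opportunity currently holds focus

    def tick(self):
        """Timer tick: advance focus to the next action opportunity."""
        self.index = (self.index + 1) % len(self.choices)
        return self.choices[self.index]

    def press(self):
        """Single switch action: invoke the currently focused choice."""
        return self.choices[self.index]
```

Note that the menu of choices here is exactly what a menu-mode transcode of an interaction-management document would supply, which is why dialog-oriented applications are the natural source material.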

Sign Language access to Forms Mode Applications including Voice Browser Applications

The application need not be developed to be multi-modal for access to make use of the MMI Framework.  An example: a culturally Deaf individual, someone for whom sign language is their first language, could find a transcode of a voice-enabled IVR to be their best access path to that application, as opposed to a GUI WebTop version that is verbose at a high reading level in some written language.  The integration of sign-language gesture recognition would work equally well with forms-mode applications realized in VoiceXML or XForms.

Value-added techniques in this case could include: binding intents to standard concepts such as the ETSI vocabulary of commodity action concepts for voice applications.  If an intent in an application matches one of these terms, binding it to the standard label for this concept would facilitate auto-localization to sign language, as that translation table would likely be available from a third party.

The brute force approach of using text entry as the common point of reference between the EMMA-producing layer and the EMMA-consuming layer could work, but would be extremely tedious because it forces the user to a) work in a foreign, written language, and b) finger-spell everything.  Not cool.  Better to use a SYSTRAN or BabelFish level of auto-translation with liberal access to confirmation and correction, as is typical in beginning use of voice commands.

Cross-network access as an opportunity: just as blind consumers frequently prefer printer-friendly or mobile-friendly versions of websites, there is plausible suspicion that single-switch users (and high-input-symbol-cost users in general), if the transcoding were available, would benefit from access to the IVR or mobile MMI version of tasks that are available as phone dialogs, even when they are "browsing the web" in HTTP/REST as their major mode of surfing.

Brain-Computer Interface.

Users of direct brain interfaces may have an even higher "cost per input symbol", but the technology is largely a laboratory phenomenon at present, not widely available to consumers.  Fuzzy catches, but a larger catch space than the single switch.

Personally trained speech recognition for atypical speech.

Architecture appears to handle this well as is.

Personally trained spell checker for dyslexic access to, say, search.

Text entry box re-routed through a trained recognizer, which returns a likelihood-weighted vector of candidate catches; this enters dialog management emulating a speech-recognition catch vector.
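A toy sketch of such a recognizer, using plain edit distance where a personally trained recognizer would use the individual's own substitution statistics; all names here are illustrative:

```python
def candidate_catches(typed, lexicon, max_results=3):
    """Return a likelihood-weighted vector of candidate words for a typed
    string, shaped like a speech recogniser's catch vector.
    """
    def edit_distance(a, b):
        # Classic Levenshtein distance via dynamic programming.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    # Nearer words (fewer edits) score higher; normalise over the top few.
    scored = [(w, 1.0 / (1 + edit_distance(typed, w))) for w in lexicon]
    scored.sort(key=lambda wc: wc[1], reverse=True)
    top = scored[:max_results]
    total = sum(c for _, c in top)
    return [(w, c / total) for w, c in top]
```

Because the output has the same shape as a speech N-best list, the dialog manager's existing confirmation and correction loops apply without modification.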

Goal: enable more and better assistive delivery contexts through economic integration around the MMI Architecture.

These are some novel system or service-delivery-chain concepts that could leverage the MMI Framework into a better level of available service for people with disabilities.

Relay operator's workstation

One of the current problems with disability access to IVRs, including voice-enabled IVRs, is the poor performance of these applications when text-telephone users access them through a Relay service.  We would presume that the same problems are also encountered by those using a Video Relay Service.  The most common failure mode here is that the application times out while the relay operator is in dialog with the remote customer.  It would be possible, and probably a good idea, to band-aid this problem by affording in-band user access to adjust the timeout settings.  On the other hand, if one looks at the architecture of these applications, formalizing the (gatekeeper, automated attendant) application as a VoiceXML application could become a time-saver for the relay operator and consumer, as well as for the enterprise guarding its phone lines with this pre-processing stage.

The putative concept is that the relay operator's workstation is a gateway that re-hosts the VoiceXML application in an MMI Framework.  The remote consumer has any subset of a Total Communication terminal.  This could be as simple as a text telephone, or a Video Chat application running in a set-top box attached to their TV and cable service, or a full text/audio/video total conversation connection.

If the user has only a text telephone and the application meets the "text version complete" requirements (text alternatives to <audio> elements, for example), the workstation can gateway the preliminary dialog to the text telephone line without requiring the attention of a human operator.  Once the preliminaries have been completed and the enterprise line is calling a human, the call can be enqueued for attention by a relay operator, who will then speak the text from the text-telephone consumer and key what the enterprise-end speaker has to say.

Similarly, running the application on the relay operator's workstation would facilitate video relay services.  In this case the relay operator provides expert gesture recognition and dialog management for the information path from the consumer to the enterprise.  This is a case where the <form> constructs in the associated grammar would probably best be presented in menu mode, letting the operator guide the consumer through the choices, because the operator will know how to represent the choices in idiomatic sign language and to error-correct with added dialog loops in sign.  Presuming that the dialog is designed not in sign but in some spoken language such as Chinese, Arabic or English, it is better for the operator, when a VRS operator is available, to provide the error-rejection management.

Controlling your exercise session

Exercise equipment, stationary bicycles, treadmills, and the like, are proliferating through the rapid growth of the gymnasium or health-club industry.  Typically these delivery contexts are fitted out for multimedia presentation.  Often the user's hands and feet are busy with the exercise task, but there are controls available both pre-session and during-session that the user can command, and the exercise session will adjust.  This application would appear to be a fertile market for a user interface that is centered on voice-in and adapts to use a variety of display modalities.  But safety regulations are still going to require ready mechanical access to a 'cancel' or 'abort' action.  So there are multiple modalities to manage, and diverse mixes of exotic modes: motorcycle-like twist grips on the hand-holds, brief coaster-brake gestures with the pedals, and so on.  This sounds like a good target use case for the architecture.

Desiderata -- Requirements and Optimization goals


[requirement] Accessibility requires aftermarket adaptation.

Aftermarket adaptation is changing the available input modality mix, and the binding of application actions to the available modality gestures, after the application author has left the scene, based solely on what the application exposes through public interfaces.

Accessibility requires this capability because of the diverse functional requirements of people with disabilities and the sparse representation of any particular profile of functional requirements in the general population. People with disabilities can gain help from Rehabilitation Engineers who understand their situation, and Assistive Technologies that adapt for their needs, but they can't educate all the application developers to design for all the delivery context options that are needed to serve this population. Accessible Multi-Modal applications will support late binding to modality particulars.

[requirement] Enabling input-substitution across modalities is a framework function.

If input substitution is managed on an ad-hoc basis, there are as many cases as there are before/after pairs of modality mixes. This grows combinatorially with the number of modality options; it is utterly unmanageable. So the framework must require from the application side (the data constituent and the interaction management constituent) enough information about what is there and what happens next (the system response) so that the modality-management side can carry out effective re-binding of input particulars. Effective means (1: controllable tasks) that there are enabled interaction paths to all task outcomes and (2: predictable changes) that there is enough "what happens next" information so that the user can be adequately prompted, and the user feels the system response to user action is predictable.
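The flow-down this requirement asks for can be pictured as a small binding table; the following is a sketch under assumed names, not an interface drawn from the Architecture documents:

```python
class IntentBindingTable:
    """Sketch of framework-level intent/gesture re-binding.

    The application registers intents ("what it does") with a prompt and a
    declared system response ("what happens next"); the modality side binds,
    or later re-binds, concrete gestures ("what does it") without touching
    the application itself.
    """
    def __init__(self):
        self.intents = {}   # intent -> {"prompt": ..., "next": ...}
        self.bindings = {}  # gesture -> intent

    def register_intent(self, intent, prompt, next_state):
        self.intents[intent] = {"prompt": prompt, "next": next_state}

    def bind(self, gesture, intent):
        if intent not in self.intents:
            raise KeyError("unknown intent: %s" % intent)
        self.bindings[gesture] = intent

    def dispatch(self, gesture):
        """Resolve a user gesture to its intent and predicted next state."""
        intent = self.bindings[gesture]
        return intent, self.intents[intent]["next"]
```

Because the table stores both the prompt and the "what happens next" state, an aftermarket adapter can both re-bind gestures and generate adequate prompting, which is exactly the two halves of "effective" above.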

The Ubiquitous Web Applications Working Group, formerly known as Device Independence, is committed by charter to developing technology that solves this problem for discrete, atomic events such as mouse clicks and keystrokes. This is a very important connection to leverage, but voice and ink inputs are fuzzier than mouse clicks and keystrokes. MMI will need to work with UWA to ensure that the generalization from Intent-Based Events to EMMA input messages actually works.

[opportunity] Dialog-oriented applications could make possible adaptive re-planning with important benefits to some users.

Dialog-oriented applications are those where, as in VoiceXML and this framework, interaction with the user is collected in an interaction-management document, and not distributed as on the traditional Web in a cloud of hyperlinks and event handlers distributed through a corpus of document-form screens.

Mixed initiative is the name of the game. Dialog design dances between the Scylla and Charybdis of user boredom and users getting lost. People with disabilities have under-performing display or input channels that change the boundary conditions for finding a happy medium in initiative sharing. Some users don't see the rich graphical display and need more steps, more 'confirm' transactions, than users with lots of process cues from the visual context. Other users have motor problems and need to emit as few input symbols as they can get away with, so they want more and bigger shortcuts, and fewer 'confirm' transactions. This "tunable asymmetry at the user interface" is an adaptation axis that would yield benefits in both directions of adjustment, and that could be enabled by the following technical strategy.

Interactive re-planning

Charles Rich at Mitsubishi Research Labs in his DiamondHelp system has demonstrated how user interfaces can be more effective if the prompting is based on use of planning tools inside the interactive loop. The planning tools are aware of where the user is in the task flow, and what paths are available to lead them to various outcomes.

The Mitsubishi technology, revolving around a canonical boolean representation of the task model, is undergoing standardization as CEA-2018. It is quite likely that a dialog specification in SCXML would suffice to provide enough task model, or path look-ahead, for this sort of adaptive planning to generate major benefits for switch users and others with a high cost per input symbol. The typical assistive technology for switch users is colloquially known as an on-screen keyboard; the open-source GOK tool is an example. In the laboratory there is also progress on brain-computer interfaces. The users of that technology are also input-limited. Their adapted interaction plans would be different from those for switch users, but would likewise presumably benefit significantly from the application of interactive planning techniques to the task model provided by an SCXML interaction-management document.

For full adaptive potential, one would want to marry the task-model approach to delivering a reference model of all paths through the application with a voice-recognition-like statistical representation of observed user gestures.  This would allow adaptation of the dialog to symbol confidence (confidence as regards the system recognition of user intent) that reflects the current interactive dialog and this user, and not just population statistics.  Since both voice and ink input have this statistical nature in the current state of the art, the MMI Framework provides a ready springboard for extending this to adaptation to individual differences and exceptional individuals.

User Experience:

Three cardinal questions: "Where am I?"  "What is there?"  "What can I do?"

The aggregate application has to afford orientation to the user to answer all three of these cardinal questions.

In the MMI Architecture, "Where am I?" may combine information from the data constituent and the interaction management constituent.  Either could predominate, depending on the documents at hand.  "What is there?" may fall primarily on the data constituent to answer.  "What can I do?" may fall primarily on the interaction management constituent.  But it falls to the Architecture to know when the documents in any constituent have something to contribute to answering these questions, and to get that information out through the architecturally-ordained interfaces to feed to the device layer for display so as to inform the user.


Provide alternatives for mode-dependent content.  Expose and orient to structure.


Device Independence, Full Access, Personal View


Four Principles: Perceivable, Operable, Comprehensible, Compatible

Common Sense Suggestions show how much of what is needed is good business sense to begin with.

Total Conversation is a definition of a delivery context that MMI applications should work in.  It combines audio, video, and text chat.  Dialog-oriented applications should port naturally to a conversation platform such as this.  Let's see how they can, and we can institutionalize the ease of mapping to this delivery context through the architecture of the platform for applications.

Threshold (Requirements):

Applications expose alternatives (WCAG, UAAG, IMS)

A framework service manages re-allocating the binding between "what it does" and "what does it" (CSS Media Queries, XML Handlers, DISelect, RIF)

Note that the Ubiquitous Web Application work in Intent-Based Events will provide some of what you need, but the fact that voice and gesture catches don't necessarily have a unique target-object or a unique 'caught token' means that MMI needs something with more capability than simply substituting one GUI event for another.

Authoring tools facilitate the un-bundling of "what it does" from "what does it" (ATAG, DIAL)

As is the case with Intent-Based Events, the range of possibilities for "what it does" is broader in the context of Multi-Modal Interaction.  This means exposing intent-based targets for aftermarket-added input modality events will stress the data and dialog modeling more than a one-for-one substitution of atomic events.

Data constituent exposes type knowledge (structure and values) (DOM, XSL, SISR)

Why do we need say-as?  Generalize from there.  Type knowledge is necessary for literate presentation of values of typed data in alternate modalities.  For presentation, one can often get by without this level of literacy.  For capture, it is essential.
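To make the generalization concrete, here is a sketch of type-aware rendering; the datatype tags are invented for illustration, not drawn from SSML say-as or any schema:

```python
def render_value(value, datatype, modality):
    """Sketch: use type knowledge to present a value literately in an
    alternate modality.  Datatype and modality names are illustrative.
    """
    if datatype == "currency-usd":
        dollars, cents = divmod(round(value * 100), 100)
        if modality == "speech":
            # A spoken rendering: literate, not a character-by-character dump.
            return "%d dollars and %d cents" % (dollars, cents)
        return "$%d.%02d" % (dollars, cents)
    if datatype == "phone-number" and modality == "speech":
        return " ".join(value)  # speak digit by digit, say-as style
    return str(value)
```

The capture direction is the harder half, as the text says: to accept "three fifty" as a currency amount, the recogniser needs the same type knowledge driving its grammar, not just its output formatting.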

One can ask the interaction-manager constituent "where is this leading me?" about next-action opportunities. (DOM? SPARQL?)

This is one area where MMI applications offer promise to perform above classical WebTop interactions.  An application that revolves around an interaction-management document in State-Chart XML captures a 'task flow' or "ensemble of state traces" view of the application.  Thus resources for formulating "what can I do" coaching can be gleaned from downstream steps, just as "where am I" coaching in the classical WebTop can be deduced from labeling on sections of the page (headers and similar navigation landmark labels).
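As a sketch of such a look-ahead query, treating the interaction-management document as a plain transition graph (real SCXML parsing, conditions, and parallel states are omitted):

```python
from collections import deque

def reachable_outcomes(transitions, state):
    """Breadth-first "where is this leading me?" over an SCXML-like chart.

    transitions : state -> {event: next_state}
    Returns the set of states reachable from `state`; raw material for
    formulating "what can I do" coaching.
    """
    seen, queue = {state}, deque([state])
    while queue:
        s = queue.popleft()
        for nxt in transitions.get(s, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(state)  # report only downstream states
    return seen
```

The events on the outgoing transitions of the current state answer "what can I do?" directly; the reachable set answers "where is this leading me?".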

Meet Accessibility adaptations halfway at access APIs (WCAG2, WAI-ARIA)

Goal (Optimization Goals):

User- or aftermarket- choice to upgrade to "Information-State Update" as input integrator.

Torbjörn Lager claims that an "information state update" framework has become "sort of a standard" in the research community for composing inputs from diverse and novel input modalities.  It would seem to cover well cases such as the touch-screen application used by a person with a mild tremor, speech by people operating
in a second language, etc.  If one always had access to this level of capability in adapting the interaction management, it would appear to make many more people with mild functional atypicalities able to use the existing devices with only a lowering of the confidence statistics and a moderately greater use of 'confirm' transactions.

Meet the Hypertext Web halfway in the Backplane work on events and particularly Intent-based events.

Meet the Synchronized Multi-Media WG halfway (where? new TF?) on the topics of channelized flows, interactive timelines, and graph-topology flows mixing sequence and concurrence.

Meet those using minor languages by reason of disability halfway in working on advances in automated and automation-accelerated localisation to minor languages, in particular localisation of the speech and language aspects of VoiceXML and MMI Applications.

Prior Art

Common-Sense Suggestions for Developing Multimodal User Interfaces

Accessibility APIs

Total Conversation

Device Independence

Adaptation and Selection Metadata:

Access metadata specifications from IMS migrating to ISO.

W3C Workshop on Metadata for Content Adaptation

The Backplane Caucus:


Technical Plenary Panel (find: Rich Web Application Backplane)


Charter concept (Member link)

PLS, Interlinguas and the translation industry

The Pronunciation Lexicon Specification allows @role discrimination between different terms; this can be a reference to a concept node for controlled vocabularies such as the ETSI vocabulary of common dialog actions.

Even where localisation is planned and done by people with computer tools, the best-of-breed tools develop large usage databases, which could be built up in the sign-language translation of one corpus and then utilised in the auto-translation of applications outside that corpus.

Access to Medical Instrumentation book

Although a year old, this book surveys the background for the application of adaptable multimodal interfaces to medical and home health equipment.

Next Steps


Please schedule some time at your meeting to go over these ideas and invite observers from the WAI to offer real-time explanation as required.

Backplane activities

There are multiple concepts of shared foundations proposed within the 'backplane' umbrella, but at least three have legs.  The accessibility of MMI applications will be greater to the extent that the MMI Framework can incorporate common foundations developed for Backplane aspects such as these.

Adapting Prototypes

It is hard for MMI workers to respond to accessibility concepts without concrete particulars, but it is equally hard for access experts to frame the needs in particulars from the domain of their listeners in the MMI group.  It will be much easier for both sides if we can take a few pilot applications that are exploring the MMI space and look at how those applications could well be served in the delivery contexts of people with disabilities.  Please be prepared to share your early prototypes with people with access and adaptation expertise as soon as they run reliably.  This will shed light on what in the construction of the multimodal applications needs to be captured in the public interfaces of the architectural constituents of the Framework.

Preferences Metadata

The only way to alter the presentation properties of a Voice Browser application served over the Public Switched Telephone Network is in-band, within the application itself.  There is no "browser chrome" or "backchannel" from the user's settings in the phone to the engine running the IVR.  All this changes with the introduction of the Multi-Modal Interaction Architecture and Interfaces.  Now there is a Delivery Context constituent, so the user can hope to tell their system their display and input-modality preferences once, and expect to have those preferences applied to a wide range of applications without having to repeat them to each application.  The IMS Learning Consortium has developed a schema for personal preferences; this is being standardized at this time in JTC 1 as well.  This vocabulary gives us a starting point for the sorts of things that should be captured from the Delivery Context constituent and applied to input and output modality utilization in the management of an application by the Framework.

MMI WG should work with WAI and UWA to demonstrate the use of the DCI to pass such metadata and as a result to configure the presentation of content and the acceptance of input.

Some of the preferences appearing in disability-critical delivery contexts deal with the management of channel diversity in the user interface.