Multimodal Interaction WG questions for WebApps (especially WebAPI)

Hello WebApps WG,

The Multimodal Interaction Working Group is working on specifications
that will support distributed applications that include inputs from
different modalities, such as speech, graphics and handwriting. We
believe there's some applicability of specific WebAPI specs such
as XMLHttpRequest and Server-Sent Events to our use cases, and we're
hoping to get some comments/feedback/suggestions from you.

Here's a brief overview of how Multimodal Interaction and WebAPI
specs might interact.

The Multimodal Architecture [1] is a loosely coupled architecture for
multimodal user interfaces, which allows for co-resident and distributed
implementations. The aim of this design is to provide a general and flexible
framework providing interoperability among modality-specific components from
different vendors - for example, speech recognition from one vendor and
handwriting recognition from another. This framework focuses on providing a
general means for allowing these components to communicate with each other,
plus basic infrastructure for application control and platform services.

The basic components of an application conforming to the Multimodal
Architecture are (1) a set of components which provide modality-related
services, such as GUI interaction, speech recognition and handwriting
recognition, as well as more specialized modalities such as biometric input,
and (2) an Interaction Manager which coordinates inputs from different
modalities with the goal of providing a seamless and well-integrated
multimodal user experience. One use case of particular interest is a
distributed one, in which a server-based Interaction Manager (using, for
example, SCXML [2]) controls a GUI component based on a (mobile or desktop)
web browser, along with a distributed speech recognition component.
"Authoring Applications for the Multimodal Architecture" [3] describes this
type of application in more detail. If, for example, speech recognition
is distributed, the Interaction Manager receives results from the recognizer
and will need to inform the browser of a spoken user input so that the
graphical user interface can reflect that information. For example, the user
might say "November 2, 2009" and that information would be displayed in a
text field in the browser. However, this requires that the server be able to
send an event to the browser to tell it to update the display. Current
implementations do this by having the browser frequently poll the server
for possible updates, but we believe a better approach would be for the
browser to be able to receive events from the server directly.

So our main question is: what mechanisms are or will be available to
support efficient communication among distributed components (for
example, speech recognizers, interaction managers, and web browsers)
that interact to create a multimodal application (hence our interest
in Server-Sent Events and XMLHttpRequest)?
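To make the use case concrete, here is a rough sketch (in TypeScript) of how a browser might consume pushed Interaction Manager events via Server-Sent Events. The event name ("fieldUpdate"), the URL, and the payload shape are illustrative assumptions on our part, not part of any specification:

```typescript
// Hypothetical payload pushed by the Interaction Manager when a
// recognizer result should be reflected in the GUI.
interface FieldUpdate {
  field: string; // id of the text field to update
  value: string; // recognized user input, e.g. a spoken date
}

// Pure helper: fold an Interaction Manager update into the current
// set of displayed field values.
function applyFieldUpdate(
  fields: Record<string, string>,
  update: FieldUpdate
): Record<string, string> {
  return { ...fields, [update.field]: update.value };
}

// In a browser, the subscription itself might look like this
// (EventSource is the browser API behind Server-Sent Events):
//
//   const source = new EventSource("/interaction-manager/events");
//   source.addEventListener("fieldUpdate", (e) => {
//     const update: FieldUpdate = JSON.parse((e as MessageEvent).data);
//     const input = document.getElementById(update.field) as HTMLInputElement;
//     if (input) input.value = update.value;
//   });

// Example: the user says "November 2, 2009"; the recognizer result reaches
// the Interaction Manager, which pushes an event the browser applies:
const updated = applyFieldUpdate({}, { field: "date", value: "November 2, 2009" });
```

The point of the sketch is that the server initiates the update; no polling loop in the browser is needed, which is exactly the behavior we would like the WebAPI specs to support.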

[1] MMI Architecture:
[2] SCXML:
[3] MMI Example:


Debbie Dahl

Received on Thursday, 24 September 2009 13:51:49 UTC