Comments PR-mmi-arch-20120814

Hey there,

Through the announcement of the W3C Proposed Recommendation status of
the MMI architecture, I became aware of the recent changes to the
specification, on which I would like to comment.

First of all, I would like to express my appreciation for the progress
that has been made. Every step in the direction of standardization of
interchangeable components of a multimodal system will help to tackle
the system complexity of current and future multimodal architectures and
hopefully ease further advancements.

My background is in research on model-based development of multimodal
systems and more specifically in using gestures, hand poses, and body
movements to control web applications.

My current research project is about a web platform to design and run
multimodal interfaces; earlier I was involved in the MASP/Sercho
project.

After reading through the document, I would like to propose rethinking
the overall structure of the document and specifically the overview
section: it’s really hard to grasp the overall focus of the
specification. I think it would help if the document started with an
introduction that is clearer about the general idea and content of the
spec: the definition of component lifecycles and their coordination
using a set of standardized events. An abstract sequence diagram
(similar to the ones at the end) at the beginning of the document would
also help understanding. The link to the mentioned Galaxy architecture
seems to be broken.

Regarding the lifecycle events/protocol, I had problems figuring out
whether any assumptions are made about the interaction-relevant data of
the modality components that needs to be fused, processed, and
thereafter distributed to the media. While reading the spec I had the
feeling that the modalities considered are all of a discrete nature.
But what about direct manipulation, pointing, multi-touch, or
continuous gestures? Can streams between components be established and
managed by the IM, or is this out of scope for this spec?

Regarding the following terms and references I have some specific remarks:

MVC - I agree neither that this is a "recent" approach (it was
suggested in 1979 for Smalltalk) nor that MVC is really related to
this spec. MVC is often referred to because of its "separation of
concerns", but it also defines dependencies between these components
that do not match the proposed architecture and that are often
misunderstood (see e.g. Martin Fowler). In practice an MVC-based system
architecture easily "fragments" into a whole bunch of interrelated MVC
triples that are complex to maintain. MVC strictly separates input from
output, which has been identified as a problem for multimodal systems.
Maybe it's worth taking a look at the ideas of
Presentation-Abstraction-Control (PAC) by Joëlle Coutaz et al. PAC and
PAC Amodeus implement ideas like the "Russian doll" and the "nested IM".

Context - As an "outsider" I have not followed the discussion, but isn't
it a "session" that can be joined by users and transferred between
modalities that is specified? The word "context" has been heavily
overused (at least in science) and in my opinion complicates
understanding this spec, since multimodal systems already rely on
concepts such as the "context-of-use" (Gaelle Calvary et al.), which
could be confusing.

Transport Protocols - I understood that one basic idea of the
specification is to be as abstract as possible and adaptable to
different systems, e.g. transport systems, but I have not understood
why HTTP request/response has been chosen (or proposed?) as one
suitable solution. Wouldn’t it be easier to use a stateful protocol for
such an approach? Otherwise each participating component has to manage
and recover the communication state internally. Why are SIP, XMPP or to
some extent WebSockets not

Kind regards,


Sebastian Feuerstack
Department of Computer Science
Federal University of Sao Carlos - Brazil

Check out MINT 2010 - the Multimodal INTeraction Framework

Received on Tuesday, 28 August 2012 13:52:24 UTC