RE: Multimodal Input and Time

Dear Juan,

Thank you very much for your comments on timing and confidence
information. The Multimodal Interaction Working Group has recently
published a new Working Draft of the EMMA (Extensible MultiModal
Annotation) specification [1], which provides for the representation
of timestamp and confidence information for user inputs in any
modality. It appears that the notion of "absolute timestamp" in EMMA
may address your suggestion (1), and the "relative timestamp" may
address your suggestion (2). The Multimodal Interaction Working Group
would very much welcome your feedback as to whether these two
annotations satisfy your needs, and more generally, we welcome
your feedback on the EMMA specification as a whole.
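
For illustration only, a recognition result annotated in both ways
might look roughly like the fragment below. The attribute names
reflect the timestamp annotations in the draft, but the values are
invented, and the draft itself remains the normative reference:

   <emma:emma version="1.0"
       xmlns:emma="http://www.w3.org/2003/04/emma">
     <!-- emma:start/emma:end carry absolute timestamps in
          milliseconds since 1 January 1970 (your suggestion 1);
          emma:offset-to-start and emma:duration are relative to a
          reference point, such as the start of recognition (your
          suggestion 2). All values below are invented. -->
     <emma:interpretation id="interp1"
         emma:confidence="0.87"
         emma:start="1101340479000"
         emma:end="1101340480500"
         emma:offset-to-start="2300"
         emma:duration="1500">
       <command>open</command>
     </emma:interpretation>
   </emma:emma>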

best regards,

Debbie Dahl
MMI WG Chair

[1] http://www.w3.org/TR/emma/

> -----Original Message-----
> From: www-multimodal-request@w3.org 
> [mailto:www-multimodal-request@w3.org] On Behalf Of Juan E. Gilbert
> Sent: Wednesday, November 24, 2004 6:54 PM
> To: www-voice@w3.org; www-multimodal@w3.org
> Subject: Multimodal Input and Time
> 
> To Whom It May Concern:
> 
> I am writing to give you a recommendation for voice and
> multimodality. I have run into a small problem when using speech
> recognition for multimodal applications. As you know, when speech
> recognition occurs, each word is assigned a confidence level. This
> is stored in an object, typically an XML object. It would be very
> nice if time information were stored as well. Time information
> falls into two categories.
> 
> 1. For each recognized word, there should be a time stamp of when
> the confidence score was assigned or when the word was recognized.
> This time stamp could be obtained from the clock on the speech
> recognizer.
> 
> 2. For each recognized word, there should be an elapsed time stamp.
> Elapsed time is the time captured from a stop watch. For example,
> when the recognizer is started, a stop watch begins. When a word is
> recognized, it is assigned a time stamp in milliseconds. Each
> successive word/recognition would have an increasing value in
> milliseconds.
> 
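> To illustrate both categories together (the element and attribute
> names below are invented, just to show the idea), a recognition
> result could carry the two time stamps right next to the
> confidence score:
> 
>    <!-- hypothetical markup, not from any existing standard:
>         timestamp = absolute clock time from the recognizer,
>         elapsed   = milliseconds since the recognizer started -->
>    <word text="open" confidence="0.87"
>          timestamp="2004-11-24T18:53:07.250" elapsed="2300"/>
> 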
> I think this information is critical across all input modes for
> voice and multimodal processing. Speech recognition, gestures,
> etc. could all benefit from using both of these time stamp values.
> In fact, this is very easy to implement, because all of this
> information is being used anyway. For example, in SALT,
> babbletimeout and silence already use a stop watch. This would
> require two new object attributes that would exist next to the
> confidence level. I think this needs to be incorporated in all
> recognition and input standards. Basically, any place there is a
> confidence score, there should be these two time stamps. These
> time stamps will allow developers to process multimodal events
> with respect to time.
> 
> Thanks,
> 
> -- 
> Juan E. Gilbert, Ph.D.
> Auburn University
> Human Centered Computing Lab - http://interact.cse.eng.auburn.edu/
> Department of Computer Science and Software Engineering
> 107 Dunstan Hall
> Auburn, AL 36849-5347  U.S.A.
> (334) 844-6316 (O)
> (334) 844-6329 (F)
> gilbert@eng.auburn.edu
> http://www.eng.auburn.edu/~gilbert/
> 

Received on Tuesday, 21 December 2004 18:31:44 UTC