Multimodal Input and Time

To Whom It May Concern:

I am writing to make a recommendation regarding voice and multimodal 
input. I have run into a small problem when using speech recognition 
in multimodal applications. As you know, when speech recognition 
occurs, each word is assigned a confidence level, which is stored in 
an object, typically an XML object. It would be very nice if time 
information were stored as well. Time information falls into two 
categories.

1. For each recognized word, there should be an absolute time stamp 
recording when the word was recognized, i.e., when its confidence 
score was assigned. This time stamp could be obtained from the clock 
on the speech recognizer.

2. For each recognized word, there should be an elapsed time stamp. 
Elapsed time is measured as if by a stopwatch: when the recognizer is 
started, the stopwatch begins, and when a word is recognized it is 
assigned a time stamp in milliseconds. Each successive word would 
therefore carry an increasing millisecond value, as in the sketch 
below.
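
For example, assuming purely for illustration that the recognizer is 
started at 10:15:03.000, the two values for a short utterance might 
look like this:

    recognizer started        10:15:03.000    elapsed      0 ms
    "open" recognized         10:15:03.820    elapsed    820 ms
    "calendar" recognized     10:15:04.310    elapsed   1310 ms

The clock times on the left are category 1; the millisecond values on 
the right are category 2.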

I think this information is critical across all input modes for voice 
and multimodal processing. Speech recognition, gestures, and other 
input modes could all benefit from both of these time stamp values. 
In fact, this should be easy to implement, because the information is 
already being tracked anyway; in SALT, for example, babbletimeout and 
silence already rely on exactly this kind of stopwatch. The proposal 
amounts to two new object attributes that would exist next to the 
confidence level, as sketched below. I think this needs to be 
incorporated into all recognition and input standards: basically, any 
place there is a confidence score, there should also be these two 
time stamps. These time stamps will allow developers to process 
multimodal events with respect to time.
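
To make this concrete, a recognition result carrying both values 
might look something like the following sketch. The element and 
attribute names are purely illustrative and are not taken from any 
existing standard:

    <result>
      <!-- timestamp: the recognizer's wall-clock time (category 1);
           elapsedtime: milliseconds since the recognizer started
           (category 2) -->
      <word confidence="0.92" timestamp="2004-11-25T10:15:03.820Z"
            elapsedtime="820">open</word>
      <word confidence="0.87" timestamp="2004-11-25T10:15:04.310Z"
            elapsedtime="1310">calendar</word>
    </result>

Anything consuming such a result could then align it with other input 
streams (gestures, pen input, etc.) by comparing either value.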


Juan E. Gilbert, Ph.D.
Auburn University
Human Centered Computing Lab -
Department of Computer Science and Software Engineering
107 Dunstan Hall
Auburn, AL 36849-5347  U.S.A.
(334) 844-6316 (O)
(334) 844-6329 (F)

Received on Thursday, 25 November 2004 05:57:21 UTC