Re: Incremental recognition, Unobtrusive response

If the facial recognition system is separate from the VXML interpreter, 
it isn't sufficient to have only partial recognition results.  External 
systems need to be able to send asynchronous events to the VXML 
interpreter.  This will be necessary in many multi-modal contexts.  
That's why we put an external events module in VoiceXML 3 
https://www.w3.org/TR/2010/WD-voicexml30-20101216/#ExternalCommunicationModule. 
(It was one of the many features that made V3 too ambitious and too 
complicated to complete.)
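
To make that concrete, here is a rough sketch of how such an event 
might be handled once it reaches the interpreter.  The markup is 
illustrative only: the event name is invented, and the delivery 
mechanism (how the external system actually injects the event) is 
precisely what the external communication module was meant to define, 
not something VoiceXML 2.x can express.

   <field name="destination">
     <grammar src="cities.grxml"/>
     <!-- Hypothetical: fired when an external facial recognition
          component pushes an event while the field is still
          collecting speech. -->
     <catch event="external.facerecognition.match">
       <log>External event: <value expr="_event"/></log>
       <!-- adapt the prompt, set a variable, etc., without ending
            the user's turn -->
     </catch>
   </field>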

It would be interesting to see how those asynchronous events from the 
facial recognition system would interact with the partial recognition 
events.  For one thing, it might be necessary to change the recognition 
grammar on the fly.  In that case, would we want to keep the existing 
partial results, and use the new grammar only for new results?  Then we 
might end up with a final result that didn't match either grammar.  So 
perhaps we would have to go back and re-run recognition from the beginning.
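
Spelled out as a timeline (purely hypothetical, just to show where the 
ordering gets awkward):

   <!-- t0: recognition starts, constrained by grammar A           -->
   <!-- t1: partial results arrive for the first few words         -->
   <!-- t2: a facial recognition event arrives (say, the user      -->
   <!--     looks confused) and the dialog switches to grammar B   -->
   <!-- t3: more speech arrives; which grammar constrains it?      -->
   <!-- t4: the final result spans the switch, so it may match     -->
   <!--     neither A nor B unless the whole utterance is          -->
   <!--     re-recognized against the new grammar                  -->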

- Jim

On 11/16/2016 3:56 AM, David Pautler wrote:
> Thank you, Deborah and Dirk.
>
> Here's a use case: An embodied conversational agent whose purpose is 
> to develop greater and greater engagement with a user by encouraging 
> the user to talk about their current life, and which *mirrors* user 
> affect as detected from prosody and facial expressions through either 
> its own facial expressions or unobtrusive speech such as "uh huh" or 
> "eww".
>
> The MACH system does mirroring as part of its role as a virtual 
> interviewer (http://hoques.com/MACH.htm). I believe the SARA system by 
> Justine Cassell's group at CMU does also 
> (http://articulab.hcii.cs.cmu.edu/projects/sara/). They both appear to 
> use bespoke solutions for integrating the incremental recognition 
> results and for driving their unobtrusive responses (i.e. responses 
> that don't take the floor from the user).
>
> Toward standardization, Dirk's idea of event handlers similar to 
> <noinput> seems on the right track to me, especially when combined 
> with uses of <data> that could puppeteer facial expressions. The event 
> handlers would have to allow integration with streaming input sources 
> other than incremental ASR, such as prosody recognizers or facial 
> expression recognizers. I think it would also be necessary to specify 
> a 'condition' attribute on each event handler; a condition would 
> usually require that values from prosodic recognition, facial 
> expression recognition, and SLU all be in certain classes or ranges 
> that define a coherent interpretation.
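>
> To make that concrete, a handler might look roughly like this 
> (entirely hypothetical syntax: the <partial> element, its 'cond' 
> attribute, the 'face' and 'slu' variables, and the URL are all 
> inventions for this sketch, not existing VoiceXML):
>
>    <partial event="recognition.partial"
>             cond="face.expression == 'disgust'
>                   &amp;&amp; slu.topic == 'food'">
>      <!-- Puppeteer the avatar's face through the animation driver. -->
>      <data src="http://animation.example.com/expression"
>            method="post" namelist="face slu"/>
>    </partial>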
>
> A further comment for Dirk's notes on ASR integration: The system must 
> be able to trigger its own "uh-huh"s while ASR remains active, and 
> this audio will often have to be synchronized with lip motion. So it 
> may be that <data> inside a <partial> should be able to POST commands 
> to a TTS/audio server as well as to the 3D animation driver.
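>
> For example (again hypothetical markup; both endpoint URLs are 
> placeholders, and 'utterance' would be a dialog variable holding the 
> backchannel text, e.g. "uh-huh"):
>
>    <partial event="recognition.partial">
>      <!-- Ask the TTS/audio server to render a short backchannel... -->
>      <data src="http://tts.example.com/backchannel"
>            method="post" namelist="utterance"/>
>      <!-- ...and tell the animation driver to lip-sync the same clip. -->
>      <data src="http://animation.example.com/lipsync"
>            method="post" namelist="utterance"/>
>    </partial>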
>
> What do you think? Does the use case need changes?
>
> Cheers,
> David
>
>
>
>
> On 16 November 2016 at 03:07, Dirk Schnelle-Walka 
> <dirk.schnelle@jvoicexml.org> wrote:
>
>     Hey there,
>
>     some time ago I had some first thoughts with Okko Buss on an
>     integration of incremental speech processing into VoiceXML. Okko
>     was working on his PhD at the University of Bielefeld in the
>     domain of incremental dialogs.
>
>     We started to sketch something that we call VoiceXML AJAX. I
>     opened the Google Docs document for everybody to view:
>     https://docs.google.com/document/d/1jVd-K3H_8UrrSYRCjmVHSqZaonHqdHdPWaLv4QSj5c8/edit?usp=sharing
>
>     Maybe this goes in the direction that David had in mind?
>
>     Thank you,
>     Dirk
>
>     > Deborah Dahl <Dahl@conversational-Technologies.com> wrote on 15
>     November 2016 at 04:17:
>     >
>     >
>     > Hi David,
>     >
>     > Thanks for your comments.
>     >
>     > This sounds like a great use case. EMMA 2.0 [1] provides some
>     capability for incremental inputs and outputs, but I think that’s
>     only a building block for the whole use case because given
>     incremental input and output, it’s still necessary for the system
>     to figure out how to respond. Also, the Web Speech API [2] has
>     incremental output for speech recognition. Again, that’s just a
>     building block.
>     >
>     > It would be very interesting if you could post a more detailed
>     description of this use case to the list, and if you have a
>     proposal that would be interesting, too.
>     >
>     > If you have links to SARA and MACH, that would also be helpful.
>     >
>     > Best,
>     >
>     > Debbie
>     >
>     > [1] https://www.w3.org/TR/emma20/
>     >
>     > [2] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
>     >
>     >
>     >
>     > From: David Pautler [mailto:david@intentionperception.org]
>     > Sent: Monday, November 14, 2016 8:06 PM
>     > To: public-voiceinteraction@w3.org
>     > Subject: Incremental recognition, Unobtrusive response
>     >
>     >
>     >
>     > There are several multimodal virtual agents like MACH and SARA
>     that provide partial interpretation of what the user is saying or
>     expressing facially ("incremental recognition") as well as
>     backchannel 'listener actions' ("unobtrusive response") based on
>     those interpretations. This style of interaction is much more
>     human-like than the strictly turn-based style of Vxml (and related
>     W3C specs) and of all chatbot platforms I'm aware of.
>     >
>     > Is this interaction style (which might be called "IRUR") among
>     the use cases of any planned update to a W3C spec?
>     >
>     > Cheers,
>     > David
>     >
>
>
>
>
> -- 
> *David Pautler, PhD*
> AI & NLP Consultant
> https://www.linkedin.com/in/davidpautler

Received on Wednesday, 16 November 2016 12:38:18 UTC