RE: Incremental recognition, Unobtrusive response

I also remembered the Semaine project’s “sensitive artificial listener” (http://www.semaine-project.eu/), which I think is another example of the kind of system that David’s talking about.

 

From: Jim Barnett [mailto:1jhbarnett@gmail.com] 
Sent: Wednesday, November 16, 2016 7:38 AM
To: public-voiceinteraction@w3.org
Subject: Re: Incremental recognition, Unobtrusive response

 

If the facial recognition system is separate from the VXML interpreter, it isn't sufficient to have only partial recognition results.  External systems need to be able to send asynchronous events to the VXML interpreter.  This will be necessary in many multi-modal contexts.  That's why we put an external events module in VoiceXML 3 https://www.w3.org/TR/2010/WD-voicexml30-20101216/#ExternalCommunicationModule.  (It was one of the many features that made V3 too ambitious and too complicated to complete.) 
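
As a rough illustration of what "asynchronous events from external systems" could look like in practice, here is a minimal Python sketch. It is not the VoiceXML 3 external events module. The `Interpreter` class, the `Event` type, and the event names `face.smile` and `asr.partial` are all assumptions made up for this example. The idea is only that external recognizers post events to a thread-safe queue and the interpreter dispatches them in arrival order.

```python
import queue
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    name: str                                  # hypothetical names, e.g. "face.smile"
    payload: dict = field(default_factory=dict)


class Interpreter:
    """Toy stand-in for a dialog interpreter that accepts external events."""

    def __init__(self) -> None:
        self._queue: "queue.Queue[Event]" = queue.Queue()
        self._handlers: Dict[str, List[Callable[[Event], None]]] = {}

    def on(self, name: str, handler: Callable[[Event], None]) -> None:
        """Register a handler for one event name."""
        self._handlers.setdefault(name, []).append(handler)

    def post(self, event: Event) -> None:
        """Called by external systems, possibly from other threads."""
        self._queue.put(event)

    def drain(self) -> int:
        """Dispatch all queued events in arrival order; return how many were seen."""
        seen = 0
        while True:
            try:
                event = self._queue.get_nowait()
            except queue.Empty:
                return seen
            for handler in self._handlers.get(event.name, []):
                handler(event)
            seen += 1
```

In this sketch a partial recognition result and a facial-expression event travel through the same queue, so the interpreter sees them interleaved in the order they arrived.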

It would be interesting to see how those asynchronous events from the facial recognition system would interact with the partial recognition events.  For one thing, it might be necessary to change the recognition grammar on the fly.  In that case, would we want to keep the existing partial results, and use the new grammar only for new results?  Then we might end up with a final result that didn't match either grammar.  So perhaps we would have to go back and re-run recognition from the beginning.
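
The two options above (keep the existing partial results, or re-run recognition from the beginning) can be sketched in a few lines of Python. This is a toy model under strong assumptions: a "grammar" is just a set of allowed words, each input "frame" is already a word token rather than audio, and the class and method names are hypothetical. It only shows why buffering the input makes a re-run possible.

```python
from typing import Iterable, List, Optional


class IncrementalRecognizer:
    """Toy recognizer: a frame is recognized if it is a word in the grammar."""

    def __init__(self, grammar: Iterable[str]) -> None:
        self._grammar = frozenset(grammar)
        self._frames: List[str] = []    # buffered input, kept for possible re-runs
        self._partial: List[str] = []   # partial result so far

    def _decode(self, frame: str) -> Optional[str]:
        return frame if frame in self._grammar else None

    def feed(self, frame: str) -> List[str]:
        """Consume one frame and return the updated partial result."""
        self._frames.append(frame)
        word = self._decode(frame)
        if word is not None:
            self._partial.append(word)
        return list(self._partial)

    def set_grammar(self, grammar: Iterable[str], rerun: bool = True) -> List[str]:
        """Swap the grammar; optionally re-decode all buffered input with it."""
        self._grammar = frozenset(grammar)
        if rerun:
            decoded = (self._decode(f) for f in self._frames)
            self._partial = [w for w in decoded if w is not None]
        return list(self._partial)
```

With `rerun=False`, feeding "book", "flight" under the grammar {book, flight}, then switching to {book, table} and feeding "table", yields "book flight table", which neither grammar accepts as a whole. With `rerun=True` the same input yields "book table", at the cost of keeping the input buffered and decoding it again.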

- Jim

On 11/16/2016 3:56 AM, David Pautler wrote:

Thank you, Deborah and Dirk. 

 

Here's a use case: an embodied conversational agent whose purpose is to build ever greater engagement with a user by encouraging them to talk about their current life. The agent mirrors user affect, as detected from prosody and facial expressions, through either its own facial expressions or unobtrusive speech such as "uh huh" or "eww".

 

The MACH system does mirroring as part of its role as a virtual interviewer (http://hoques.com/MACH.htm). I believe the SARA system by Justine Cassell's group at CMU does also (http://articulab.hcii.cs.cmu.edu/projects/sara/). Both appear to use bespoke solutions for integrating the incremental recognition results and for driving unobtrusive responses (i.e. responses that don't take the floor from the user).

 

Toward standardization, Dirk's idea of event handlers similar to <noinput> seems on the right track to me, especially when combined with uses of <data> that could puppeteer facial expressions. The event handlers would have to allow integration with streaming input sources other than incremental ASR, such as prosody recognizers or facial expression recognizers. I think it would also be necessary to specify a 'condition' attribute on each event handler; a condition would usually require that values from prosodic recognition, facial expression recognition, and SLU all be in certain classes or ranges that define a coherent interpretation.
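
To make the 'condition' idea a little more concrete, here is a small Python sketch. It is not a proposed syntax for any spec. The field names (`prosody_arousal`, `face_expression`, `slu_intent`), the thresholds, and the handler list are all assumptions invented for illustration. Each handler fires only when the values from all three recognizers fall into the classes or ranges it names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Observation:
    prosody_arousal: float   # hypothetical scale from 0.0 (calm) to 1.0 (agitated)
    face_expression: str     # hypothetical class label, e.g. "disgust"
    slu_intent: str          # hypothetical SLU class, e.g. "narrate"


@dataclass
class Handler:
    name: str
    condition: Callable[[Observation], bool]
    response: str            # unobtrusive backchannel to produce


# Example handlers; the thresholds are arbitrary illustration values.
HANDLERS: List[Handler] = [
    Handler(
        "mirror-disgust",
        lambda o: o.face_expression == "disgust"
        and o.prosody_arousal >= 0.5
        and o.slu_intent == "narrate",
        "eww",
    ),
    Handler(
        "acknowledge",
        lambda o: o.face_expression == "neutral"
        and o.prosody_arousal < 0.5
        and o.slu_intent == "narrate",
        "uh huh",
    ),
]


def select_response(handlers: List[Handler], obs: Observation) -> Optional[str]:
    """Return the response of the first handler whose condition holds, if any."""
    for handler in handlers:
        if handler.condition(obs):
            return handler.response
    return None
```

An observation that satisfies no handler's condition (for example a smile with high arousal) produces no backchannel at all, which seems like the right default for an unobtrusive listener.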

 

A further comment for Dirk's notes on ASR integration: The system must be able to trigger its own "uh-huh"s while ASR remains active, and this audio will often have to be synchronized with lip motion. So it may be that <data> inside a <partial> should be able to POST commands to a TTS/audio server as well as to the 3D animation driver.

 

What do you think? Does the use case need changes?

 

Cheers,

David

 

 

 

 

On 16 November 2016 at 03:07, Dirk Schnelle-Walka <dirk.schnelle@jvoicexml.org> wrote:

Hey there,

Some time ago, Okko Buss and I had some first thoughts on integrating incremental speech processing into VoiceXML. Okko was working on his PhD at the University of Bielefeld in the domain of incremental dialogs.

We started to sketch something that we call VoiceXML AJAX. I opened the Google Docs document for everybody to view: https://docs.google.com/document/d/1jVd-K3H_8UrrSYRCjmVHSqZaonHqdHdPWaLv4QSj5c8/edit?usp=sharing

Maybe this goes in the direction that David had in mind?

Thank you,
Dirk

> On 15 November 2016 at 04:17, Deborah Dahl <Dahl@conversational-Technologies.com> wrote:

>
>
> Hi David,
>
> Thanks for your comments.
>
> This sounds like a great use case. EMMA 2.0 [1] provides some capability for incremental inputs and outputs, but I think that’s only a building block for the whole use case because given incremental input and output, it’s still necessary for the system to figure out how to respond. Also, the Web Speech API [2] has incremental output for speech recognition. Again, that’s just a building block.
>
> It would be very interesting if you could post a more detailed description of this use case to the list, and if you have a proposal, that would be interesting, too.
>
> If you have links to SARA and MACH, that would also be helpful.
>
> Best,
>
> Debbie
>
> [1] https://www.w3.org/TR/emma20/
>
> [2] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
>
>
>
> From: David Pautler [mailto:david@intentionperception.org]
> Sent: Monday, November 14, 2016 8:06 PM
> To: public-voiceinteraction@w3.org
> Subject: Incremental recognition, Unobtrusive response
>
>
>
> There are several multimodal virtual agents like MACH and SARA that provide partial interpretation of what the user is saying or expressing facially ("incremental recognition") as well as backchannel 'listener actions' ("unobtrusive response") based on those interpretations. This style of interaction is much more human-like than the strictly turn-based style of VoiceXML (and related W3C specs) and of all chatbot platforms I'm aware of.
>
> Is this interaction style (which might be called "IRUR") among the use cases of any planned update to a W3C spec?
>
> Cheers,
> David
>





 

-- 

David Pautler, PhD

AI & NLP Consultant

https://www.linkedin.com/in/davidpautler

 

Received on Thursday, 17 November 2016 14:33:48 UTC