Re: Incremental recognition, Unobtrusive response

Thank you, Deborah and Dirk.

Here's a use case: an embodied conversational agent whose purpose is to
build ever-greater engagement with a user by encouraging the user to talk
about their current life. The agent *mirrors* user affect, as detected
from prosody and facial expressions, through its own facial expressions or
through unobtrusive speech such as "uh-huh" or "eww".

The MACH system does mirroring as part of its role as a virtual interviewer
(http://hoques.com/MACH.htm). I believe the SARA system by Justine
Cassell's group at CMU does as well (
http://articulab.hcii.cs.cmu.edu/projects/sara/). Both appear to use
bespoke solutions for integrating incremental recognition results and for
driving their unobtrusive responses (i.e., responses that don't take the
floor from the user).

Toward standardization, Dirk's idea of event handlers similar to
<noinput> seems on the right track to me, especially when combined with
uses of <data> that could puppeteer facial expressions. The event handlers
would have to allow integration with streaming input sources beyond
incremental ASR, such as prosody recognizers or facial expression
recognizers. I think each event handler would also need a 'condition'
attribute; a condition would typically require that values from prosodic
recognition, facial expression recognition, and SLU all fall in classes or
ranges that define a coherent interpretation.
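
To make this concrete, here is a rough sketch of the markup I have in
mind. Everything in it except <data>, <var>, and <assign> is invented for
illustration: <onpartial>, its 'source' and 'cond' attributes, and the
prosody/face/slu variables are hypothetical names, not part of any
existing spec.

    <form id="listener">
      <var name="mirrorcmd"/>
      <!-- Hypothetical handler, analogous to <noinput>, fired on each
           incremental result from ASR, the prosody recognizer, or the
           facial expression recognizer. -->
      <onpartial source="prosody face slu"
                 cond="prosody.arousal &gt; 0.7 &amp;&amp;
                       face.valence &lt; 0 &amp;&amp;
                       slu.topic == 'food'">
        <assign name="mirrorcmd" expr="'express-disgust'"/>
        <!-- Puppeteer the agent's face by POSTing to the 3D animation
             driver (endpoint URL invented for the example). -->
        <data src="http://localhost:8080/animation" method="post"
              namelist="mirrorcmd"/>
      </onpartial>
    </form>

The 'cond' above encodes one coherent interpretation (high arousal,
negative facial valence, food topic) that licenses a disgust-mirroring
response.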

A further comment on Dirk's notes about ASR integration: the system must be
able to trigger its own "uh-huh"s while ASR remains active, and this audio
will often have to be synchronized with lip motion. So it may be that
<data> inside a <partial> should be able to POST commands to a TTS/audio
server as well as to the 3D animation driver.
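
As a sketch (again, <partial> and the endpoint URL are invented, while
<data> and <var> are real VoiceXML 2.1 elements):

    <form id="engagement">
      <var name="backchannel" expr="'uh-huh'"/>
      <field name="user_story">
        <!-- Hypothetical container whose content runs on each
             incremental ASR result, without ending the user's turn. -->
        <partial>
          <!-- POST the backchannel utterance to a TTS/audio server;
               that server would also stream viseme timings to the
               animation driver so audio stays in sync with lip
               motion. -->
          <data src="http://localhost:8080/tts" method="post"
                namelist="backchannel"/>
        </partial>
      </field>
    </form>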

What do you think? Does the use case need changes?

Cheers,
David

On 16 November 2016 at 03:07, Dirk Schnelle-Walka <
dirk.schnelle@jvoicexml.org> wrote:

> Hey there,
>
> some time ago I had some first thoughts with Okko Buss on an integration
> of incremental speech processing into VoiceXML. Okko was working on his PhD
> at the University of Bielefeld in the domain of incremental dialogs.
>
> We started to sketch something that we call VoiceXML AJAX. I opened the
> Google Docs document to be viewed by everybody: https://docs.google.com/
> document/d/1jVd-K3H_8UrrSYRCjmVHSqZaonHqdHdPWaLv4QSj5c8/edit?usp=sharing
>
> Maybe this goes in the direction that David had in mind?
>
> Thank you,
> Dirk
>
> > Deborah Dahl <Dahl@conversational-Technologies.com> wrote on 15 November
> 2016 at 04:17:
> >
> >
> > Hi David,
> >
> > Thanks for your comments.
> >
> > This sounds like a great use case. EMMA 2.0 [1] provides some capability
> for incremental inputs and outputs, but I think that’s only a building
> block for the whole use case: even given incremental input and output, the
> system still has to figure out how to respond. Also, the Web Speech API [2]
> has incremental output for speech recognition. Again, that’s just a
> building block.
> >
> > It would be very interesting if you could post a more detailed
> description of this use case to the list, and if you have a proposal that
> would be interesting, too.
> >
> > If you have links to SARA and MACH, that would also be helpful.
> >
> > Best,
> >
> > Debbie
> >
> > [1] https://www.w3.org/TR/emma20/
> >
> > [2] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
> >
> >
> >
> > From: David Pautler [mailto:david@intentionperception.org]
> > Sent: Monday, November 14, 2016 8:06 PM
> > To: public-voiceinteraction@w3.org
> > Subject: Incremental recognition, Unobtrusive response
> >
> >
> >
> > There are several multimodal virtual agents like MACH and SARA that
> provide partial interpretation of what the user is saying or expressing
> facially ("incremental recognition") as well as backchannel 'listener
> actions' ("unobtrusive response") based on those interpretations. This
> style of interaction is much more human-like than the strictly turn-based
> style of VoiceXML (and related W3C specs) and of all chatbot platforms I'm
> aware of.
> >
> > Is this interaction style (which might be called "IRUR") among the use
> cases of any planned update to a W3C spec?
> >
> > Cheers,
> > David
> >
>



-- 
*David Pautler, PhD*
AI & NLP Consultant
https://www.linkedin.com/in/davidpautler

Received on Wednesday, 16 November 2016 08:56:51 UTC