- From: Dirk Schnelle-Walka <dirk.schnelle@jvoicexml.org>
- Date: Wed, 16 Nov 2016 11:46:10 +0100 (CET)
- To: public-voiceinteraction@w3.org, David Pautler <david@intentionperception.org>
David, I would like to give you writing rights to the documents. Maybe we could continue these efforts? (A rough sketch of the markup you describe is at the bottom of this mail, after the quoted thread.)

Dirk

> David Pautler <david@intentionperception.org> wrote on 16 November 2016 at 09:56:
>
> Thank you, Deborah and Dirk.
>
> Here's a use case: An embodied conversational agent whose purpose is to
> develop greater and greater engagement with a user by encouraging the user
> to talk about their current life, and which *mirrors* user affect as
> detected from prosody and facial expressions, through either its own facial
> expressions or unobtrusive speech such as "uh huh" or "eww".
>
> The MACH system does mirroring as part of its role as a virtual interviewer
> (http://hoques.com/MACH.htm). I believe the SARA system by Justine
> Cassell's group at CMU does also
> (http://articulab.hcii.cs.cmu.edu/projects/sara/). They both appear to use
> bespoke solutions for integrating the incremental recognition results and
> for driving their unobtrusive responses (i.e. responses that don't take the
> floor from the user).
>
> Toward standardization, Dirk's idea of event handlers similar to
> <noinput> seems on the right track to me, especially when combined with
> uses of <data> that could puppeteer facial expressions. The event handlers
> would have to allow integration with streaming input sources other than
> incremental ASR, such as prosody recognizers or facial expression
> recognizers. I think it would also be necessary to specify a 'condition'
> attribute on each event handler; a condition would usually require that
> values from prosodic recognition, facial expression recognition, and SLU
> all be in certain classes or ranges that define a coherent interpretation.
>
> A further comment on Dirk's notes about ASR integration: The system must be
> able to trigger its own "uh-huh"s while ASR remains active, and this audio
> will often have to be synchronized with lip motion. So it may be that
> <data> inside a <partial> should be able to POST commands to a TTS/audio
> server as well as to the 3D animation driver.
>
> What do you think? Does the use case need changes?
>
> Cheers,
> David
>
> On 16 November 2016 at 03:07, Dirk Schnelle-Walka
> <dirk.schnelle@jvoicexml.org> wrote:
>
> > Hey there,
> >
> > some time ago I had some first thoughts with Okko Buss on an integration
> > of incremental speech processing into VoiceXML. Okko was working on his PhD
> > at the University of Bielefeld in the domain of incremental dialogs.
> >
> > We started to sketch something that we call VoiceXML AJAX. I opened the
> > Google Docs document to be viewed by everybody:
> > https://docs.google.com/document/d/1jVd-K3H_8UrrSYRCjmVHSqZaonHqdHdPWaLv4QSj5c8/edit?usp=sharing
> >
> > Maybe this goes in the direction that David had in mind?
> >
> > Thank you,
> > Dirk
> >
> > > Deborah Dahl <Dahl@conversational-Technologies.com> wrote on 15 November 2016 at 04:17:
> > >
> > > Hi David,
> > >
> > > Thanks for your comments.
> > >
> > > This sounds like a great use case. EMMA 2.0 [1] provides some capability
> > > for incremental inputs and outputs, but I think that's only a building
> > > block for the whole use case because, given incremental input and output,
> > > it's still necessary for the system to figure out how to respond. Also, the
> > > Web Speech API [2] has incremental output for speech recognition. Again,
> > > that's just a building block.
> > > It would be very interesting if you could post a more detailed
> > > description of this use case to the list, and if you have a proposal,
> > > that would be interesting, too.
> > >
> > > If you have links to SARA and MACH, that would also be helpful.
> > >
> > > Best,
> > >
> > > Debbie
> > >
> > > [1] https://www.w3.org/TR/emma20/
> > > [2] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
> > >
> > > From: David Pautler [mailto:david@intentionperception.org]
> > > Sent: Monday, November 14, 2016 8:06 PM
> > > To: public-voiceinteraction@w3.org
> > > Subject: Incremental recognition, Unobtrusive response
> > >
> > > There are several multimodal virtual agents, like MACH and SARA, that
> > > provide partial interpretation of what the user is saying or expressing
> > > facially ("incremental recognition") as well as backchannel 'listener
> > > actions' ("unobtrusive response") based on those interpretations. This
> > > style of interaction is much more human-like than the strictly turn-based
> > > style of VoiceXML (and related W3C specs) and of all chatbot platforms I'm
> > > aware of.
> > >
> > > Is this interaction style (which might be called "IRUR") among the use
> > > cases of any planned update to a W3C spec?
> > >
> > > Cheers,
> > > David
>
> --
> *David Pautler, PhD*
> AI & NLP Consultant
> https://www.linkedin.com/in/davidpautler
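P.S.: To make your event handler and <data>-inside-<partial> ideas a bit more concrete, here is a rough, purely illustrative sketch of what such markup could look like. <partial> is the element from our VoiceXML AJAX draft and is not part of VoiceXML 2.1; the application.prosody, application.face, and application.slu variables, the cond expression, and the endpoint URLs are invented for this example (only <field>, <grammar>, <var>, <data>, <filled>, and <prompt> are existing VoiceXML elements). Your proposed 'condition' attribute is written as cond to match VoiceXML's existing event handlers.

<!-- Hypothetical sketch only; see the caveats above. -->
<field name="userStory">
  <grammar src="open_ended.grxml" type="application/srgs+xml"/>

  <!-- Handler for incremental ASR results, analogous to <noinput>/<nomatch>.
       The condition requires prosody, facial expression, and SLU to agree
       on one coherent interpretation before any reaction is triggered. -->
  <partial cond="application.prosody.arousal == 'low'
                 &amp;&amp; application.face.expression == 'sad'
                 &amp;&amp; application.slu.topic == 'personal'">

    <var name="faceState" expr="application.face.expression"/>
    <var name="arousal" expr="application.prosody.arousal"/>

    <!-- Puppeteer the avatar's face via the 3D animation driver ... -->
    <data name="mirrorCmd" src="https://example.org/animation/driver"
          method="post" namelist="faceState"/>

    <!-- ... and trigger an unobtrusive "uh huh" on the TTS/audio server
         without taking the floor from the user. -->
    <data name="backchannelCmd" src="https://example.org/tts/backchannel"
          method="post" namelist="arousal"/>
  </partial>

  <filled>
    <prompt>Tell me more about that.</prompt>
  </filled>
</field>

Keeping the animation driver and the TTS/audio server as two separate <data> targets leaves the lip-sync problem open; in practice the two POSTs would probably need to carry a shared timestamp so that the "uh huh" audio and the mouth movement can be aligned.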
Received on Wednesday, 16 November 2016 10:46:40 UTC