- From: Dirk Schnelle-Walka <dirk.schnelle@jvoicexml.org>
- Date: Wed, 16 Nov 2016 11:46:10 +0100 (CET)
- To: public-voiceinteraction@w3.org, David Pautler <david@intentionperception.org>
David, I would like to give you writing rights to the documents. Maybe we could continue these efforts? (A rough sketch of the markup you describe is at the bottom of this mail, after the quoted thread.)

Dirk

> David Pautler <david@intentionperception.org> wrote on 16 November 2016 at 09:56:
>
> Thank you, Deborah and Dirk.
>
> Here's a use case: An embodied conversational agent whose purpose is to
> develop greater and greater engagement with a user by encouraging the user
> to talk about their current life, and which *mirrors* user affect as
> detected from prosody and facial expressions, through either its own facial
> expressions or unobtrusive speech such as "uh huh" or "eww".
>
> The MACH system does mirroring as part of its role as a virtual interviewer
> (http://hoques.com/MACH.htm). I believe the SARA system by Justine
> Cassell's group at CMU does also
> (http://articulab.hcii.cs.cmu.edu/projects/sara/). They both appear to use
> bespoke solutions for integrating the incremental recognition results and
> for driving their unobtrusive responses (i.e. responses that don't take the
> floor from the user).
>
> Toward standardization, Dirk's idea of event handlers similar to
> <noinput> seems on the right track to me, especially when combined with
> uses of <data> that could puppeteer facial expressions. The event handlers
> would have to allow integration with streaming input sources other than
> incremental ASR, such as prosody recognizers or facial expression
> recognizers. I think it would also be necessary to specify a 'condition'
> attribute on each event handler; a condition would usually require that
> values from prosodic recognition, facial expression recognition, and SLU
> all be in certain classes or ranges that define a coherent interpretation.
>
> A further comment on Dirk's notes about ASR integration: The system must be
> able to trigger its own "uh-huh"s while ASR remains active, and this audio
> will often have to be synchronized with lip motion. So it may be that
> <data> inside a <partial> should be able to POST commands to a TTS/audio
> server as well as to the 3D animation driver.
>
> What do you think? Does the use case need changes?
>
> Cheers,
> David
>
> On 16 November 2016 at 03:07, Dirk Schnelle-Walka
> <dirk.schnelle@jvoicexml.org> wrote:
>
> > Hey there,
> >
> > some time ago I had some first thoughts with Okko Buss on an integration
> > of incremental speech processing into VoiceXML. Okko was working on his PhD
> > at the University of Bielefeld in the domain of incremental dialogs.
> >
> > We started to sketch something that we call VoiceXML AJAX. I opened the
> > Google Docs document to be viewed by everybody:
> > https://docs.google.com/document/d/1jVd-K3H_8UrrSYRCjmVHSqZaonHqdHdPWaLv4QSj5c8/edit?usp=sharing
> >
> > Maybe this goes in the direction that David had in mind?
> >
> > Thank you,
> > Dirk
> >
> > > Deborah Dahl <Dahl@conversational-Technologies.com> wrote on 15 November 2016 at 04:17:
> > >
> > > Hi David,
> > >
> > > Thanks for your comments.
> > >
> > > This sounds like a great use case. EMMA 2.0 [1] provides some capability
> > > for incremental inputs and outputs, but I think that's only a building
> > > block for the whole use case because, given incremental input and output,
> > > it's still necessary for the system to figure out how to respond. Also, the
> > > Web Speech API [2] has incremental output for speech recognition. Again,
> > > that's just a building block.
> > > It would be very interesting if you could post a more detailed
> > > description of this use case to the list, and if you have a proposal,
> > > that would be interesting, too.
> > >
> > > If you have links to SARA and MACH, that would also be helpful.
> > >
> > > Best,
> > >
> > > Debbie
> > >
> > > [1] https://www.w3.org/TR/emma20/
> > > [2] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
> > >
> > > From: David Pautler [mailto:david@intentionperception.org]
> > > Sent: Monday, November 14, 2016 8:06 PM
> > > To: public-voiceinteraction@w3.org
> > > Subject: Incremental recognition, Unobtrusive response
> > >
> > > There are several multimodal virtual agents, like MACH and SARA, that
> > > provide partial interpretation of what the user is saying or expressing
> > > facially ("incremental recognition") as well as backchannel 'listener
> > > actions' ("unobtrusive response") based on those interpretations. This
> > > style of interaction is much more human-like than the strictly turn-based
> > > style of VoiceXML (and related W3C specs) and of all chatbot platforms I'm
> > > aware of.
> > >
> > > Is this interaction style (which might be called "IRUR") among the use
> > > cases of any planned update to a W3C spec?
> > >
> > > Cheers,
> > > David
>
> --
> *David Pautler, PhD*
> AI & NLP Consultant
> https://www.linkedin.com/in/davidpautler
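P.S.: To make your event handler and <data>-inside-<partial> ideas a bit more concrete, here is a rough, purely illustrative sketch of what such markup could look like. <partial> is the element from our VoiceXML AJAX draft and is not part of VoiceXML 2.1; the application.prosody, application.face, and application.slu variables, the cond expression, and the endpoint URLs are invented for this example (only <field>, <grammar>, <var>, <data>, <filled>, and <prompt> are existing VoiceXML elements). Your proposed 'condition' attribute is written as cond to match VoiceXML's existing event handlers.

<!-- Hypothetical sketch only; see the caveats above. -->
<field name="userStory">
  <grammar src="open_ended.grxml" type="application/srgs+xml"/>

  <!-- Handler for incremental ASR results, analogous to <noinput>/<nomatch>.
       The condition requires prosody, facial expression, and SLU to agree
       on one coherent interpretation before any reaction is triggered. -->
  <partial cond="application.prosody.arousal == 'low'
                 &amp;&amp; application.face.expression == 'sad'
                 &amp;&amp; application.slu.topic == 'personal'">

    <var name="faceState" expr="application.face.expression"/>
    <var name="arousal" expr="application.prosody.arousal"/>

    <!-- Puppeteer the avatar's face via the 3D animation driver ... -->
    <data name="mirrorCmd" src="https://example.org/animation/driver"
          method="post" namelist="faceState"/>

    <!-- ... and trigger an unobtrusive "uh huh" on the TTS/audio server
         without taking the floor from the user. -->
    <data name="backchannelCmd" src="https://example.org/tts/backchannel"
          method="post" namelist="arousal"/>
  </partial>

  <filled>
    <prompt>Tell me more about that.</prompt>
  </filled>
</field>

Keeping the animation driver and the TTS/audio server as two separate <data> targets leaves the lip-sync problem open; in practice the two POSTs would probably need to carry a shared timestamp so that the "uh huh" audio and the mouth movement can be aligned.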
Received on Wednesday, 16 November 2016 10:46:40 UTC