- From: Skip Cave <Skip.Cave@intervoice.com>
- Date: Thu, 3 Aug 2006 18:01:10 -0500
- To: "Shane Smith" <safarishane@gmail.com>
- Cc: <www-voice@w3.org>
- Message-ID: <6E80E3E8D788BA4DB7EEFC88FBE9B01307A64421@SRV-EXVS01-DAL.intervoice.int>
Shane,

More comments on your comments:

[SC] 2- Grammars that do NOT affect the dialog flow at all, but produce asynchronous events to be handled by CCXML/SCXML.

[SS] Using marktime, this could be accomplished by setting marktime upon an utterance, performing actions on the client side, and then jumping back into your prompt using your marktime as a reference. With bargeintype set to hotword, I imagine this would be seamless to the caller.

[SC] I'm not sure that the "marktime" construct does what I am trying to describe here. Here's the scenario: a user is listening to a long voicemail. Partway through, he decides that he wants to call the person who sent it. He says "Call Joe" (or another control command), or presses a key that has the same effect. However, the voicemail message continues to play, and the user keeps listening to the rest of Joe's voicemail after giving the "call Joe" command. The voicemail playback never stops! Meanwhile, the system has spawned a concurrent task to call Joe and get him on the line. This is what I mean by a grammar that doesn't affect the dialog flow. I think the marktime property has to be set before the playback starts, and in this case the system has no idea whether the user will issue a command in the middle of a playback or not.

[SC] 3- Grammars that don't return semantic tags, but instead affect local parameters such as playback speed, loudness, audio file position, etc.

[SS] Same, using marktime, though my guess would be a round trip to the server. I can really see marktime becoming ugly if we were to request audio volume changes and needed to handle that on the server for the upcoming HTTP fetch of the audio file. Possible, but ugly. If these changes are implemented in 3, from an IVR perspective I would still want to be able to provide an audio cue that the grammar was accepted and the action taken.
Conversely, we would also potentially need an earcon to let the caller know they nomatched on their last spoken utterance. Both of these audio cues would need to be played on top of the current audio stream playback, assuming these work similarly to the bargeintype=hotword support today. Does v3 support combining audio streams? Would we be able to do this without stopping the stream playback, as you suggest? Otherwise, I'd end up using marktime to implement client-side browser functionality on the server, to work around the very limitations v3 is supposed to address.

[SC] This points up a basic flaw in the original design of VXML. VXML has two ways to play a prompt: through TTS (<prompt>) or through an audio file (<audio>). Though VXML attempts to make these two mechanisms look similar, they are really very different. The differences show up when we start looking at media control commands such as "louder", "faster", "skip ahead 10 seconds", and the like.

With TTS, the audio is streamed over MRCP at real-time speed, so any media control command must be passed to the speech server as soon as the user issues it, and the speech server must implement it there. This requires an asynchronous grammar in the VXML browser that detects the command (either DTMF keys or a spoken hotword) and sends the event to the speech server immediately.

With pre-recorded audio, the file is passed to the browser at wire speed, typically when the initial <audio> element is executed, so the file usually already resides in the browser. In this case the browser itself must act upon the media, providing speedup/slowdown, louder/softer, skip forward/back, and similar algorithms.
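A small model can make this split concrete. The sketch below is a hypothetical illustration only, not part of any VoiceXML, CCXML, or MRCP specification: the names `PromptSource`, `PlaybackState`, `route_media_command`, and `apply_locally`, the command vocabulary, and the step sizes (3 dB volume steps, a 1.25x rate factor, 10-second skips) are all assumptions. It shows that the same recognized command must be dispatched to the speech server while TTS is playing, but handled by the browser itself when a pre-recorded file is playing.

```python
from dataclasses import dataclass
from enum import Enum


class PromptSource(Enum):
    TTS = "tts"      # synthesized by the speech server and streamed in real time (e.g. over MRCP)
    AUDIO = "audio"  # pre-recorded file already fetched into the browser at wire speed


@dataclass
class PlaybackState:
    # Hypothetical browser-side playback parameters for a locally held audio file.
    volume_db: float = 0.0
    rate: float = 1.0
    position_s: float = 0.0


# Hypothetical commands recognized by an always-active hotword/DTMF grammar.
MEDIA_COMMANDS = {"louder", "softer", "faster", "slower", "skip_forward", "skip_back"}


def route_media_command(command: str, source: PromptSource) -> str:
    """Return which component must act on a media-control command."""
    if command not in MEDIA_COMMANDS:
        raise ValueError(f"unknown media command: {command}")
    # TTS audio is still being generated remotely, so the speech server must act on it;
    # a pre-recorded file is already in the browser, so the browser must act on it.
    return "speech_server" if source is PromptSource.TTS else "browser"


def apply_locally(command: str, state: PlaybackState) -> PlaybackState:
    """Browser-side handling for pre-recorded audio; step sizes are arbitrary assumptions."""
    if command == "louder":
        state.volume_db += 3.0
    elif command == "softer":
        state.volume_db -= 3.0
    elif command == "faster":
        state.rate = min(state.rate * 1.25, 2.0)
    elif command == "slower":
        state.rate = max(state.rate / 1.25, 0.5)
    elif command == "skip_forward":
        state.position_s += 10.0
    elif command == "skip_back":
        state.position_s = max(state.position_s - 10.0, 0.0)
    return state


if __name__ == "__main__":
    state = PlaybackState(position_s=4.0)
    for cmd in ("louder", "skip_back"):
        if route_media_command(cmd, PromptSource.AUDIO) == "browser":
            apply_locally(cmd, state)
    print(state)  # PlaybackState(volume_db=3.0, rate=1.0, position_s=0.0)
```

In a real system the "speech_server" branch would send a control request to the synthesizer rather than return a string; the point of the sketch is only that the dispatch target depends on where the audio is produced.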
So these media manipulation algorithms must reside in two places in the system: in the browser and in the speech server.

[SC] As far as I can tell, there is no way for CCXML to gracefully stop a running VXML script without killing the browser, let alone suspend it with its resume-state context saved automatically. And of course, there is currently no way for CCXML to tell a VXML browser to resume at a certain state after it has been suspended.

[SS] I see your point. It could be argued that this functionality belongs in the application scope: simply have the next fetch return VXML that makes it seem as if we picked up right where we left off. That leaves out client-side events, though, such as CCXML trying to tell VXML it's time to pause.

[SC] Even worse, consider this scenario: Bill is listening to his long voicemail from Joe. During the playback, Bill tells the system to call Joe and have Joe call Bill back. Meanwhile, Bill keeps listening to his voicemails (Joe's message played on, right through Bill's "call Joe" command). Bill is listening to a long voicemail from someone else (say, Dave) when Joe calls Bill back. The system should pause the playback of Dave's message just long enough to say "Joe is calling you back" to Bill, then continue with Dave's message. In this case, Dave's message is important, so Bill says, "Tell him to hold just a minute, and I'll be with him in a minute." The system spawns a task that tells Joe to hold on while Bill listens to the remainder of Dave's message. When Dave's message is finished, instead of going on to the next voicemail message, Bill tells the system to "connect me to Joe". The system does this, but it suspends Bill's mailbox session where he left off, so that when he is finished with Joe he can come back and hear the remaining messages in his mailbox. As you can see from this scenario, we need all kinds of control in the VXML script. We have external asynchronous events that affect the dialog flow.
We have user-generated dialog events that affect the dialog flow. CCXML/VXML today can't deal with any of this.

Ellis K. "Skip" Cave
Chief Scientist, Research & Development
Intervoice, Inc.
P: (972) 454-8800
M: (214) 460-4861
skip.cave@intervoice.com
Received on Thursday, 3 August 2006 23:03:56 UTC