
RE: [v3] Some v3 functionality suggestions and scenarios

From: Skip Cave <Skip.Cave@intervoice.com>
Date: Thu, 3 Aug 2006 18:01:10 -0500
Message-ID: <6E80E3E8D788BA4DB7EEFC88FBE9B01307A64421@SRV-EXVS01-DAL.intervoice.int>
To: "Shane Smith" <safarishane@gmail.com>
Cc: <www-voice@w3.org>
Shane,

 

More comments on your comments:

[SC] 2 - Grammars that do NOT affect the dialog flow at all, but produce
asynchronous events to be handled by CCXML/SCXML.
[SS] Using marktime, this could be accomplished by setting marktime upon
an utterance, performing actions on the client side, and then jumping
back into your prompt using your marktime as a reference.  With
bargeintype set to hotword, I imagine this would be seamless to the
caller. 



[SC] I'm not sure that the "marktime" construct does what I am trying to
describe here. Here's the scenario:

 

A user is listening to a long voicemail. In the middle of listening to
the voicemail, the user decides that he wants to call the person that
sent the voicemail. The user says "Call Joe" or some other control
command, or presses a key that has the same effect. However, the voicemail
message continues to play, and the user continues to listen to the rest
of Joe's voicemail, after he gave the "call Joe" command. The voicemail
playback never stopped!  Meanwhile, the system has spawned a concurrent
task to call Joe, and get him on the line. 

This is what I mean by a grammar that doesn't affect the dialog flow. I
think the mark that marktime references has to be set before playback
starts, and in this case the system has no idea whether the user will
issue a command in the middle of a playback or not.
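For reference, here is a minimal sketch of the <mark>/marktime mechanism as
defined in VoiceXML 2.1 (the file names, mark name, and "resume.vxml" target
are illustrative, not from any spec). Note that resuming from the captured
offset still requires a round trip to the server to re-serve the audio from
that point, which is exactly the limitation at issue:

```xml
<form id="playVoicemail">
  <field name="command">
    <grammar src="control-commands.grxml" type="application/srgs+xml"/>
    <prompt bargein="true" bargeintype="hotword">
      <mark name="vmStart"/>
      <audio src="voicemail-joe.wav"/>
    </prompt>
    <filled>
      <!-- VoiceXML 2.1: marktime reports the milliseconds elapsed since
           the last mark was reached, at the moment of barge-in -->
      <var name="offset" expr="application.lastresult$.marktime"/>
      <!-- Playback has already stopped by now; "resuming" means a new
           fetch that serves audio cut at the captured offset -->
      <submit next="resume.vxml" namelist="command offset"/>
    </filled>
  </field>
</form>
```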


[SC] 3 - Grammars that don't return semantic tags, but instead affect
local parameters such as playback speed, loudness, audio file
position, etc.
[SS] Same, using marktime, though my guess would be a round trip to the
server. I can really see using marktime becoming ugly if we were to
request audio volume changes and needed to handle that on the server for
the upcoming HTTP fetch of the audio file. Possible, but ugly.

If these changes are implemented in v3, from an IVR perspective I would
still want the option to provide an audio cue that the grammar was
accepted and the action taken. Conversely, we would also potentially need
an earcon to let the caller know they got a nomatch on their last spoken
utterance. Both of these audio cues would need to be played on top of
the current audio stream playback, assuming these work similarly to the
bargeintype=hotword support today. Does v3 support combining audio
streams? Would we be able to do this without stopping the stream
playback, as you suggest? Otherwise, I'd end up using marktime to
implement client-side browser functionality on the server to work around
the very limitations v3 is supposed to address.
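As a concrete (hypothetical) example, a media-control grammar of the kind
item 3 describes could be written as an ordinary SRGS grammar today; the rule
name and semantic tag values below are illustrative. The change being asked
for is entirely in how the platform acts on the result - as things stand,
these tags would just come back as semantic results and drive a normal field
transition:

```xml
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="media" mode="voice"
         tag-format="semantics/1.0">
  <rule id="media" scope="public">
    <one-of>
      <item>louder <tag>out.action="volume_up";</tag></item>
      <item>softer <tag>out.action="volume_down";</tag></item>
      <item>faster <tag>out.action="rate_up";</tag></item>
      <item>skip ahead <tag>out.action="skip_forward";</tag></item>
    </one-of>
  </rule>
</grammar>
```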



[SC] This points up a basic flaw in the original design of VXML. VXML
has two ways to play a prompt: either through TTS (<prompt>), or through
an audio file (<audio>). Though VXML attempts to make these two
mechanisms look similar, they are really very different. The differences
show up when we start looking at media control commands such as
"louder", "faster", "skip ahead 10 seconds", and the like. With TTS
streaming the audio over MRCP at real-time speeds, these commands must
go to the speech server, which must implement them there. With
pre-recorded audio, the file is passed to the browser at wire speed, so
the media control will most likely have to be implemented in the
browser.

 

With TTS, any media control commands must be passed to the speech
server as soon as the user issues them. This requires an asynchronous
grammar in the VXML browser that will detect the command (either DTMF
keys or a spoken hot word) and send the event to the speech server
immediately. With audio file playback, the file typically resides in the
browser, having been transferred there at wire speed when the initial
<audio> VXML command was issued. So in this case the browser itself must
act upon the media, providing the speedup/slowdown, louder/softer, skip
forward/back, etc. algorithms. These media manipulation algorithms must
therefore reside in two places in the system: in the browser, and in the
speech server.
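For reference, the two playback mechanisms look like this in VoiceXML (the
audio URL is illustrative). They are syntactically parallel, which is what
hides the very different runtime paths described above:

```xml
<!-- Path 1: TTS - synthesized on the speech server and streamed
     to the browser at real-time speed (e.g., over MRCP) -->
<prompt>
  You have one new message.
</prompt>

<!-- Path 2: pre-recorded audio - fetched by the browser at wire
     speed, then played locally; the inline text is the standard
     fallback if the fetch fails -->
<prompt>
  <audio src="http://example.com/voicemail-joe.wav">
    Sorry, the message audio is unavailable.
  </audio>
</prompt>
```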


[SC] As far as I can tell, there is no way for CCXML to gracefully stop
a running VXML script without killing the browser, let alone suspend it
with its resume-state context saved automatically. And of course, there
is no current way for CCXML to tell a VXML browser to resume a given
state after it has been suspended.
[SS] I see your point. It could be argued that this functionality
belongs in the application scope, simply causing the next fetch to spit
out VXML that would make it seem as if we picked up right where we left
off. That leaves out client-side events, though, with CCXML trying to
tell VXML it's time to pause.
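To make the gap concrete: CCXML 1.0 provides <dialogstart> and
<dialogterminate>, but nothing in between, so the best an application can do
today is kill the dialog and later start a fresh one. A rough sketch (the
dialogid variable name is illustrative):

```xml
<transition event="connection.alerting" name="evt">
  <!-- Today's only option: terminate the running VoiceXML dialog
       outright; all of its state is lost -->
  <dialogterminate dialogid="currentDialog" immediate="true"/>
  <!-- There is no suspend/resume counterpart; a later
       <dialogstart> begins a brand-new session -->
</transition>
```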

 

[SC] Even worse, check out this scenario: Bill is listening to a long
voicemail from Joe. During the playback, Bill tells the system to call
Joe and have Joe call Bill back. Meanwhile, Bill keeps on listening to
his voicemails (Joe's message played on right through Bill's "call Joe"
command).

 

Later, Bill is listening to a long voicemail from someone else (let's
say Dave) when Joe calls back. The system should pause the playback of
Bill's long voicemail just long enough to say "Joe is calling you back"
to Bill, then continue with Dave's message. In this case Dave's message
is important, so Bill says, "Tell him to hold on, and I'll be with him
in a minute." The system spawns a task that tells Joe to hold on, while
Bill listens to the remainder of Dave's message. When Dave's message is
finished, instead of going to the next voicemail message, Bill tells the
system to "connect me to Joe." The system does this, but it suspends the
voicemail session where Bill left off, so that when he is finished with
Joe, he can come back and hear the remaining messages in his mailbox.

 

As you can see from this scenario, we need all kinds of control in the
VXML script. We have external async events that affect the dialog flow.
We have user-generated dialog events that affect the dialog flow.
CCXML/VXML today can't deal with any of this.

 

________________________________

  <http://www.intervoice.com> 

Ellis K. "Skip" Cave

CHIEF SCIENTIST
RESEARCH & DEVELOPMENT
INTERVOICE, INC.


P: (972) 454-8800  M: (214) 460-4861
skip.cave@intervoice.com 

________________________________

Intervoice: Connecting People and Information.

 





Received on Thursday, 3 August 2006 23:03:56 GMT
