Re: Bargein as defined in VoiceXML 2.0 and 2.1 - questions and comments from Yang, Xu on 2009-03-18 (www-voice@w3.org from January to March 2009)

From: Yang, Xu <Xu.Yang@Aspect.com>
Date: Wed, 18 Mar 2009 11:54:57 -0400
To: "www-voice@w3.org" <www-voice@w3.org>
Message-ID: <F2DDFEEC5C982D469FD25338393482FB906F140304@ASP1CMS2.aspect.com>

Teemu,
I have some comments that I hope will clarify things. Please see below.

Thanks,
Xu

I re-found this old thread [Evidently someone is confused as to what
"barge in" means.] about bargein and bargeintype="hotword", which never
got a proper answer from the Committee. Does such an answer exist
nowadays, when 3.0 is in progress and even 2.1 is out?

What Dean expresses in his mail is exactly the way I see barge-in, and
this is how we implemented it. My interpretation of bargein is: bargein
happens when the user actively gives input during barge-able media play
and by doing so causes the media play to terminate.

My original thought was that "hotword" bargein differs from "speech"
bargein in the definition of what is treated as input that causes media
play to terminate, and has nothing to do with the outcome of collection.
[Xu] Agreed. To be more specific, "speech" treats voice as input, and "hotword" treats a recognized utterance/DTMF as input. "hotword" has to terminate the prompt based on the collection, though, so I would not say it has nothing to do with the outcome of the collection.

With the current definition in the VoiceXML 2.0 Recommendation, chapter
4.1.5.1, bargeintype "hotword" specifies more about collection handling
than about actual barge-in behavior. I guess this should be cleared up
somehow.
[Xu] If there are ambiguities in the spec, I think we should correct them. My understanding of "barge-in behavior" is nothing more than "when to terminate the prompt": for "speech", it is when voice/DTMF is detected; for "hotword", it is when a recognized utterance/DTMF is collected. I believe we all agree on this part based on the spec.
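
A minimal sketch of this reading of "when to terminate the prompt", in Python. The function name and its parameters are hypothetical and chosen only for illustration; they do not come from the VoiceXML specification or from any platform API.

```python
def prompt_should_stop(bargeintype, input_detected, grammar_matched):
    """Decide whether a barge-able prompt is terminated (illustrative only).

    bargeintype:     "speech" or "hotword"
    input_detected:  voice energy or a DTMF key was detected during the prompt
    grammar_matched: the collected utterance/DTMF matched an active grammar
    """
    if bargeintype == "speech":
        # Any detected voice/DTMF stops the prompt, matched or not.
        return input_detected
    if bargeintype == "hotword":
        # Only a recognized (grammar-matching) input stops the prompt.
        return input_detected and grammar_matched
    raise ValueError("unknown bargeintype: %r" % (bargeintype,))
```

Under this sketch, unmatched input stops a "speech" prompt but leaves a "hotword" prompt playing.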

Reading the chapter and the defined consequences of "hotword" easily
leads us to think that the outcome of the user entering incorrect input
during the timeout period is two-ended (and by my definition no barge-in
even occurred): if bargeintype "hotword" was used, nothing happens and
noinput is thrown, and if bargeintype "speech" was used, nomatch is
thrown.
[Xu] This is the correct interpretation. The "hotword" case assumes no further input after the prompt play is done and none before the timeout expires. In the "speech" case, the prompt is terminated once an utterance/DTMF is detected.

Still, it is defined for input that starting input during the timeout
period causes the timeout to be cancelled and interdigittimeout or
termtimeout to be used. The only exception here is an exact match with
no termchar defined, which leads to an immediate end of collection.
[Xu] Once the prompt is done and the timeout period has started, we are outside the scope of bargein, no matter what the bargein type is. So we can say this is not related to the hotword/speech bargein type.

It would make sense to me if bargeintype "hotword" only affected those
collections that _start_ during prompt play (bargein) and _end_ while
the prompt is still playing or the timeout period has not yet elapsed.
In the case of non-bargeable prompt(s), the bargeintype property would
make no difference, since no prompt barge-in can occur and input starts
at the earliest at the beginning of the timeout period.
[Xu] I would say it should not cover the "timeout period has not yet elapsed" case, because bargeintype is about bargein and the timeout period is outside the scope of the prompt.
Yes, in the non-bargeable case, the bargeintype would most likely be ignored by the platform.

I guess that this was the original idea with hotword, since in VoiceXML
it is quite easy to restart collection in case of <nomatch>, but
bargeintype "hotword" is currently our only tool to prevent incorrect
input from interrupting prompt play.

[Xu] I have seen customer applications written this way. However, in my opinion, hotword must originally have been designed for a platform listening for a small number of command words. But since this functionality is so close to "keep playing the prompt if no matching utterance/DTMF is input" (selective barge-in, in Nuance's terms), it is a valid use case.

DTMF input that does not match any grammar will cause the system to
collect more digits until the interdigit timeout elapses and eventually
throw nomatch. If bargeintype "hotword" is used, should the initial DTMF
that caused the system to go into this state be discarded? Or should
only the complete collection be discarded? This is not defined, but by
analogy with voice input collection, discarding the complete collection
sounds better to me.

[Xu] This is the spot the 2.0 spec does not clearly specify, which may cause confusion: for hotword, whether unmatched input during prompt play should be discarded or not. However, if we consider hotword bargein to be valid only during prompt play, we naturally deduce that any input during prompt play is gone with the prompt, and we start a new recognition after the play is done. (As you said, this is analogous for DTMF and voice; I made the same interpretation.) I would not object to adding one line to define this in the new 3.0 spec.
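
The "gone with the prompt" interpretation could be sketched as follows, in Python. This is an illustration under stated assumptions, not platform code: the class `HotwordCollector`, its methods, and the idea of representing a grammar as a set of digit strings are all hypothetical, and interdigit timing is left out.

```python
class HotwordCollector:
    """Hypothetical DTMF buffer for one prompt played with bargeintype="hotword"."""

    def __init__(self, grammar):
        self.grammar = set(grammar)  # assumed: accepted digit strings
        self.digits = ""

    def on_dtmf(self, digit):
        """Buffer a digit; return True if the buffer now matches the grammar,
        which is the point where the prompt would be terminated."""
        self.digits += digit
        return self.digits in self.grammar

    def on_prompt_done(self):
        """At the end of prompt play, discard the whole unmatched collection,
        so that recognition restarts with an empty buffer."""
        if self.digits not in self.grammar:
            self.digits = ""
```

Note that the sketch discards the complete collection rather than only the initial digit, which is the option preferred in the discussion above.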

For example, here are a few examples of the "hotword" barge-in case
where the user may enter any number of DTMF "1".

Here is the timing sequence of the case where the caller keeps entering
DTMF-1 past the timeout period and then presses DTMF-2. (By the
definition we were not in the timeout period anymore, and nomatch should
be thrown!)

NI = NOINPUT
NM = NOMATCH
--IDT-- = Interdigit timeout period

| PROMPT PLAY | TIMEOUT |
| Bargeintype="HW" | |
--------------------------------------
\--IDT-\--IDT-\--IDT-\--IDT-\--IDT-\--IDT-\--IDT--\
DTMF-1 DTMF-1 DTMF-1 DTMF-1 DTMF-1 DTMF-1 DTMF-2 NOMATCH

[Xu] This is the correct behavior according to the spec, assuming "dtmf-1 dtmf-2" does not match any active grammar.

Here is another sequence. The user starts entering an incorrect sequence
during prompt play. Since the bargein input was started during prompt
play and the timeout had not elapsed when the first input was completed,
the collection resulted in noinput.

| PROMPT PLAY | TIMEOUT |
| Bargeintype="HW" | |
--------------------------------------
\--IDT-\--IDT--\--IDT--\ \
DTMF-1 DTMF-1 DTMF-2 NM NI

[Xu] Assuming the "NM" in the line above is a typo, this correctly reflects the spec.

Here is yet another sequence. Since the input was started during the
timeout period, it should be treated as "non"-bargein input, follow
normal input collection, and result in nomatch.

| PROMPT PLAY | TIMEOUT |
| Bargeintype="HW" | |
--------------------------------------
\--IDT-\--IDT--\--IDT--\
DTMF-1 DTMF-1 DTMF-2 NM

The DTMF timing diagrams in the VoiceXML specification do not contain
any of these hotword cases, nor do they contain any of the failing
cases. Defining those would clear up a lot.

[Xu] I think a developer can correctly deduce the above behavior from the existing 2.0 spec.
Appendix D (Timing Properties) of the 2.0 spec is meant to illustrate the different timing definitions rather than to cover all the possible use cases. That is, without the diagrams, the properties would not be well defined by the previous sections.

Was this the idea that you had in mind? Or do you really mean that with
"hotword" there is no such thing as nomatch (which then limits the
VoiceXML developer quite a lot, since it removes some vital information
about user input), and that barging in on a prompt does not actually
mean giving input during prompt play?

[Xu] Yes, your interpretations above correctly reflect the spec.
The 2.0 spec's emphasis on "no such thing as nomatch for hotword" only means that if the utterance/DTMF does not match the grammar during the valid duration of the hotword bargein, that is, during prompt play, the platform should not throw nomatch. It does not cover the time after the prompt has played and the timeout has expired. The post-prompt-play behavior should follow the rest of the spec, which your use cases correctly demonstrate.
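
A rough Python sketch of this summary, for the case where the only input is an unmatched one and nothing else follows. The function `final_event` and its arguments are hypothetical. The sketch assumes the reading given in this reply, namely that hotword bargein is in scope only during prompt play, and it deliberately leaves out input that starts during the prompt and continues past its end.

```python
def final_event(bargeintype, unmatched_input_during):
    """Event thrown when a single unmatched input is the only input (illustrative).

    unmatched_input_during: "prompt" if the unmatched collection started and
    ended while the prompt was playing, "timeout" if it started afterwards.
    """
    if bargeintype not in ("speech", "hotword"):
        raise ValueError("unknown bargeintype: %r" % (bargeintype,))
    if unmatched_input_during not in ("prompt", "timeout"):
        raise ValueError("unknown phase: %r" % (unmatched_input_during,))
    if bargeintype == "hotword" and unmatched_input_during == "prompt":
        # The input is ignored, the prompt plays to its end, and the timeout
        # then expires with no further input.
        return "noinput"
    # "speech" bargein, or any input started after the prompt has finished,
    # follows normal collection and ends in nomatch.
    return "nomatch"
```

For example, `final_event("hotword", "prompt")` gives "noinput", while `final_event("hotword", "timeout")` and `final_event("speech", "prompt")` both give "nomatch".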

Just remember these when you define 3.0. Currently the working draft is
such a skeleton of open ideas that giving any comment about it is quite
hard indeed.

- Teemu

Received on Thursday, 19 March 2009 01:32:44 UTC