Clarifications needed: VoiceXML specification 2.0, 2.1 and 3.0

Dear W3C Voice Committee and mailing list members,

These issues affect VoiceXML 3.0 as well as 2.1 and 2.0, since 3.0 inherits parts of the 2.0 specification without any changes.

1. Termchar - input or not?

The VoiceXML 2.0 specification defines termchar in chapter 6.3.3 as:
"The terminating DTMF character for DTMF input recognition. The default value is "#"."

In the same specification, appendix C, the chapter "Invalid DTMF input" states:
"At each point, the user may enter DTMF which is not permitted by the active grammar(s). This causes the collected DTMF string to be invalid. Additional digits will be collected until either the termchar is pressed or the interdigittimeout has elapsed. A nomatch event is then generated."

The specification leaves the door open to different interpretations when the first collected digit is the termchar. The "Invalid DTMF input" chapter clearly assumes that the user has entered a DTMF that was not permitted by the grammar, after which we start waiting for the termchar. In that case <nomatch> is the clear choice, since the user really did give some input.

The specification also does not clearly state what happens when the termchar overlaps with the DTMFs allowed by the grammar. It should state that in this case the user cannot enter that DTMF as input to the grammar. (Instead, by entering the termchar the user terminates DTMF input recognition, and the result is collected without the final termchar DTMF in the utterance.)

That brings us to the case where the user presses only the termchar. If the assumption above is correct, this termchar DTMF is not matched against any grammar and so does not start the "additional digits" collection described in the "Invalid DTMF input" chapter. Instead it terminates recognition, and the recognizer should return a recognition result.

-          Since no input was provided to recognize, the outcome of the collection could be "noinput".

-          If the termination of DTMF input is treated as a request to do a "final" match against the grammars and none matches, we could assume "nomatch", but there is no input for the grammars either. A special case here is grammars that do not require any input to produce a result (repeat="0-", for example) but do require grammar matching; this is where a "termchar noinput" differs from a noinput caused by a timeout.

This should be cleared up: the specification should clearly state the expected outcome in each of these cases.
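The termchar-only case can be reproduced with a field like the one below. This is only an illustrative sketch: the form and field names and the prompt texts are made up, and the grammar is shortened to three digits to keep it brief. The open question is which of the two handlers a platform should run when the caller presses nothing but "#".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="termchar_only">
    <!-- Hypothetical field; termchar is left at its default "#" -->
    <field name="pin">
      <grammar mode="dtmf" version="1.0" root="digits">
        <rule id="digits">
          <!-- One to four digits; "#" is deliberately not part of the grammar -->
          <item repeat="1-4">
            <one-of>
              <item>1</item>
              <item>2</item>
              <item>3</item>
            </one-of>
          </item>
        </rule>
      </grammar>
      <prompt>Enter your code, then press pound.</prompt>
      <!-- Caller presses only "#": the specification does not say which handler applies -->
      <noinput>No digits were collected.</noinput>
      <nomatch>The input did not match.</nomatch>
      <filled>You entered <value expr="pin"/>.</filled>
    </field>
  </form>
</vxml>
```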

Pressing only the termchar in <record> with a local grammar (and, as described below, with dtmfterm="true") is also undefined and raises much the same concerns as above, regardless of whether any audio input has been collected.

2. <record/> - obfuscated dtmfterm

There seems to be some confusion in the VoiceXML 2.0 specification about the behavior of the dtmfterm attribute, and a number of open issues in the <record/> element generally.

The newest errata do not clear up these issues, so I hope the W3C Voice Committee can resolve them, both as errata to 2.0 and 2.1 and for the future 3.0.

2.1. Equivalence to local grammar

Chapter 2.3.6 specifies:

"The <record> element contains a 'dtmfterm' attribute as a developer convenience. A 'dtmfterm' attribute with the value 'true' is equivalent to the definition of a local DTMF grammar which matches any DTMF input. The dtmfterm attribute has priority over specified local DTMF grammars."

So from this point of view you may either write

<record name="msg" dtmfterm="true"/>

or use a local grammar (example 2):

<record name="msg" dtmfterm="false">
    <grammar mode="dtmf" root="dtmf">
        <rule id="dtmf" scope="public">
            <one-of>
                <item>1</item>
                <item>2</item>
                <item>3</item>
                <item>4</item>
                <item>5</item>
                <item>6</item>
                <item>7</item>
                <item>8</item>
                <item>9</item>
                <item>0</item>
                <item>A</item>
                <item>B</item>
                <item>C</item>
                <item>D</item>
                <item>*</item>
                <item>#</item>
            </one-of>
        </rule>
    </grammar>
</record>

The obvious difference between these two examples lies in the definition of "name$.termchar", which says: "If the dtmfterm attribute is true, and the user terminates the recording by pressing a DTMF key, then this shadow variable is the key pressed (e.g. "#"). Otherwise it is undefined." So the latter example would never fill the name$.termchar slot, because its dtmfterm attribute is not true (it has to be set to "false" explicitly, since the attribute defaults to true). In this respect the two ways of defining grammars for <record> are not really equivalent.
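To make the difference observable, a page can log the shadow variable after the recording. The following is a rough sketch with made-up form and prompt names, not text from the specification. With dtmfterm="true" the first branch is expected after a keypress, whereas with a local grammar and dtmfterm="false" the current wording would always lead to the second branch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="termchar_check">
    <record name="msg" dtmfterm="true" beep="true">
      <prompt>Record your message after the beep.</prompt>
      <filled>
        <!-- msg$.termchar is defined only if a DTMF key ended the recording -->
        <if cond="msg$.termchar != undefined">
          <log>Recording ended by key <value expr="msg$.termchar"/></log>
        <else/>
          <log>Recording ended without a DTMF key</log>
        </if>
      </filled>
    </record>
  </form>
</vxml>
```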

VoiceXML.org's certification program even has assertions 282 and 283 to test this difference.

2.2 Grammar scoping

It gets even more confusing when we read the definition of the dtmfterm attribute. The attribute table defines its value as: "If true, any DTMF keypress not matched by an active grammar will be treated as a match of an active (anonymous) local DTMF grammar. Defaults to true."

More precisely, the part "keypress NOT matched by an active grammar" is interesting in contrast to what the specification states just a little above, namely that dtmfterm with the value true "is equivalent to the definition of a local DTMF grammar". Does this mean that this equivalent grammar (dtmfterm="true") is not "active" and does not follow the grammar match precedence defined in 3.1.4? The following example tries to highlight the problem.

<form>
    <link dtmf="*" event="help"/>
    <record name="msg" dtmfterm="false" modal="false">
        <grammar mode="dtmf" root="dtmf">
            <rule id="dtmf" scope="public">
                <one-of>
                    <item>1</item>
                    <item>2</item>
                    <item>3</item>
                    <item>4</item>
                    <item>5</item>
                    <item>6</item>
                    <item>7</item>
                    <item>8</item>
                    <item>9</item>
                    <item>0</item>
                    <item>A</item>
                    <item>B</item>
                    <item>C</item>
                    <item>D</item>
                    <item>*</item>
                    <item>#</item>
                </one-of>
            </rule>
        </grammar>
    </record>
</form>

In this case entering DTMF "*" would match the local grammar and terminate the recording.

But if the "equivalent" notation dtmfterm="true" were used, should the behavior change? The keypress does match the link's grammar, so should we ignore grammar precedence and hit the link grammar? In this context there is no equivalence.

2.3 DTMF timing

"Any DTMF keypress matching an active grammar terminates recording. DTMF keypresses not matching an active grammar are ignored (and therefore do not terminate or otherwise affect recording) and may optionally be removed from the signal by the platform."

As long as termtimeout is at its default of "0s" and every active grammar expects only one digit, the statement above works: after any digit is pressed (all of which match), the grammar reaches a must-terminate state and the match is immediate (because of termtimeout). So should this be read to mean that termtimeout is always forced to 0 during record collection?

What should happen if the link in the example above required two stars to match and activate? In the grammar case, after the first "*" the link grammar needs more digits while the record's local grammar is ready to terminate, so this should trigger the interdigit timeout, correct? The interdigit timeout should also be honored in the case of invalid DTMF input.

"Any DTMF keypress matching an active grammar terminates recording".
Does this mean recognition digit by digit (no complete matches needed or expected)? That would render all grammars outside the record useless if they need more than one digit. For example:
<form>
    <link event="help">
        <grammar mode="dtmf"><!-- grammar matching ** --></grammar>
    </link>
    <link event="exit">
        <grammar mode="dtmf"><!-- grammar matching *0 --></grammar>
    </link>
    <record name="msg" dtmfterm="false" modal="false"/>
</form>

What should happen when the user presses DTMF "*"?

2.4 Input modes

What should happen if inputmodes is set to "voice"? The DTMF modality is then not used, so none of the above should have any effect. dtmfterm should honor inputmodes.
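A minimal page showing the combination in question might look like this. It is a sketch only, and the form name and prompt text are invented. The record leaves dtmfterm at "true" while the enclosing form restricts input to voice, so the expectation argued here is that a keypress neither ends the recording nor fills msg$.termchar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="voice_only">
    <!-- DTMF modality switched off for everything in this form -->
    <property name="inputmodes" value="voice"/>
    <!-- dtmfterm is true, yet DTMF should have no effect if inputmodes is honored -->
    <record name="msg" dtmfterm="true" beep="true">
      <prompt>Record your message after the beep.</prompt>
    </record>
  </form>
</vxml>
```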

2.5 Bargein behavior

An earlier question about this was sent in 2003, but I could not find any answer to it: http://lists.w3.org/Archives/Public/www-voice/2003AprJun/0031.html

So what does bargein mean in the record case?

What is bargein? In bargein the "user starts entering input during prompt playback". In the record case this must therefore mean entering something other than the "actual" recording. When this happens, it should terminate prompt playback and skip the record, producing a clean exit with the result from the collect phase.

Voice
In the record case, voice input is limited to non-local grammars that cause a transition to other forms / fields / menus, trigger links, etc.

-          If bargeintype is speech:

o   If we are still playing prompts, recognized speech (inputprovided) should terminate prompt playback and skip the record. Then we should finalize speech collection. A possible nomatch is naturally thrown in the current scope.

-          If bargeintype is hotword, nomatch is ignored and recognition should start again. In the record case this should happen both during prompt playback and during the recording itself.

DTMF
-          Prompt with "hotword" type bargein:

o   During prompt playback, non-matching input is ignored and recognition is started again.

-          Prompt with "speech" type bargein:

o   During prompt playback, any DTMF input should start DTMF collection timing to collect the (possible) rest of the DTMF input. This should result in termination of prompt playback and skipping of the record.

In both modalities, once recording has started (this includes the timeout phase), the bargein rules no longer apply, but in the record case nomatches are ignored.
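For reference, the kind of page these rules apply to could be sketched as follows. The names and prompt text are illustrative, and the comments describe the behavior proposed above, not anything the specification currently guarantees.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="leave_message">
    <!-- Non-local grammar that stays active because the record is not modal -->
    <link dtmf="*" event="help"/>
    <record name="msg" dtmfterm="false" modal="false" beep="true">
      <!-- hotword: only input matching an active grammar (here "*") should stop
           the prompt and skip the record; other input is ignored.
           With bargeintype="speech", any input would stop the prompt. -->
      <prompt bargein="true" bargeintype="hotword">
        Leave your message after the beep, or press star for help.
      </prompt>
    </record>
  </form>
</vxml>
```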

VoiceXML.org has an assertion for this: 1019, a successful bargein terminates prompt playback and skips the record.

2.6 Proposed solution

A solution for dtmfterm in record could be to define that the termchar shadow variable is also filled when a local grammar is matched, and to specify in all appropriate places that dtmfterm="true" is just shorthand for defining a local DTMF grammar that matches all digits.
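Under that proposal, a page like the following sketch would see the terminating key even though only a restricted local grammar is used. The form name and the two-key grammar are illustrative, and the comment in <filled> describes the proposed behavior, not the current one.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="proposal">
    <record name="msg" dtmfterm="false" beep="true">
      <!-- Local grammar: only "#" and "*" end the recording -->
      <grammar mode="dtmf" version="1.0" root="endkey">
        <rule id="endkey">
          <one-of>
            <item>#</item>
            <item>*</item>
          </one-of>
        </rule>
      </grammar>
      <prompt>Record your message, then press pound or star.</prompt>
      <filled>
        <!-- Proposed: msg$.termchar holds "#" or "*"; today it is undefined here -->
        <log>Terminating key: <value expr="msg$.termchar"/></log>
      </filled>
    </record>
  </form>
</vxml>
```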

3. <record/> - maxtime

The specification does not address the case where maxtime is shorter than timeout on systems where recording starts immediately after the prompt and a possible beep. On such systems it should be semantically incorrect to define a maxtime shorter than timeout.
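A concrete instance of the conflict, as a sketch with arbitrarily chosen values and names: if the platform starts recording right after the beep, the 5 second maxtime is reached before the 10 second timeout, and it is unclear whether the result is a noinput event or a recording of silence with msg$.maxtime set to true.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="short_record">
    <!-- Silence allowed before noinput: longer than maxtime below -->
    <property name="timeout" value="10s"/>
    <record name="msg" beep="true" maxtime="5s">
      <prompt>Please say your name after the beep.</prompt>
      <!-- Can this ever run if recording starts immediately after the beep? -->
      <noinput>I did not hear anything.</noinput>
    </record>
  </form>
</vxml>
```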

Received on Thursday, 6 May 2010 13:30:54 UTC