Re: VoiceXML 2.0: Official Response #1 to Candidate Recommendation Issues

Scott,
I apologize if I'm out of line in responding to this, but I am puzzled by
the proposed change to the specification. If only non-local speech grammars
are active during a recording, what is the purpose of specifying a local
grammar? In fact, why restrict the set of active grammars at all in this
case?  (why no DTMF? why no local grammars?)

- Dean

----- Original Message ----- 
From: "McGlashan, Scott" <scott.mcglashan@hp.com>
To: "Guillaume Berche" <guillaume.berche@eloquant.com>
Cc: <www-voice@w3.org>
Sent: Sunday, December 14, 2003 11:43 AM
Subject: RE: VoiceXML 2.0: Official Response #1 to Candidate Recommendation
Issues



Guillaume,

Thank you again for your timely response and your acceptance of our
disposition on these issues.

On your one remaining issue, CR5-13, we propose the following revised
resolution.

CR5-13 accepted with modifications

We believe that when recording begins is clearly defined: in Section
2.3.6, it states:

"A recording begins at the earliest after the playback of any prompts
(including the 'beep' tone if defined). As an optimization, a platform
may begin recording when the user starts speaking."

That is, the recording may include initial silence, etc., if the platform
does not use the optimization (e.g. voice activity detection). With the
optimization, the recording can begin with the user's speech. Whether
music or other audio triggers voice activity detection is
platform-specific. Note that this behavior applies independent of
whether speech recognition is supported (while the recording and
recognition processes use the same audio data stream, these processes
are independent and therefore their voice activity detection mechanisms
may be different).

The timeout interval is clearly defined: "A timeout interval is defined
to begin immediately after prompt playback (including the 'beep' tone if
defined) and its duration is determined by the 'timeout' property."

The timeout interval has an effect on both recording and recognition
(which are logically independent).

For recording, the impact is specified in "If the timeout interval is
exceeded before recording begins, then a <noinput> event is thrown." In
the case of non-optimized recording, recording always begins after
prompt playback, so <noinput> would never be thrown. With optimized
recording, however, <noinput> may be thrown if no voice activity is
detected before the timeout interval elapses.
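
As a rough illustration of the recording case just described, the
decision could be modelled as below. This is only a sketch, not
specification text: the function name, its parameters, and the choice to
treat voice onset exactly at the timeout as "in time" are assumptions
made for the example.

```python
from typing import Optional


def noinput_thrown(optimized: bool, timeout_s: float,
                   voice_onset_s: Optional[float]) -> bool:
    """Return True if <noinput> would be thrown before recording begins.

    Times are seconds measured from the end of prompt playback (including
    the beep). voice_onset_s is None when no voice activity is detected.
    All names here are hypothetical.
    """
    if not optimized:
        # Non-optimized recording starts right after prompt playback,
        # so the timeout interval can never be exceeded first.
        return False
    # Optimized recording starts at voice onset; with no onset it never
    # starts. Assumption: onset exactly at the timeout counts as in time.
    return voice_onset_s is None or voice_onset_s > timeout_s
```

For example, noinput_thrown(True, 5.0, None) is True, while
noinput_thrown(False, 5.0, None) is False.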

For recognition, the situation is more complex. We are modifying the
specification (due to implementation report feedback) so that if
recognition is supported during recording (this is an optional feature),
then only non-local speech grammars are active. If a non-local speech
grammar is matched by audio input, then execution is immediately
transferred to its enclosing element. This raises the issue of whether a
<noinput> or <nomatch> could be thrown by the recognition process. A
<noinput> could be generated if the timeout interval has elapsed. A
<nomatch> could be generated if the audio triggers recognition but does
not match the active grammar. Our belief is that throwing these events
by the recognition process during recording is undesirable and not what
VoiceXML authors expect. Consequently, we are considering clarifying the
specification to make it clear that <noinput> and <nomatch> events are
never thrown from the recognition process during recording.
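
The recognition behavior proposed above can be sketched in the same
way. Again, this is a hedged model under stated assumptions rather than
the specification's wording: the Grammar class, its fields, and both
function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Grammar:
    name: str
    local: bool  # True if the grammar is scoped to the <record> itself
    mode: str    # "voice" or "dtmf"


def active_during_recording(grammars: List[Grammar]) -> List[Grammar]:
    # Proposed rule: only non-local speech grammars are active.
    return [g for g in grammars if not g.local and g.mode == "voice"]


def recognition_event(grammars: List[Grammar],
                      matched_name: Optional[str]) -> Optional[str]:
    """Return "transfer" when an active grammar is matched, else None.

    None stands for "recognition throws nothing": under the clarification
    being considered, neither <noinput> nor <nomatch> comes from the
    recognition process during recording.
    """
    active = {g.name for g in active_during_recording(grammars)}
    if matched_name in active:
        # Execution moves to the element enclosing the matched grammar.
        return "transfer"
    return None
```

In this model, audio that matches only a local or DTMF grammar, or that
matches nothing at all, leaves the recording undisturbed.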


Guillaume, please let us know whether you accept this disposition. If
you do not explicitly require the clarification concerning the throwing of
<noinput> and <nomatch> events by recognition during recording, the
group will use its discretion in whether the clarification needs to be
applied.

Thanks

Scott

Received on Sunday, 14 December 2003 12:15:22 UTC