RE: VoiceXML 2.0: Official Response #1 to Candidate Recommendation Issues from McGlashan, Scott on 2003-12-15 (www-voice@w3.org from October to December 2003)

From: McGlashan, Scott <scott.mcglashan@hp.com>
Date: Mon, 15 Dec 2003 12:32:53 +0100
To: "Dean Sturtevant" <deansturtevant@comcast.net>
Cc: <www-voice@w3.org>
Message-ID: <78CB74A9DA19BD4E9F720B51583E75A9300E5F@sooexc01.emea.cpqcorp.net>
DTMF local grammars can still be specified - so you can terminate the
recording with specific DTMF sequences, etc. Local speech grammars were
not supported by any implementations during the implementation report
phase, hence the decision to withdraw this (optional) feature. 
 
Scott McGlashan
Co-chair, W3C VBWG

-----Original Message-----
From: Dean Sturtevant [mailto:deansturtevant@comcast.net] 
Sent: Sunday, December 14, 2003 18:16
To: McGlashan, Scott; Guillaume Berche
Cc: www-voice@w3.org
Subject: Re: VoiceXML 2.0: Official Response #1 to Candidate
Recommendation Issues


Scott,
I apologize if I'm out of line in responding to this, but I am puzzled
by the proposed change to the specification. If only non-local speech
grammars are active during a recording, what is the purpose of
specifying a local grammar? In fact, why restrict the set of active
grammars at all in this case?  (why no DTMF? why no local grammars?)

- Dean

----- Original Message ----- 
From: "McGlashan, Scott" <scott.mcglashan@hp.com>
To: "Guillaume Berche" <guillaume.berche@eloquant.com>
Cc: <www-voice@w3.org>
Sent: Sunday, December 14, 2003 11:43 AM
Subject: RE: VoiceXML 2.0: Official Response #1 to Candidate
Recommendation Issues



Guillaume,

Thank you again for your timely response and your acceptance of our
disposition on these issues.

On your one remaining issue, CR5-13, we propose the following revised
resolution.

CR5-13 accepted with modifications

We believe that the point at which recording begins is clearly defined.
Section 2.3.6 states:

"A recording begins at the earliest after the playback of any prompts
(including the 'beep' tone if defined). As an optimization, a platform
may begin recording when the user starts speaking."

i.e. the recording may include initial silence, etc., if the platform does
not use the optimization (e.g. voice activity detection). With the
optimization, the recording can begin with the user's speech. Whether
music or other audio triggers voice activity detection is
platform-specific. Note that this behavior applies independently of
whether speech recognition is supported (while the recording and
recognition processes use the same audio data stream, these processes
are independent and therefore their voice activity detection mechanisms
may differ).

The timeout interval is clearly defined: "A timeout interval is defined
to begin immediately after prompt playback (including the 'beep' tone if
defined) and its duration is determined by the 'timeout' property."

The timeout interval has an effect on both recording and recognition
(which are logically independent).

For recording, the impact is specified as follows: "If the timeout
interval is exceeded before recording begins, then a <noinput> event is
thrown." In the case of non-optimized recording, recording always begins
after prompt playback, so <noinput> would never be thrown. With optimized
recording, however, <noinput> may be thrown if no voice activity is
detected before the timeout interval elapses.
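
As an illustration only, the recording rule above can be sketched as a
small decision function. This is a model of the prose, not platform code:
the function name, its parameters, and the treatment of speech starting
exactly at the timeout boundary (counted here as in time) are assumptions
made for the example.

```python
from typing import Optional


def recording_outcome(optimized: bool,
                      timeout_s: float,
                      speech_onset_s: Optional[float]) -> str:
    """Return "recording" or "noinput" for a <record> element.

    All times are in seconds, measured from the end of prompt playback
    (including the beep, if any). speech_onset_s is None when no voice
    activity is ever detected.
    """
    if not optimized:
        # Without the optimization, recording starts as soon as the
        # prompts finish, so the timeout can never be exceeded before
        # recording begins.
        return "recording"
    if speech_onset_s is None or speech_onset_s > timeout_s:
        # With the optimization, recording waits for voice activity;
        # none arrived within the timeout interval.
        return "noinput"
    return "recording"


if __name__ == "__main__":
    print(recording_outcome(False, 5.0, None))  # non-optimized, silence
    print(recording_outcome(True, 5.0, None))   # optimized, silence
    print(recording_outcome(True, 5.0, 2.0))    # optimized, speech at 2s
```

The non-optimized branch ignores the timeout entirely, which is the point
made above: in that case <noinput> would never be thrown for recording.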

For recognition, the situation is more complex. We are modifying the
specification (due to implementation report feedback) so that if
recognition is supported during recording (this is an optional feature),
then only non-local speech grammars are active. If a non-local speech
grammar is matched by audio input, then execution is immediately
transferred to its enclosing element. This raises the issue of whether a
<noinput> or <nomatch> could be thrown by the recognition process. A
<noinput> could be generated if the timeout interval has elapsed. A
<nomatch> could be generated if the audio triggers recognition but does
not match the active grammar. Our belief is that throwing these events
by the recognition process during recording is undesirable and not what
VoiceXML authors expect. Consequently, we are considering clarifying the
specification to make it clear that <noinput> and <nomatch> events are
never thrown from the recognition process during recording.
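
To make the proposed recognition behavior concrete, here is a rough
sketch of it as a model. The Grammar class, its fields, and the function
names are hypothetical and exist only for this example; matching is
reduced to exact phrase lookup, and only the speech side is modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Grammar:
    name: str
    mode: str            # "voice" or "dtmf"
    local: bool          # True if declared inside the <record> itself
    phrases: frozenset   # phrases the grammar accepts (toy matching)


def active_speech_grammars(grammars: List[Grammar]) -> List[Grammar]:
    """Only non-local speech grammars are active during recording."""
    return [g for g in grammars if g.mode == "voice" and not g.local]


def recognition_result(grammars: List[Grammar],
                       utterance: Optional[str]) -> Optional[str]:
    """Return the name of the matched non-local speech grammar, or None.

    A match means execution transfers to that grammar's enclosing
    element. None means recording simply continues: the recognition
    process raises neither <noinput> (utterance is None) nor <nomatch>
    (utterance matches no active grammar).
    """
    if utterance is None:
        return None
    for grammar in active_speech_grammars(grammars):
        if utterance in grammar.phrases:
            return grammar.name
    return None
```

For example, given a document-level "help" grammar and a speech grammar
local to the <record>, only "help" can interrupt the recording; any other
audio is just recorded.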


Guillaume, please let us know whether you accept this disposition. If
you do not explicitly require the clarification concerning the throwing of
<noinput> and <nomatch> events by recognition during recording, the
group will use its discretion in deciding whether the clarification needs
to be applied.

Thanks

Scott
Received on Monday, 15 December 2003 06:33:08 UTC