RE: My comments about the Speech Synthesis mark up language

Dear Alberto,

Thank you for your review of the SSML specification.  It's been two years,
but we thought it appropriate to send an official response as if you had
sent the comment today.

If you believe we have not adequately addressed your issues with our
responses, please let us know as soon as possible.  If we do not hear
from you within 14 days, we will take this as tacit acceptance.

Again, thank you for your input.

-- Dan Burnett
Synthesis Team Leader, VBWG

[VBWG responses are embedded, preceded by '>>>']

-----Original Message-----
From: [] On Behalf Of Ciaramella Alberto
Sent: Monday, January 15, 2001 1:35 AM
To: 'Larson, Jim A';;
Subject: My comments about the Speech Synthesis mark up language

Here follow my comments about the Speech Synthesis Markup Language
specification of the Speech Interface Framework, draft dated 3 January.

- 1) paragraph 1.2. It shows that the processing, in its different stages, is
influenced both by markup support and by non-markup behaviour.
I suggest adding here, as a general rule, that "explicit markup always takes
precedence over non-markup behaviour".
In the present version of the document this kind of rule appears as
usage note 2 at the end of 2.4, but it is definitely more general than that.

>>> Rejected.  Behavior in the specification is determined
>>> on an element-by-element basis because the markup in some
>>> cases might try to do something which the engine knows to
>>> be inappropriate.  As an example, a prosody contour with
>>> sequential pitch targets that vary wildly will not be observed
>>> very closely by any commercial engine because the audio would
>>> be exceedingly unnatural and likely unintelligible.  Additionally,
>>> requiring the markup behavior to take precedence would be
>>> difficult to enforce without audio checks that measure not
>>> just conformance, but performance.  We do not believe it is
>>> appropriate for the specification to render too fine an opinion
>>> on performance.

- 2) paragraph 1.2 point 6: waveform production markup support.
I do not agree that "the TTS markup does not provide explicit controls over
the generation of the waveforms". In fact, with the markup already introduced
in point 5 for controlling prosody, you can control both the volume and
the speed.

>>> Accepted.  We will remove this sentence.

- 3) Apart from this, still in paragraph 1.2 point 6, it is advisable to
indicate whether or not a sentence can be interrupted by barge-in. This
feature is present in section 4.1.5 of the document "Voice Extensible Markup
Language", version 2. This poses another, more general issue: what is the
relationship between the document "Speech Synthesis Markup Language" and
chapter 4 (System Output) of VoiceXML version 2? It must be made
explicit, taking care not to duplicate definitions between these
documents, in order to simplify document maintenance.

>>> Rejected.  We believe this comment has been addressed
>>> by changes to both SSML and VoiceXML. Although examples
>>> of SSML embedded in other languages are appropriate for
>>> this document, specific details are not. Barge-in behavior,
>>> for example, is outside the scope of this specification.

Best regards

Alberto Ciaramella
via Reiss Romoli 274
10148 Torino (Italy)
tel. +39 011 228 6210
fax. +39 011 228 6207


-----Original Message-----
From: Larson, Jim A []
Sent: Monday, January 8, 2001 7:15 PM
To: ''
Subject: [ectf-tgasr] New speech specs from W3C

You are invited to review and comment on the following W3C Speech Interface
Framework documents authored by the W3C Voice Browser Working Group. 

Last call working draft of the Speech Synthesis Markup Language: The Speech
Synthesis Markup Language Specification is part of this set of new markup
specifications for voice browsers, and is designed to provide a rich,
XML-based markup language for assisting the generation of synthetic speech
in web and other applications. The essential role of the markup language is
to provide authors of synthesizable content a standard way to control
aspects of speech such as pronunciation, volume, pitch, rate, etc. across
different synthesis-capable platforms.

Last call working draft of the Speech Grammar Markup Language: This document
defines syntax for representing grammars for use in speech recognition so
that developers can specify the words and patterns of words to be listened
for by a speech recognizer. The syntax of the grammar format is presented in
two forms, an augmented BNF syntax and an XML syntax. The specification
intends to make the two representations directly mappable and allow
automatic transformations between the two forms.

Working draft of the Stochastic Language Models (N-Gram) Specification: This
document defines syntax for representing N-Gram (Markovian) stochastic
grammars within the W3C Speech Interface Framework. The use of stochastic
N-Gram models has a long and successful history in the research community
and is now increasingly affecting commercial systems, as the market asks for
more robust and flexible solutions. The primary purpose of specifying a
stochastic grammar format is to support large vocabulary and open vocabulary
applications.

We encourage you to subscribe to the public discussion list and to mail in
your comments before January 31, 2001. To subscribe, send an email to
www-voice-request@w3.org with the word subscribe in the subject line
(include the word unsubscribe if you want to unsubscribe). A public archive
is available.

Jim Larson
W3C Voice Browser Working Group

Received on Friday, 8 August 2003 19:52:08 UTC