RE: Comments on SSML Draft

Dear Andrew,

Thank you for your review of the most recent SSML draft.  Our responses
are below.

If you believe we have not adequately addressed your issues with our
responses, please let us know as soon as possible.  If we do not hear
from you within 14 days, we will take this as tacit acceptance.

Again, thank you for your input.

-- Dan Burnett
Synthesis Team Leader, VBWG

[VBWG responses are embedded, preceded by '>>>']

-----Original Message-----
From: Andrew Thompson
Sent: Tuesday, January 21, 2003 5:49 PM
Subject: Comments on SSML Draft

On Tuesday, Jan 21, 2003, at 04:26 America/New_York, Marc Schroeder wrote:

> Hi,
> this is a minor comment regarding the SSML <break> element, more
> specifically regarding the meaning of the attribute value "none" for
> the time attribute.

Which reminds me to send my comments in!
On the off chance anyone is aware that I'm part of the working group 
for JSR 113 (Java Speech API 2.0), I should make it clear that these 
are my personal comments, not those of that working group as a whole.

2.1.6 Sub Element

Does the table presented in this section have unintentional duplicates? 
If not, it would be helpful to explain the difference between:

"interpret-as: number format: ordinal" and the later

"interpret-as: ordinal"

These seem to be two ways of specifying the same functionality?

>>> Proposed disposition:  N/A
>>> Actually, this case in the examples in section 2.1.4 was to show
>>> that there are multiple ways this functionality might be specified
>>> (when specified at a later date). In any case, we will be removing
>>> these examples because they have led to confusion about whether or
>>> not the values of the attributes have been specified, which they haven't.

2.2.1 Voice Element

name attribute: No whitespace in the name seems overly restrictive - 
why not just comma separate the list of names as with font-face in CSS? 
The voice names are implementation dependent, therefore if whitespace 
is not allowed the SSML implementor will potentially have to map native 
voice names to SSML voice names, which seems to make SSML harder to use 
for developers (and possibly users).

>>> Proposed disposition:  Rejected
>>> We chose space-separated tokens for consistency with other
>>> XML specifications. The NMTOKENS data type, for example,
>>> commonly used in XML-based specifications' Document Type
>>> Definitions, is a space-separated list of NMTOKEN. Using
>>> whitespace as the separator also simplifies XSLT style
>>> sheets operating on SSML.
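
As a rough illustration of the two positions above, the following Python
sketch reads a space-separated name list from a <voice> element and shows
the kind of native-name mapping the comment is worried about. The function
names, the sample voice names, and the hyphen-joining rule are all
assumptions made for this example; nothing here is defined by SSML.

```python
import xml.etree.ElementTree as ET
from typing import List


def voice_names(voice_xml: str) -> List[str]:
    """Return the names listed in a <voice> element's name attribute, in order."""
    element = ET.fromstring(voice_xml)
    # split() with no argument splits on any run of whitespace, which is
    # how a space-separated (NMTOKENS-style) list is read.
    return element.get("name", "").split()


def to_token(native_name: str) -> str:
    """Hypothetical mapping from a native voice name to a whitespace-free token."""
    # Assumption: joining the words with hyphens is acceptable to the engine.
    return "-".join(native_name.split())


print(voice_names('<voice name="Mike Mary">Hello</voice>'))
print(to_token("Bruce Deluxe"))
```

An implementation whose native voice names contain spaces would need some
mapping along the lines of to_token() in both directions.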

variant attribute: Variant is defined as an integer. The spec states 
"eg, the second or next male child voice" but it does not specify how 
to express "next" as an integer. Would this be "+1" for next and "-1" 
for previous, or something else?

>>> Proposed disposition:  N/A
>>> We will clarify the specification to indicate that only
>>> positive integers are expected (without pluses or minuses).
>>> We will also remove "or next", since relative specifiers were
>>> never intended for this attribute.
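
A minimal sketch of the clarified rule, assuming a processor chooses to
reject anything other than an unsigned positive integer. The function name
and the decision to raise an error (rather than ignore the attribute) are
assumptions made for this example.

```python
def parse_variant(value: str) -> int:
    """Parse a variant attribute value, accepting only unsigned positive integers."""
    # isdigit() rejects "+1", "-1", "" and "1.5"; isascii() additionally
    # rejects digit characters from other scripts.
    if not (value.isascii() and value.isdigit()) or int(value) == 0:
        raise ValueError(f"variant must be a positive integer, got {value!r}")
    return int(value)


print(parse_variant("2"))
```

Under this reading, "+1" and "-1" are simply invalid values rather than
relative specifiers.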

Relating to this point, in general I have found it useful to be able 
to ask for voices like this: "give me an adult male voice, which must 
not be the same as the current voice". This can be used to implement 
"barge-in" type functionality. It might be worthwhile considering 
adding another attribute "exclude", in this fashion:

<voice gender="male" age="30" exclude="bruce, fred">

"current" could then be a special voice name:

<voice gender="male" age="30" exclude="current"> - give me any adult 
male voice so long as it's not the same as the current voice. This 
allows one to specify a similar voice in a more natural way than 
relying on the proposed "variant" attribute. The value of "variant" is 
a simple integer index and would be vendor specific anyway. "Exclude" 
would also make sense if a future SSML spec defines some standard voice 
names with well known characteristics.

>>> Proposed disposition:  Rejected
>>> This is a great suggestion. We will be happy to consider this
>>> new feature for the next version of SSML (after 1.0).
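
To make the suggested behaviour concrete, here is a hedged Python sketch of
how a processor might resolve such a request. The "exclude" attribute is
only a proposal in this thread and is not part of SSML 1.0; the Voice
class, the select_voice() function, the comma-separated list syntax
(copied from the example above), and the nearest-age tie-break are all
assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Voice:
    name: str
    gender: str
    age: int


def select_voice(available: List[Voice], gender: str, age: int,
                 exclude: str = "", current: Optional[Voice] = None) -> Optional[Voice]:
    """Pick the closest-aged voice of the given gender that is not excluded."""
    excluded = {name.strip() for name in exclude.split(",") if name.strip()}
    # "current" is treated as a special name standing for the active voice.
    if "current" in excluded:
        excluded.discard("current")
        if current is not None:
            excluded.add(current.name)
    candidates = [v for v in available
                  if v.gender == gender and v.name not in excluded]
    if not candidates:
        return None
    # Assumption: prefer the candidate whose age is nearest the requested age.
    return min(candidates, key=lambda v: abs(v.age - age))


voices = [Voice("bruce", "male", 30), Voice("fred", "male", 35),
          Voice("tom", "male", 50), Voice("anna", "female", 30)]
print(select_voice(voices, "male", 30, exclude="bruce, fred"))
print(select_voice(voices, "male", 30, exclude="current", current=voices[0]))
```

Returning None when every matching voice is excluded is one possible
policy; a processor could equally fall back to the current voice.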

2.2.3 Break element

time attribute: The value of "none" seems troublesome to me. If I read

<break time="none">

in a document, I would assume it meant "do not place a break between 
these elements" (break of length 0 seconds).
The spec defines 'The value "none" indicates that a normal break 
boundary should be used. The other five values indicate increasingly 
large break boundaries between words.'

I'd prefer <break time="default"> for this functionality. It seems more 
natural, and is more consistent with usage in 'section 2.2.4 prosody'. 
"none" could be retained, and mean "a short (ideally zero length) 
break", if the group feels engines can support that.

>>> Proposed disposition:  Partially accepted
>>> We will be reintroducing the distinction between break strength
>>> and time suggested by Alex Monaghan. The solution will also have the
>>> following characteristics: 
>>> o It will be possible to have a break of strength="none" to ensure
>>>   that no break occurs.
>>> o When only the strength attribute is present, the break time will
>>>   be based on the processor's interpretation of the strength value.
>>> o When only the time attribute is present, other prosodic strength
>>>   indicators may change at the discretion of the processor.
>>> o When neither attribute is present, the element will increase the
>>>   break strength from its current value.
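
The four characteristics listed above can be sketched as a small
resolution function. This is an illustration under stated assumptions, not
the working group's algorithm: the strength names other than "none" are
taken from the later SSML 1.0 Recommendation, the millisecond values are
invented placeholders for a processor-specific mapping, and the handling
of the case where both attributes are present is a guess, since the
response does not describe it.

```python
from typing import Optional, Tuple

# Strength levels in increasing order (names other than "none" are assumed).
STRENGTHS = ["none", "x-weak", "weak", "medium", "strong", "x-strong"]

# Hypothetical processor-specific durations in milliseconds.
DEFAULT_MS = {"none": 0, "x-weak": 50, "weak": 150,
              "medium": 300, "strong": 600, "x-strong": 1000}


def resolve_break(strength: Optional[str], time_ms: Optional[int],
                  current: str = "weak") -> Tuple[str, int]:
    """Return (strength, duration in ms) for a <break> element."""
    if strength is not None and strength not in DEFAULT_MS:
        raise ValueError(f"unknown strength: {strength!r}")
    if strength is None and time_ms is None:
        # Neither attribute: step up one level from the current strength.
        index = min(STRENGTHS.index(current) + 1, len(STRENGTHS) - 1)
        stronger = STRENGTHS[index]
        return stronger, DEFAULT_MS[stronger]
    if time_ms is None:
        # Only strength: the duration is the processor's interpretation.
        return strength, DEFAULT_MS[strength]
    if strength is None:
        # Only time: the processor may adjust strength; this sketch keeps it.
        return current, time_ms
    # Both present (assumption): honour both values as given.
    return strength, time_ms


print(resolve_break("none", None))
print(resolve_break(None, None, current="weak"))
```

With these placeholder values, strength="none" yields a zero-length break,
which matches the reading of "none" requested in the comment.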

SEE ALSO: my comment on Appendix A below.

3.3 Pronunciation Lexicon

On the question of element specific lexicons raised in the document, I 
note one could use say-as as a limited way of having element specific 
pronunciation, eg,
<say-as interpret-as="lexiconKey" lexicon="british.file">tomato</say-as>

Of course, this is really just another way of achieving what the 
<phoneme> element does.

>>> Proposed disposition:  Rejected
>>> As you pointed out, this specific use case can be
>>> accomplished via the <phoneme> element.

My general concern about element specific lexicons is the processing 
cost. eg, assume the document as a whole has a lexicon in use (A), and 
a sub element specifies a new lexicon (B). Presumably the synthesis 
engine must perform lookups as if (A) and (B) are merged, overriding 
pronunciations which occur in A with those in B. It then needs to 
unload (B) when the element is exited. This sounds like it could prove 
too costly for a handheld device (PDA, Cellphone), and indeed, even a 
desktop system might struggle to change lexicon every other word.
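
The lookup behaviour described in this paragraph (B overrides A while the
element is open, and A alone applies again afterwards) can be sketched in
Python as a layered dictionary. The LexiconStack class, its method names,
and the pronunciation strings are placeholders invented for this example.
The sketch only models lookup order; it says nothing about the cost of
loading or parsing a lexicon file, which is the part likely to matter on a
small device.

```python
from collections import ChainMap
from contextlib import contextmanager


class LexiconStack:
    """Layered pronunciation lookup: inner lexicons shadow outer ones."""

    def __init__(self, document_lexicon):
        self._chain = ChainMap(document_lexicon)

    @contextmanager
    def scoped(self, element_lexicon):
        # Entering the element: put lexicon B in front of lexicon A
        # without copying or modifying either dictionary.
        outer = self._chain
        self._chain = outer.new_child(element_lexicon)
        try:
            yield self
        finally:
            # Leaving the element: drop B again.
            self._chain = outer

    def lookup(self, word):
        return self._chain.get(word)


lexicon_a = {"tomato": "t ah m ey t ow", "data": "d ey t ah"}
lexicon_b = {"tomato": "t ah m aa t ow"}

lexicons = LexiconStack(lexicon_a)
with lexicons.scoped(lexicon_b):
    print(lexicons.lookup("tomato"))  # taken from B
    print(lexicons.lookup("data"))    # falls through to A
print(lexicons.lookup("tomato"))      # A again
```

In this form, entering and leaving an element touches only the lookup
chain, so the per-element overhead would come mainly from fetching and
parsing lexicon (B) in the first place.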

At the very least I think this feature would have to be implemented 
with no more granularity than per <paragraph> element. <sentence> seems 
too fine grained.

>>> Proposed disposition:  N/A
>>> Thank you for your feedback. After extensive discussion,
>>> we were unable to find sufficient use cases to warrant
>>> adding the lower-level lexicon functionality to the
>>> specification at this time.

Appendix A: Example SSML

The first example has:

<sentence>The step is from Stephanie Williams and arrived at 

The time attribute is optional on <break>, but section 2.2.3 does not 
specify what the default value for the "time" attribute is when it is 
not specified. If the default value is "none", then the break used is 
the normal word break length, which is not what the example above 
implies; it implies something longer than a normal break. SEE ALSO my 
comment on <break> above.

>>> Proposed disposition:  (see our response above to the <break>
>>>                         element suggestions)


AndyT (lordpixel - the cat who walks through walls)
A little bigger on the inside

         (see you later space cowboy ...)

Received on Friday, 8 August 2003 20:11:53 UTC