Comments on SSML Draft from Andrew Thompson on 2003-01-22 (www-voice@w3.org from January to March 2003)

From: Andrew Thompson <lordpixel@mac.com>
Date: Tue, 21 Jan 2003 20:49:28 -0500
To: www-voice@w3.org
Message-Id: <BE15C2F1-2DAB-11D7-B3A7-000A27D7D9DC@mac.com>

On Tuesday, Jan 21, 2003, at 04:26 America/New_York, Marc Schroeder wrote:

>
> Hi,
>
> this is a minor comment regarding the SSML <break> element 
> (http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/#S2.2.3), more 
> specifically regarding the meaning of the attribute value "none" for 
> the time attribute.

Which reminds me to send my comments in!

On the off chance anyone is aware that I'm part of the working group 
for JSR 113 (Java Speech API 2.0), I should make clear that these are 
my personal comments, not those of that working group as a whole.

2.1.6 Sub Element

Does the table presented in this section have unintentional duplicates? 
If not, it would be helpful to explain the difference between:

"interpret-as: number format: ordinal" and the later

"interpret-as: ordinal"

These seem to be two ways of specifying the same functionality?

2.2.1 Voice Element

name attribute: No whitespace in the name seems overly restrictive. Why 
not just comma-separate the list of names, as with font-family in CSS? 
The voice names are implementation dependent, so if whitespace is not 
allowed the SSML implementor will potentially have to map native voice 
names to SSML voice names, which seems to make SSML harder to use for 
developers (and possibly users).
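
As a rough illustration of the comma-separated alternative, the Python 
sketch below splits a name list on commas and trims the surrounding 
whitespace, so a native voice name containing a space survives intact. 
The helper name parse_voice_names and the sample voice names are 
hypothetical; nothing here is taken from the SSML draft.

```python
from typing import List


def parse_voice_names(value: str) -> List[str]:
    """Split a comma-separated voice list, keeping whitespace inside names.

    Hypothetical helper for illustration only; the draft separates
    names with whitespace rather than commas.
    """
    return [name.strip() for name in value.split(",") if name.strip()]


# Whitespace-separated parsing breaks a two-word native name apart.
print("Agnes Junior Bruce".split())

# Comma-separated parsing keeps the two-word name whole.
print(parse_voice_names("Agnes Junior, Bruce"))
```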

variant attribute: Variant is defined as an integer. The spec states 
"eg, the second or next male child voice" but it does not specify how 
to express "next" as an integer. Would this be "+1" for next and "-1" 
for previous, or something else?

Relating to this point, in general I have found it useful to be able 
to ask for voices like this: "give me an adult male voice, which must 
not be the same as the current voice". This can be used to implement 
"barge-in" type functionality. It might be worthwhile considering 
adding another attribute, "exclude", in this fashion:

<voice gender="male" age="30" exclude="bruce, fred">

"current" could then be a special voice name:

<voice gender="male" age="30" exclude="current"> - give me any adult 
male voice so long as it's not the same as the current voice. This 
allows one to specify a similar voice in a more natural way than 
relying on the proposed "variant" attribute. The value of "variant" is 
a simple integer index and would be vendor specific anyway. "Exclude" 
would also make sense if a future SSML spec defines some standard voice 
names with well known characteristics.
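
To make the proposed "exclude" semantics concrete, here is a minimal 
Python sketch of how an engine might resolve such a request. 
Everything in it is an assumption for illustration: the Voice record, 
the select_voice and parse_exclude helpers, the sample voices, and the 
max_age_gap tolerance used to decide what counts as a matching age. It 
is not how any particular engine or the SSML draft defines voice 
selection.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Voice:
    # Hypothetical voice record; real engines expose richer metadata.
    name: str
    gender: str
    age: int


def parse_exclude(value: str, current: Optional[Voice]) -> Set[str]:
    """Turn exclude="bruce, current" into a set of concrete voice names."""
    names = {n.strip().lower() for n in value.split(",") if n.strip()}
    if "current" in names:
        # "current" is the proposed special name for the active voice.
        names.discard("current")
        if current is not None:
            names.add(current.name.lower())
    return names


def select_voice(
    voices: List[Voice],
    gender: str,
    age: int,
    exclude: str = "",
    current: Optional[Voice] = None,
    max_age_gap: int = 15,  # assumed tolerance, not from the draft
) -> Optional[Voice]:
    """Return the closest-aged voice of the given gender that is not excluded."""
    banned = parse_exclude(exclude, current)
    candidates = [
        v
        for v in voices
        if v.gender == gender
        and abs(v.age - age) <= max_age_gap
        and v.name.lower() not in banned
    ]
    # Prefer the smallest age difference; None if nothing qualifies.
    return min(candidates, key=lambda v: abs(v.age - age), default=None)


VOICES = [
    Voice("bruce", "male", 30),
    Voice("fred", "male", 35),
    Voice("ralph", "male", 45),
    Voice("agnes", "female", 30),
]

# <voice gender="male" age="30" exclude="bruce, fred">
print(select_voice(VOICES, "male", 30, exclude="bruce, fred"))

# <voice gender="male" age="30" exclude="current">, with bruce speaking
print(select_voice(VOICES, "male", 30, exclude="current", current=VOICES[0]))
```

Note that the request is expressed purely in terms of voice 
characteristics and names, with no vendor-specific index involved.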

2.2.3 Break element

time attribute: The value "none" seems troublesome to me. If I read

<break time="none">

in a document, I would assume it meant "do not place a break between 
these elements" (break of length 0 seconds).
However, the spec states: 'The value "none" indicates that a normal 
break boundary should be used. The other five values indicate 
increasingly large break boundaries between words.'

I'd prefer <break time="default"> for this functionality. It seems more 
natural, and is more consistent with usage in 'section 2.2.4 prosody'. 
"none" could be retained, and mean "a short (ideally zero length) 
break", if the group feels engines can support that.

SEE ALSO: my comment on Appendix A below.

3.3 Pronunciation Lexicon

On the question of element specific lexicons raised in the document, I 
note one could use say-as as a limited way of having element specific 
pronunciation, eg,
<say-as interpret-as="lexiconKey" lexicon="british.file">tomato</say-as>

Of course, this is really just another way of achieving what the 
<phoneme> element does.

My general concern about element specific lexicons is the processing 
cost. For example, assume the document as a whole has a lexicon in use 
(A), and a sub element specifies a new lexicon (B). Presumably the 
synthesis engine must perform lookups as if (A) and (B) are merged, 
with pronunciations in B overriding those in A. It then needs to 
unload (B) when the element is exited. This sounds like it could prove 
too costly for a handheld device (PDA, cellphone), and indeed, even a 
desktop system might struggle to change lexicon every other word.
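
The "merged" lookup semantics described above can be sketched in 
Python with a layered mapping, where entering an element pushes (B) on 
top of (A) and leaving it pops (B) again. This is only a model of the 
lookup behaviour, under the assumption that both lexicons are already 
loaded into memory as dictionaries; it does not model the cost of 
fetching and parsing (B), which is the part most likely to hurt on a 
small device. The lexicon contents and the phoneme strings are made-up 
placeholders.

```python
from collections import ChainMap

# Lexicon A: in effect for the whole document (placeholder entries).
document_lexicon = {
    "tomato": "t ah m ey t ow",
    "data": "d ey t ah",
}

# Lexicon B: specified by a sub element (placeholder entry).
element_lexicon = {
    "tomato": "t ah m aa t ow",
}

# Only A is active at document level.
active = ChainMap(document_lexicon)

# Entering the element: layer B over A without copying either one.
active = active.new_child(element_lexicon)
print(active["tomato"])  # found in B, which overrides A
print(active["data"])    # not in B, so the lookup falls through to A

# Leaving the element: drop the B layer so only A remains.
active = active.parents
print(active["tomato"])  # A's pronunciation again
```

With this layering, entering and leaving an element does not copy any 
entries, but every lookup inside the element may have to consult both 
lexicons.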

At the very least I think this feature would have to be implemented 
with no finer granularity than per <paragraph> element. <sentence> 
seems too fine-grained.

Appendix A: Example SSML

The first example has:

<sentence>The step is from Stephanie Williams and arrived at 
<break/>3:45</sentence>

The time attribute is optional on <break>, but section 2.2.3 does not 
specify a default value for "time" when it is omitted. If the default 
is "none", then the break used is the normal word break length, which 
is not what the example above implies: it implies something longer 
than a normal break. SEE ALSO my comment on <break> above.

Thanks!

AndyT (lordpixel - the cat who walks through walls)
A little bigger on the inside

         (see you later space cowboy ...)
Received on Tuesday, 21 January 2003 20:49:29 UTC