RE: W3C speech synth mark-up - the COST 258 response

I'd like to pick up on some of the points made by Mark Walker
in response to Alex Monaghan's comments.

Firstly, a little background.
We have been using Text to Speech for about 18 months,
to produce alternative media for visually impaired customers.
We have learned over that time just what type of material
is suitable.
Our needs are:
XML source.
Ability to insert external audio files into the audio stream
(audible navigation points, tone bursts at 55 Hz which can be
found when the tape is played fast forward).
Ability to add to a dictionary / word set those words which
the synth gets wrong.
Ability to identify standard items such as dates and acronyms,
and have them spoken correctly.
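
To make the dictionary requirement concrete, here is a minimal sketch in
Python of a word-level exception list applied to the text before it reaches
the synth. The function name apply_exceptions and the dictionary entries are
hypothetical, chosen only for illustration; they are not part of Laureate or
of any synth's API.

```python
import re

# Hypothetical exception dictionary: words the synth gets wrong, mapped to
# respellings it reads correctly. The entries are illustrative only.
EXCEPTIONS = {
    "XSLT": "X S L T",
    "W3C": "W three C",
}

def apply_exceptions(text, exceptions=EXCEPTIONS):
    """Replace whole words that appear in the exception dictionary."""
    if not exceptions:
        return text
    # Longest keys first, so a longer entry wins over a shorter one.
    keys = sorted(exceptions, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"
    )
    return pattern.sub(lambda m: exceptions[m.group(1)], text)
```

Matching on word boundaries means only whole words are respelled, so a
longer word that merely contains an entry is left alone. Non-English words
would be added to the same dictionary like any other entry.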

A potential further use is for digital talking books.
With text and no audio, a customer may ask for a word to be spelled
out, or the whole book to be read. This has not been realised
to date. (See

Taking a few points from the email.

> 1 - It is not clear who the intended users of this markup 
> language are.
> There are two
> obvious types of possible users: speech synthesis system 
> developers, and
> application
> developers. 

We clearly fall into the application side. 
The grammar side of VoiceXML simply has me puzzled :-)

Because of this I would support the separation of these two,
possibly following the example set by the XSL WG?

> ****Again, the current form of the specification was largely 
> developed in a
> vacuum of information on potential usage models.

I hope my process outline above provides a real life use case.
We are desperately seeking an alternative to Laureate, which
only goes half way there.
(Its new word addition is brilliant, except that it becomes
hard to include non-English new words :-)

>   Does the 
> possibility of
> mixing high and low elements really represent an 
> **insurmoutable** barrier
> to supporting SSML?  Please provide more detail.

My response to this point is that for an application developer,
the go/no-go decision is likely to be made on reading the spec.
If it's clearly usable for a particular case, with examples provided
which align with a need, then its adoption is far more likely.
If half of the rec is not understandable, then ..... I think
that's obvious.

> ****In this instance, a synthesis text author would 
> reasonably be expected
> to specify more precisely exactly what he/she intended for 
> the resulting
> prosodic units by adding a 'size' or a 'time' attribute to the 'break'
> markup element. 

We use silences to good effect, as user research has shown.
I'd love to see <break time="2S"/>
Another shortcoming of Laureate.
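
As a rough illustration of how an application might generate such silences
from its own text, here is a minimal Python sketch using the standard
library's ElementTree. The element names speak and p are my assumption about
the markup; break and its time attribute are as discussed in the quoted text.
The function name speak_with_pauses is hypothetical.

```python
import xml.etree.ElementTree as ET

def speak_with_pauses(paragraphs, pause="2s"):
    """Build a <speak> document with a <break> between paragraphs.

    Element names are assumptions based on the markup under discussion;
    check them against whichever draft the target synth supports.
    """
    speak = ET.Element("speak")
    for i, text in enumerate(paragraphs):
        if i > 0:
            # Timed silence between consecutive paragraphs.
            ET.SubElement(speak, "break", {"time": pause})
        p = ET.SubElement(speak, "p")
        p.text = text
    return ET.tostring(speak, encoding="unicode")
```

Calling speak_with_pauses(["Chapter one.", "Chapter two."]) yields a single
speak element holding the two paragraphs with one timed break between them.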


> ****I am open to proposals in this area, but there would need to be
> substantially more data on it's potential usefulness.  I 
> agree with Richard Sproat that successful utilization of such a tag might 
> require linguistic
> expertise that would not likely be possessed by portal and 
> web text authors
> who I believe constitute the majority of potential users of this
> specification. 

I'd beg to differ.
If I could introduce speech onto my web pages, hosted by a
commercial ISP, I would, provided it is
a) cost effective.
b) sufficiently easy to pick up (the application developer view again).

I hope you don't seriously underestimate the potential of SSML
and VoiceXML.

> <rate> - "Words per minute" values are not reliably 
> implementable in any
> current
> synthesiser, although they may be a readily understandable measure of
> approximate speech
> rate. It is perhaps equally important to be able to specify 
> the dynamics of
> speech rate
> - accelerations, decelerations, constancies.

My suggestion would be even simpler.
Provide a rate of 1 to 100, let the synth people interpret that
for their engines, and users select appropriately by experiment.
wpm is used by typists... isn't it?
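
To show how little the synth side would need to do, here is a minimal Python
sketch of one possible interpretation: a linear mapping from the 1 to 100
scale onto an engine's own range. The function name rate_to_wpm and the
default limits of 80 and 400 words per minute are assumptions for
illustration only; each engine would choose its own limits and need not use
words per minute internally at all.

```python
def rate_to_wpm(rate, slowest_wpm=80.0, fastest_wpm=400.0):
    """Map a 1..100 rate onto an engine-specific speed range, linearly.

    The default limits are illustrative assumptions, not measured values.
    """
    if not 1 <= rate <= 100:
        raise ValueError("rate must be between 1 and 100")
    # 1 maps to the slowest speed, 100 to the fastest.
    fraction = (rate - 1) / 99
    return slowest_wpm + fraction * (fastest_wpm - slowest_wpm)
```

A user would simply try a few values and keep the one that sounds right,
without ever needing to know the engine's real units.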

Finally, I'm a long-term user of XSLT, beginning to use XSL-FO.
On the main list for that topic, speech has never once been mentioned.
Having asked, I'm unaware of any examples, never mind implementations.

Given an XML source, what should my target be for synth output?

I really would like to understand the relationship between 
the audio properties of FO, SSML and CSS. 

It confuses the hell out of me :-)

Regards DaveP.

Received on Monday, 22 January 2001 06:48:10 UTC