RE: W3C speech synth mark-up - the COST 258 response

Dear Dave,

Thank you for your review of the SSML specification.  It's been two years,
but we thought it appropriate to send an official response as if you had
sent the comments today. 

If you believe we have not adequately addressed your issues with our
responses, please let us know as soon as possible.  If we do not hear
from you within 14 days, we will take this as tacit acceptance. 

Again, thank you for your input.

-- Dan Burnett
Synthesis Team Leader, VBWG

[VBWG responses are embedded, preceded by '>>>']

-----Original Message-----
From: www-voice-request@w3.org [mailto:www-voice-request@w3.org] On
Behalf Of DPawson@rnib.org.uk
Sent: Monday, January 22, 2001 2:40 AM
To: www-voice@w3.org
Subject: RE: W3C speech synth mark-up - the COST 258 response


I'd like to pick up on some of the points made by Mark Walker, 
in response to Alex Monaghan's comments.

Firstly, a little background.
We have been using Text to Speech for about 18 months,
to produce alternative media for visually impaired customers.
We have learned over that time just what type of material
is suitable.
Our needs are:
- XML source.
- The ability to insert external audio files into the audio stream
  (audible navigation points, tone bursts at 55 Hz which are
  findable when the tape is played at fast forward).
- The ability to add to a dictionary / word set those words which
  the synth gets wrong.
- The ability to identify, and have spoken correctly, standard items
  such as dates, acronyms etc.

>>> Proposed disposition for the above:  Some accepted, some rejected
>>> 
>>> SSML 1.0 is based on XML.
>>> It is possible to insert external audio files into the audio
>>> stream using the <audio> element.
>>> It is possible, via the <lexicon> element, to add to a lexicon
>>> those words which the synth gets wrong.
>>> We have removed the specification for interpretation hints for
>>> dates, etc. (part of the <say-as> element) but intend to reactivate
>>> that work as a separate activity when SSML 1.0 reaches the Candidate
>>> Recommendation stage. We will consider your suggestion "Ability to
>>> id and have spoken correctly standard items such as dates, acronyms
>>> etc." at that time.
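>>> 
>>> As an illustrative sketch only (the lexicon and audio URIs below
>>> are placeholders, not working resources), the <audio> and
>>> <lexicon> points above might look like this in practice:
>>> 
>>>   <speak version="1.0" xml:lang="en-GB"
>>>          xmlns="http://www.w3.org/2001/10/synthesis">
>>>     <!-- user lexicon for words the synth gets wrong -->
>>>     <lexicon uri="http://example.org/rnib-lexicon.xml"/>
>>>     <!-- external audio spliced into the stream,
>>>          e.g. a 55 Hz navigation tone -->
>>>     <audio src="http://example.org/tone-55hz.wav"/>
>>>     Chapter one.
>>>   </speak>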

A potential further use is for digital talking books.
With text and no audio, a customer may ask for a word to be spelled
out, or the whole book to be read. This has not been realised
to date. (See www.daisy.org)

Taking a few points from the email.

> 1 - It is not clear who the intended users of this markup language
> are. There are two obvious types of possible users: speech synthesis
> system developers, and application developers.

We clearly fall into the application side. 
The grammar side of VoiceXML simply has me puzzled :-)

Because of this, I would support the separation of these two,
possibly following the example set by the XSL WG?


> ****Again, the current form of the specification was largely
> developed in a vacuum of information on potential usage models.

I hope my process outline above provides a real-life use case.
We are desperately seeking an alternative to Laureate, which
only goes halfway there.
(Its new word addition is brilliant, except that it becomes
hard to include non-English new words :-)


> Does the possibility of mixing high and low elements really
> represent an **insurmountable** barrier to supporting SSML?
> Please provide more detail.

My response to this point is that, for an application developer,
the go/no-go decision is likely to be made on reading the spec.
If it's clearly usable for a particular case, with examples provided
which align with a need, then its adoption is far more likely.
If half of the rec is not understandable, then... I think that's
obvious.


> ****In this instance, a synthesis text author would reasonably be
> expected to specify more precisely exactly what he/she intended for
> the resulting prosodic units by adding a 'size' or a 'time'
> attribute to the 'break' markup element.

We use silences to good effect, as user research has shown.
I'd love to see <break time="2s"/>.
Another shortcoming of Laureate.

>>> Proposed disposition:  Accepted
>>> 
>>> This capability is in the most recent draft of the specification.
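>>> 
>>> For example (a sketch only), a two-second silence can now be
>>> requested in the contained text with:
>>> 
>>>   End of section. <break time="2s"/> Next section.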

 

> ****I am open to proposals in this area, but there would need to be
> substantially more data on its potential usefulness.  I agree with
> Richard Sproat that successful utilization of such a tag might
> require linguistic expertise that would not likely be possessed by
> portal and web text authors who I believe constitute the majority
> of potential users of this specification.

I'd beg to differ.
If I could introduce speech onto my web pages, hosted by a
commercial ISP, I would, provided it's
a) cost effective, and
b) sufficiently easy to pick up (the application developer view again).

I hope you don't seriously underestimate the potential of SSML
and VoiceXML.

> <rate> - "Words per minute" values are not reliably implementable
> in any current synthesiser, although they may be a readily
> understandable measure of approximate speech rate. It is perhaps
> equally important to be able to specify the dynamics of speech rate
> - accelerations, decelerations, constancies.

My suggestion would be even simpler.
Provide a rate from 1 to 100, let the synth people interpret that
for their engines, and let users select appropriately by experiment.
WPM is used by typists... isn't it?

>>> Proposed disposition:  Rejected
>>> 
>>> Because of the difficulty in accurately defining the meaning
>>> of words per minute, syllables per minute, or phonemes per
>>> minute across all possible languages, we have decided to
>>> replace such specification with a number that acts as a
>>> multiplier of the default rate. For example, a value of 1
>>> means a speaking rate equal to the default rate, a value of
>>> 2 means a speaking rate twice the default rate, and a value
>>> of 0.5 means a speaking rate of half the default rate. The
>>> default rate is processor- specific and will usually vary
>>> across both languages and voices. Percentage changes relative
>>> to the current rate are still permitted. Note that the effect
>>> of setting a specific words per minute rate (for languages for
>>> which that makes sense) can be achieved by explicitly setting
>>> the duration for the contained text via the duration attribute
>>> of the <prosody> element. The duration attribute can be used in
>>> this way for all languages and is therefore the preferred way of
>>> precisely controlling the rate of speech when that is desired.
>>> This approach differs notably from your suggestion in that there
>>> is no maximum rate value. If this particular feature (maximum
>>> rate value) is important for you, could you provide some sample
>>> use cases?
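>>> 
>>> As a sketch of the two mechanisms described above (the values
>>> shown are arbitrary examples):
>>> 
>>>   <!-- twice the processor's default rate -->
>>>   <prosody rate="2">This sentence is read quickly.</prosody>
>>> 
>>>   <!-- precise control: the contained text takes five seconds -->
>>>   <prosody duration="5s">This sentence takes five seconds.</prosody>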

Finally, I'm a long-term user of XSLT, beginning to use XSL-FO.
On the main list for that topic, speech has never once been mentioned.
Having asked, I'm unaware of any examples, never mind implementations.

Given an XML source, what should my target be for synth output?
XSL-FO, SSML, or HTML + ACSS?

I really would like to understand the relationship between 
the audio properties of FO, SSML and CSS. 

It confuses the hell out of me :-)

>>> ACSS is now being designed to produce SSML, so either format is
>>> fine to use.  ACSS provides such capabilities in terms of styling,
>>> while SSML allows you to directly output whatever text you would
>>> like and with as much control as you would like.  There is no
>>> official relationship between XSL-FO and SSML, although formatting
>>> objects can be used to produce aural CSS styling in some form (which
>>> may ultimately render to SSML if defined to do so).

Regards, DaveP.

Received on Friday, 8 August 2003 20:11:49 UTC