RE: Speech Synthesis markup language - COST258 comments

Dear Alex,

Thank you for your review of the SSML specification.  It's been two years,
but we thought it appropriate to send an official response as if you had
sent the comments today.  We have reproduced your text below, but in a
somewhat different arrangement in order to group together related comments.

If you believe we have not adequately addressed your issues with our
responses, please let us know as soon as possible.  If we do not hear
from you within 14 days, we will take this as tacit acceptance.  If you
believe you will need more time than this for your review, we would
appreciate an estimate of how much time you will need.

Again, thank you for your input.  Please feel free to forward this to
others as you see fit.

-- Dan Burnett
Synthesis Team Leader, VBWG

[VBWG responses are embedded, preceded by '>>>']

-----Original Message-----
From: [] On
Behalf Of
Sent: Friday, January 19, 2001 1:31 AM
To: <others>
Subject: Speech Synthesis markup language - COST258 comments
Subject: Speech Synthesis markup language - COST258 comments

Dear colleagues,
Here are the comments of COST 258 on the proposed W3C Speech Synthesis
Markup Language. This is a revised version of the plain text draft which
was circulated in December.
We believe that the current proposal should be amended in various important
(but not necessarily major) respects in order to fulfil its aims of
multilinguality, interoperability and implementability in current
synthesisers. We also believe that the formulation of such a universal
standard may be premature, and that the possibility of unforeseen needs or
difficulties in treating less commonly synthesised languages (e.g. Spanish
or Chinese) should be catered for in the current proposal.
The attached HTML file contains our detailed comments and suggestions.
With best wishes,
Alex Monaghan
(for COST 258)


1. It is not clear who the intended users of this markup language are.
 There are two obvious types of possible users: speech synthesis system
 developers, and application developers. The former may well be
 concerned with low-level details of timing, pitch and pronunciation,
 and be able to specify these details (F0 targets, phonetic transcriptions,
 pause durations, etc.). The latter group are much more likely to be
 concerned with specifying higher-level notions such as levels of boundary,
 degrees of emphasis, fast vs slow speech rate, and formal vs casual
 pronunciation. The proposal appears to be aimed at both groups, but no
 indication is given as to which aspects of the markup language are
 intended for which group.

 Distinguish clearly between tags intended for speech synthesis developers
 and tags intended for application designers.

>>> Proposed disposition:  Rejected
>>> We believe that all the tags are appropriate for and needed by
>>> application developers. Commercial deployments of SSML so far
>>> appear to have borne out this conclusion.

2. It is clear that the proposal includes two, and in some cases three,
 different levels of markup. For F0, for instance, there is the
 <emphasis> tag (which would be realised as a pitch excursion in most
 systems), the <prosody contour> tag which allows finer control, and
 the low-level <pitch> tag which is proposed as a future extension.
 There is very little indication of best practice in the use of these
 different levels (e.g. which type of user should use which level),
 and no explanation of what should happen if the different levels are
 combined (e.g. a <pitch contour> specification inside an <emphasis>
 element).

 Clarify the intended resolution of conflicts between high-level and
 low-level markup, or explain the dangers of using both types in the
 same document. This would be simpler if there were two distinct levels
 of markup.

>>> Proposed disposition:  Accepted
>>> This is an excellent point. We will note the dangers as you suggest.
>>> We will also note that although the behaviors of the individual elements
>>> are specified, details about how conflicts are resolved are implementation
>>> specific.
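
>>> To illustrate the kind of mixed-level conflict at issue, consider a
>>> sketch in which a high-level <emphasis> element wraps a low-level
>>> <prosody> contour. The contour values are illustrative only; how a
>>> processor reconciles the two levels is implementation specific.

```xml
<!-- Hypothetical fragment: high-level emphasis wrapping a low-level
     pitch contour. Contour values are illustrative only; resolution
     of the conflict between levels is implementation specific. -->
<emphasis level="strong">
  <prosody contour="(0%,+10Hz) (40%,+30Hz) (100%,+5Hz)">
    this phrase receives both instructions
  </prosody>
</emphasis>
```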

3. We strongly suggest that some distinction between high-level markup
 (specifying the function or structure of the input) and low-level
 markup (specifying the form of the output) be introduced, ideally by
 providing two explicit markup sublanguages. The users of these
 sublanguages are unlikely to overlap. Moreover, while most synthesisers
 might support one level of markup or the other, there are currently
 very few synthesisers which could support both.

 Perhaps two separate markup languages (high-level and low-level) should
 be specified. This would have the desirable side-effect of allowing a
 synthesiser to comply with only one level of markup, depending on the
 intended users.

>>> Proposed disposition:  Rejected
>>> There are certainly complete implementations of SSML today that
>>> implement both high and low level tags. This separation is something
>>> we will consider for a later version of SSML (beyond 1.0). For this
>>> specification we will add a note that although the tags themselves
>>> may be supported, details of the interactions between the two levels
>>> are implementation specific. We will encourage developers to use caution
>>> in mixing them arbitrarily.

4. The notion of "non-markup behavior" is confusing. On the one hand,
 there seems to be an assumption that markup will not affect the
 behaviour of the system outside the tags, and that the markup therefore
 complements the system's unmarked performance, but on the other hand
 there are references to "over-riding" the system's default behaviour.
 In general, it is unclear whether markup is intended to be superimposed
 on the default behaviour or to provide information which modifies that
 behaviour. The use of the <break> element, for instance, is apparently
 intended "to override the typical automatic behavior", but the insertion
 of a <break> tag may have non-local repercussions which are very hard to
 predict. Take a system which assigns prosodic boundaries stochastically,
 and attempts to balance the number and length of units at each prosodic
 level. The "non-markup behavior" of such a system might take the input
 "Big fat cigars, lots of money." and produce two balanced units; but will
 the input "Big fat <break/> cigars, lots of money." produce three
 unbalanced units (big fat, cigars, lots of money), or three more balanced
 units (big fat, cigars lots, of money), or four balanced units (big fat,
 cigars, lots of, money), or six single-word units, or something else?
 Which would be the correct interpretation of the markup?

 Clarify the intended effect of tags on the default behaviour of synthesis
 systems. Should they be processed BEFORE the system performs its
 "non-markup behavior", or AFTER the default output has been calculated?
 Does this vary depending on the tag? Again, this may be resolved by
 introducing two distinct levels of markup.

>>> Proposed disposition:  Accepted with changes
>>> This is a good point. As you surmised, the behavior does vary
>>> depending on the tag, largely because the processor has the ultimate
>>> authority to ensure that what it produces is pronounceable (and
>>> ideally intelligible). In general the markup provides a way for the
>>> author to make prosodic and other information available to the
>>> processor, typically information the processor would be unable to
>>> acquire on its own. It is up to the processor to determine whether
>>> and in what way to use the information.
>>> We will provide some additional text to clarify this behavior.
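
>>> As a concrete sketch of this behavior, the author-supplied <break/>
>>> in the example from comment 4 makes a boundary hint available to the
>>> processor, which remains free to rebalance the surrounding phrasing:

```xml
<!-- The explicit break is a hint; how the processor rebalances the
     remaining prosodic units around it is processor-specific. -->
<s>Big fat <break/> cigars, lots of money.</s>
```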

5. Many of the tags related to F0 presuppose that pitch is represented
 as a linear sequence of targets. This is the case for some synthesisers,
 particularly those using theories of intonation based on the work of
 Bruce, Ladd or Pierrehumbert. However, the equally well-known Fujisaki
 approach is also commonly used in synthesis systems, as are techniques
 involving the concatenation of natural or stylised F0 contours: in these
 approaches, notions such as pitch targets, baselines and ranges have
 very different meanings and in some cases no meaning at all. The current
 proposal is thus far from theory-neutral, and is not implementable in
 many current synthesisers.

 Revise the F0 tags to allow for theory-neutral interpretation: if this
 is not done, the goal of interoperability across synthesis platforms
 cannot be achieved.

>>> Proposed disposition:  Rejected
>>> It is outside the scope of this group to design a theory-neutral
>>> approach. We are not aware of the existence of such an approach, and
>>> so far in commercial systems we have seen considerable support for the
>>> current approach. There is also no requirement within the specification
>>> that any of the theories you mention be used in implementation. Rather,
>>> F0 variation is expressed in terms of pitch targets but can be mapped
>>> into any underlying model the processor wishes.

6. There is no provision for local or language-specific additions, such
 as different classes of abbreviations (e.g. the distinction between a
 true acronym such as DEC and an abbreviation such as NEC), different
 types of numbers (animate versus inanimate in many languages), or the
 prosodic systems of tone languages. Some specific examples are discussed
 below, but provision for anything other than English is minimal in the
 current proposal. As compliant systems extend their language coverage,
 they should be able to add the required markup in a standard way, even
 if it has not been foreseen by the W3C.

 Provide a mechanism for extending the standard to include unforeseen cases,
 particularly language-specific or multilingual requirements.

>>> Proposed disposition:  Rejected
>>> It is difficult, if not impossible, to incorporate a generic mechanism
>>> that will work for all of the language features you're describing, in
>>> addition to unforeseen features, in a standard manner. It may be
>>> possible to have extensions to the specification later on as we
>>> discover standardized ways to provide the information you suggest.
>>> We welcome your input for such future extensions.

7. <say-as>: Several categories could be added to this tag, including
 credit card numbers (normally read in groups) and the distinction
 between acronyms (DEC, DARPA, NASA) and letter-by-letter abbreviations
 (USA, IBM, UK).

 Add the categories mentioned above.

>>> Proposed disposition:  Rejected
>>> These are good suggestions. However, we have removed all attribute
>>> values and their definitions from the <say-as> element. To avoid
>>> inappropriate assumptions about what is specified, we will also be
>>> removing the examples from the <say-as> section. We expect to begin
>>> work on specifying the details of the <say-as> element when SSML 1.0
>>> reaches the Candidate Recommendation stage. We will consider your
>>> suggestions at that time.

8. In languages with well-developed morphology, such as Finnish or Spanish,
 the pronunciation of numbers and abbreviations depends not only on whether
 they are ordinal or cardinal but also on their gender, case and even
 semantic properties. These are often not explicit, or even predictable,
 from the text. It would be advisable to extend the <say-as> tag to include
 an optional attribute to hold such information.

>>> Proposed disposition:  Rejected
>>> We are aware of this issue and have considered it again in response
>>> to your input, but we are not prepared to address it at this time.
>>> As you point out, there is broad variability in the categories and
>>> structure of this information. The <say-as> element is only designed
>>> to indicate simple structure for cases where the synthesis processor
>>> is unable to determine it on its own. Where large amounts of context-
>>> dependent information would be required in order to adequately inform
>>> the processor, we would recommend not using the <say-as> element at all.
>>> Rather, we recommend that numbers and abbreviations be instead written
>>> out orthographically, as is possible with any text over which the
>>> application writer wishes absolute control.

9. <voice> element: It seems unnecessary to reset all prosodic aspects
 to their defaults when the voice changes. This prevents the natural-
 sounding incorporation of direct speech using a different voice, and
 also makes the reading of bilingual texts (common in Switzerland, Eastern
 Europe, the Southern USA, and other exotic places) very awkward. Although
 absolute values cannot be carried over from voice to voice, it should
 be possible to transfer relative values (slow/fast, high/medium/low,
 etc.) quite easily.

 Allow the option of retaining relative prosodic attributes (pitch, rate,
 etc.) when the voice is changed.

>>> Proposed disposition:  Accepted with changes
>>> We agree in principle with your suggestion. We will remove the
>>> contentious paragraph and replace it with one explaining that
>>> o relative changes in prosodic parameters are expected to be carried
>>>   across voice changes, but
>>> o different voices have different natural defaults for pitch, speaking
>>>   rate, etc. because they represent different personalities, so
>>> o absolute values of the prosodic parameters may vary across changes in
>>>   the voice.
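
>>> A sketch of the intended behavior (the voice name is hypothetical):

```xml
<!-- Sketch: the relative rate setting is expected to carry across the
     voice change, while absolute defaults (e.g. baseline pitch) come
     from each voice. The voice name "anna" is hypothetical. -->
<prosody rate="slow">
  This sentence is spoken slowly.
  <voice name="anna">
    The relative slow rate should still apply to this voice,
    even though its absolute pitch defaults differ.
  </voice>
</prosody>
```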

10. <break> element: Some languages have a need for more levels of
 prosodic boundary below a minor pause, and some applications may require
 boundaries above the paragraph level. It would be advisable to add an
 optional "special" value for these cases.

 Add an optional "special" attribute to allow language-specific and
 application-specific extensions.

>>> Proposed disposition:  Rejected
>>> This is a good suggestion, but it is too extensive to add to the
>>> specification at this time. This feature will be deferred to the
>>> next version of SSML.

11. <prosody> element: There is currently no provision for languages
 with lexical tone. These include many commercially important languages
 (e.g. Chinese, Swedish, Norwegian), as well as most of the other
 languages of the world. Although tone can be specified in a full IPA
 transcription, the ability to specify tone alongside the orthography
 would be very useful.

 Add an optional "tone" attribute.

>>> Proposed disposition:  (none yet)
>>> It is unclear how you would expect this to work. As you point out,
>>> this can be specified in full IPA, which is possible with the phoneme
>>> element today.
>>> How would you envision specifying tone *alongside* the orthography?

12. <rate> element: There is currently no unit of measurement for this tag.
 The "Words per minute" values suggested in the previous draft were at least
 a readily understandable measure of approximate speech rate. If their
 approximate nature were made explicit, these could function as indicative
 values and would be implementable in all synthesisers.

>>> Proposed disposition:  Rejected
>>> Because of the difficulty in accurately defining the meaning of words
>>> per minute, syllables per minute, or phonemes per minute across all
>>> possible languages, we have decided to replace such specification with
>>> a number that acts as a multiplier of the default rate. For example,
>>> a value of 1 means a speaking rate equal to the default rate, a value
>>> of 2 means a speaking rate twice the default rate, and a value of 0.5
>>> means a speaking rate of half the default rate. The default rate is
>>> processor-specific and will usually vary across both languages and
>>> voices. Percentage changes relative to the current rate are still
>>> permitted. Note that the effect of setting a specific words per minute
>>> rate (for languages for which that makes sense) can be achieved by
>>> explicitly setting the duration for the contained text via the
>>> duration attribute of the <prosody> element. The duration attribute
>>> can be used in this way for all languages and is therefore the
>>> preferred way of precisely controlling the rate of speech when that
>>> is desired.
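
>>> The two mechanisms described above can be sketched as follows
>>> (values are illustrative only):

```xml
<!-- Rate as a multiplier of the processor's default speaking rate. -->
<prosody rate="2">spoken at twice the default rate</prosody>
<prosody rate="0.5">spoken at half the default rate</prosody>
<!-- Precise control: fix the total duration of the contained text. -->
<prosody duration="2500ms">this phrase is stretched or compressed
to last two and a half seconds</prosody>
```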

13. <rate> element: It is equally important to be able to specify the
 dynamics of speech rate - accelerations, decelerations, constancies.
 These are not mentioned in the current proposal.

>>> Proposed disposition:  Rejected
>>> These are good suggestions, but they are too extensive to add to the
>>> specification at this time. These features will be deferred to the next
>>> version of SSML.

14. <audio> element: Multimodal systems (e.g. animations) are likely to
 require precise synchronisation of audio, images and other resources. This
 may be beyond the scope of the proposed standard, but could be included in
 the <lowlevel> tag.

 Consider a <lowlevel> extension to allow synchronisation of speech with other
 resources.

>>> Proposed disposition:  Rejected
>>> As you suggest, this class of additions is outside the scope of the
>>> specification. We think it likely that other specifications such as
>>> SMIL would be more appropriate for this functionality. To the best
>>> of our knowledge, there are no major technical problems with
>>> integration of SMIL and SSML functionality.

Received on Friday, 8 August 2003 20:11:50 UTC