A Response to the W3C Draft Proposal for a Speech Synthesis Mark-up Language

from COST 258, European Co-Operative Action on Improving the Naturalness of Speech Synthesis

(http://www.unil.ch/imm/docs/LAIP/COST_258/cost258.htm)

Editor: Alex Monaghan, Aculab plc, UK (Alex.Monaghan@aculab.com)


Background

COST 258 is a consortium of European speech synthesis experts from 17 countries.
It is funded by the European Commission, and its aim is to promote co-operative research
to improve the naturalness of synthetic speech. Its members come from both academic and
industrial R&D centres, including at least five providers of comercial speech synthesis
systems. For more information, see the website given above.

The W3C proposal was discussed at a meeting of COST 258 in September 2000. The present
document collates the subsequent reactions and responses from members.
It makes both general and specific points about the proposal, and suggests several
modifications. While we are broadly sympathetic to, and supportive of, the attempt to
standardise speech synthesis markup and to increase consistency across different
synthesisers, we feel that there are many obstacles to such an attempt and that some of
these obstacles are currently insurmountable.


General Points

1 - It is not clear who the intended users of this markup language are. There are two
obvious types of possible users: speech synthesis system developers, and application
developers. The former may well be concerned with low-level details of timing, pitch and
pronunciation, and be able to specify these details (F0 targets, phonetic
transcriptions, pause durations, etc.). The latter group are much more likely to be
concerned with specifying higher-level notions such as levels of boundary, degrees of
emphasis, fast vs slow speech rate, and formal vs casual pronunciation.

2 - It is clear that the proposal includes two, possibly three, different levels of
markup. For F0, for instance, there is the <emphasis> tag (which would be realised as a
pitch excursion in most systems), the <prosody contour> tag which allows finer control,
and the low-level <pitch> tag which is a proposed extension. There is very little
indication of best practice in the use of these different levels (e.g. which type of
user should use which level), and no explanation of what should happen if the different
levels are combined (e.g. a <pitch contour> specification inside an <emphasis>
environment).

3 - The notion of "non-markup behavior" is confusing. On the one hand, there seems to be
an assumption that markup will not affect the behaviour of the system outside the tags,
and that the markup therefore complements the system's unmarked performance, but on the
other hand there are references to "over-riding" the system's default behaviour. In
general, it is unclear whether markup is intende to be superimposed on the default
behaviour or to provide information which modifies that behaviour.

The use of the <break> element, for instance, is apparently intended "to override the
typical automatic behavior", but the insertion of a <break> tag may have non-local
repercussions which are very hard to predict. Take a system which assigns prosodic
boundaries stochastically, and attempts to balance the number and length of units at
each prosodic level. The "non-markup behavior" of such a system might take the input
"Big fat cigars, lots of money." and produce two balanced units: but will the input
"Big fat <break/> cigars, lots of money." produce three unbalanced units (big fat,
cigars, lots of money), or three more balanced units (big fat, cigars lots, of money),
or four balanced units (big fat, cigars, lots of, money), or six single-word units, or
something else? Which would be the correct interpretation of the markup?

4 - Many of the tags related to F0 presuppose that pitch is represented as a linear
sequence of targets. This is the case for some synthesisers, particularly those using
theories of intonation based on the work of Bruce, Ladd or Pierrehumbert. However, the
equally well-known Fujisaki approach is also commonly used in synthesis systems, as are
techniques involving the concetenation of natural or stylised F0 contours: in these
approaches, notions such as pitch targets, baselines and ranges have very different
meanings and in some cases no meaning at all. The current proposal is thus far from
theory-neutral, and is not implementable in many current synthesisers.

5 - The current draft does not make clear what will be in "the standard" and what will
be optional or future extensions. The <lowlevel> tag is the most obvious example, but
various other tags mentioned above are not universally implementable and would therefore
prevent many systems from complying with the standard.

6 - There is no provision for local or language-specific additions, such as different
classes of abbreviations (e.g. the distinction between a true acronym such as DEC and an
abbreviation such as NEC), different types of numbers (animate versus inanimate in many
languages), or the prosodic systems of tone languages. Some specific examples are
discussed below, but provision for anything other than English is minimal in the current
proposal.


Specific Tags

<sayas> - Several categories could be added to this tag, including telephone numbers,
credit card numbers, and the distinction between acronyms (DEC, DARPA, NASA) and
letter-by-letter abbreviations (USA, IBM, UK).

In languages with well-developed morphology, such as Finnish or Spanish, the
pronunciation of numbers and abbreviations depends not only on whether they are ordinal
or cardinal but also on their gender, case and even semantic properties. These are often
not explicit, or even predictable, from the text. It would be advisable to extend the
<sayas> tag to include an optional "morph" value to hold such information.

<voice> - It seems unnecessary to reset all prosodic aspects to their defaults when the
voice changes. This prevents the natural-sounding incorporation of direct speech using a
different voice, and also makes the reading of bilingual texts (common in Switzerland,
Eastern Europe, the Southern USA, and other exotic places) very awkward. ALthough
absolute values cannot be carried over from voice to voice, it should be possible to
tranfer relative values (slow/fast, high/medium/low, etc.) quite easily.

<break> - Some languages have a need for more levels of prosodic boundary below a minor
pause, and some applications may require boundaries above the paragraph level. It would
be advisable to add an optional "special" value for these cases.

<prosody> - There is currently no provision for languages with lexical tone. These
include many commercially important languages (e.g. Chinese, Swedish, Norwegian), as
well as most of the other languages of the world.

<rate> - "Words per minute" values are not reliably implementable in any current
synthesiser, although they may be a readily understandable measure of approximate speech
rate. It is perhaps equally important to be able to specify the dynamics of speech rate
- accelerations, decelerations, constancies.

<audio> - Multimodal systems (e.g. animations) are likely to require precise
synchronisation of audio, images and other resources. This may be beyond the scope of
the proposed standard, but could be included in the <lowlevel> tag.


Suggested Modifications


1 - Distinguish clearly between tags intended for speech synthesis developers and tags
intended for application designers. Perhaps two separate markup languages (high-level
and low-level) should be specified.

2 - Clarify the intended resolution of conflicts between high-level and low-level
markup, or explain the dangers of using both types in the same document.

3 - Clarify the intended effect of tags on the default behaviour of synthesis systems.
Should they be processed BEFORE the system performs its "non-markup behavior", or AFTER
the default output has been calculated? Does this vary depending on the tag?

4 - Revise the F0 tags to allow for theory-neutral interpretation: if this is bnot done,
the goal of interoperability across synthesis platforms cannot be achieved.

5 - Clarify what constitutes the minimum core standard, and what are optional or future
extensions.

6 - Provide a mechanism for extending the standard to include unforeseen cases,
particularly language-specific or multilingual requirements.


<sayas> - Add the categories mentioned above, plus an optional "morph" value to hold
agreement information.

<voice> - Allow the option of retaining relative prosodic attributes (pitch, rate, etc.)
when the voice is changed.

<break> - Add an optional "special" value to allow language-specific and application-
specific extensions.

<prosody> - add an optional "tone" value.

<audio> - Consider a <lowlevel> extension to allow synchronisation of speech with other
resources.