A Response to the W3C Draft Proposal for a Speech Synthesis Markup Language
from COST 258, European Co-Operative Action on Improving the Naturalness of Speech Synthesis
Editor: Alex Monaghan, Aculab plc, UK (Alex.Monaghan@aculab.com)
Background
COST 258 is a consortium of European speech synthesis experts from 17 countries. Funded by the European Commission, its aim is to promote co-operative research to improve the naturalness of synthetic speech. Its members come from both academic and industrial R&D centres, including at least five providers of commercial speech synthesis systems.
The W3C proposal was discussed at a meeting of COST 258 in September 2000. The present document collates the subsequent reactions and responses from members. It makes both general and specific points about the proposal, and suggests several modifications. While we are broadly sympathetic to, and supportive of, the attempt to standardise speech synthesis markup and to increase consistency across different synthesisers, we feel that there are many obstacles to such an attempt and that some of these obstacles are currently insurmountable.
General Points
- It is not clear who the intended users of this markup language are. There are two obvious types of possible users: speech synthesis system developers, and application developers. The former may well be concerned with low-level details of timing, pitch and pronunciation, and be able to specify these details (F0 targets, phonetic transcriptions, pause durations, etc.). The latter group are much more likely to be concerned with specifying higher-level notions such as levels of boundary, degrees of emphasis, fast vs slow speech rate, and formal vs casual pronunciation. The proposal appears to be aimed at both groups, but no indication is given as to which aspects of the markup language are intended for which group.
It is clear that the proposal includes two, and in some cases three, different levels of markup. For F0, for instance, there is the <emphasis> tag (which would be realised as a pitch excursion in most systems), the <prosody contour> specification which allows finer control, and the low-level <pitch> tag which is proposed as a future extension. There is very little indication of best practice in the use of these different levels (e.g. which type of user should use which level), and no explanation of what should happen if the different levels are combined (e.g. a <prosody contour> specification inside an <emphasis> environment).
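For instance, nothing in the proposal says how a compliant system should render markup such as the following (the contour syntax here is our reading of the draft, and the values are purely illustrative):

    <emphasis level="strong">
      <prosody contour="(0%,+10%)(50%,+40%)(100%,0%)">
        absolutely
      </prosody>
    </emphasis>

Should the contour be superimposed on the pitch excursion implied by <emphasis>, replace it, or be ignored?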
We strongly suggest that some distinction between high-level markup (specifying the function or structure of the input) and low-level markup (specifying the form of the output) be introduced, ideally by providing two explicit markup sublanguages. The users of these sublanguages are unlikely to overlap. Moreover, while most synthesisers might support one level of markup or the other, there are currently very few synthesisers which could support both.
- The notion of "non-markup behavior" is confusing. On the one hand, there seems to be an assumption that markup will not affect the behaviour of the system outside the tags, and that the markup therefore complements the system's unmarked performance; on the other hand, there are references to "over-riding" the system's default behaviour. In general, it is unclear whether markup is intended to be superimposed on the default behaviour or to provide information which modifies that behaviour. The use of the <break> element, for instance, is apparently intended "to override the typical automatic behavior", but the insertion of a <break> tag may have non-local repercussions which are very hard to predict. Take a system which assigns prosodic boundaries stochastically, and attempts to balance the number and length of units at each prosodic level. The "non-markup behavior" of such a system might take the input "Big fat cigars, lots of money." and produce two balanced units. But will the input "Big fat <break/> cigars, lots of money." produce three unbalanced units (big fat, cigars, lots of money), or three more balanced units (big fat, cigars lots, of money), or four balanced units (big fat, cigars, lots of, money), or six single-word units, or something else? Which would be the correct interpretation of the markup?
- Many of the tags related to F0 presuppose that pitch is represented as a linear sequence of targets. This is the case for some synthesisers, particularly those using theories of intonation based on the work of Bruce, Ladd or Pierrehumbert. However, the equally well-known Fujisaki approach is also commonly used in synthesis systems, as are techniques involving the concatenation of natural or stylised F0 contours: in these approaches, notions such as pitch targets, baselines and ranges have very different meanings and in some cases no meaning at all. The current proposal is thus far from theory-neutral, and is not implementable in many current synthesisers.
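To make the assumption concrete: the draft's contour notation, as we read it, specifies F0 as a sequence of time-aligned point targets, e.g.

    <prosody contour="(0%,+10Hz)(30%,+40Hz)(100%,-10Hz)">
      a sequence of point targets
    </prosody>

In a Fujisaki-based system, F0 is generated by superimposing phrase and accent commands on a declining baseline, and in contour-concatenation systems whole stylised contours are selected as units; in neither case do individual point targets, baselines or ranges of this kind have a direct interpretation.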
- There is no provision for local or language-specific additions, such as different classes of abbreviations (e.g. the distinction between a true acronym such as DEC, pronounced as a word, and a letter-by-letter abbreviation such as NEC), different types of numbers (animate versus inanimate in many languages), or the prosodic systems of tone languages. Some specific examples are discussed below, but provision for anything other than English is minimal in the current proposal. As compliant systems extend their language coverage, they should be able to add the required markup in a standard way, even if it has not been foreseen by the W3C.
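One standard mechanism would be to allow vendor-, language- or application-specific attributes in a separate XML namespace, which compliant systems could ignore if unsupported. The namespace and attribute below are entirely hypothetical:

    <speak xmlns:fi="http://example.org/finnish-numbers">
      <say-as type="number" fi:case="partitive">7</say-as>
    </speak>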
Specific Tags
<say-as>
- Several categories could be added to this tag, including credit card numbers (normally read in groups) and the distinction between acronyms (DEC, DARPA, NASA) and letter-by-letter abbreviations (USA, IBM, UK).
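Assuming a type attribute on <say-as> along the lines of the current draft (the values shown are illustrative, and "credit-card" is our proposed addition):

    <say-as type="acronym">NASA</say-as>                  <!-- pronounced as a word -->
    <say-as type="spell-out">IBM</say-as>                 <!-- read letter by letter -->
    <say-as type="credit-card">4556123412345678</say-as>  <!-- read in groups of four -->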
- In languages with well-developed morphology, such as Finnish or Spanish, the pronunciation of numbers and abbreviations depends not only on whether they are ordinal or cardinal but also on their gender, case and even semantic properties. These are often not explicit, or even predictable, from the text. It would be advisable to extend the <say-as> tag to include an optional attribute to hold such information.
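A hypothetical morph attribute (both the name and the value syntax are ours, purely for illustration) might carry the missing information:

    <say-as type="number:ordinal" morph="gender=fem">3</say-as>  <!-- e.g. Spanish "tercera" rather than "tercero" -->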
<voice>
- It seems unnecessary to reset all prosodic aspects to their defaults when the voice changes. This prevents the natural-sounding incorporation of direct speech using a different voice, and also makes the reading of bilingual texts (common in Switzerland, Eastern Europe, the Southern USA, and other exotic places) very awkward. Although absolute values cannot be carried over from voice to voice, it should be possible to transfer relative values (slow/fast, high/medium/low, etc.) quite easily.
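For example, in a fragment like the following (attribute values illustrative), the current proposal requires the quoted speech to revert to the default rate; carrying the relative value "slow" across the voice change would sound far more natural:

    <prosody rate="slow">
      He leaned over and whispered,
      <voice gender="female">It's time to go.</voice>
    </prosody>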
<break>
- Some languages have a need for more levels of prosodic boundary below a minor pause, and some applications may require boundaries above the paragraph level. It would be advisable to add an optional "special" value for these cases.
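What we have in mind is a sketch like the following, where the label attribute and its value are hypothetical, and systems which do not recognise the label could fall back on their nearest supported boundary:

    <break size="special" label="clitic-group"/>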
<prosody>
- There is currently no provision for languages with lexical tone. These include many commercially important languages (e.g. Chinese, Swedish, Norwegian), as well as most of the other languages of the world. Although tone can be specified in a full IPA transcription, the ability to specify tone alongside the orthography would be very useful.
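A hypothetical tone attribute on <prosody> (the attribute and its values are ours, not the draft's) would allow, for example, Mandarin tones to be specified without a full phonetic transcription:

    <prosody tone="3">ma</prosody>  <!-- Mandarin third (low dipping) tone: "horse" -->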
<rate>
- There is currently no unit of measurement for this tag. The "words per minute" values suggested in the previous draft were at least a readily understandable measure of approximate speech rate. If their approximate nature were made explicit, these could function as indicative values and would be implementable in all synthesisers.
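If approximate words-per-minute values were reinstated, markup along the following lines (the speed attribute name is our invention, purely for illustration) would be meaningful, at least indicatively, to any synthesiser:

    <rate speed="180"> ... </rate>  <!-- roughly 180 words per minute -->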
- It is equally important to be able to specify the dynamics of speech rate: accelerations, decelerations, constancies. These are not mentioned in the current proposal.
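A hypothetical pair of attributes (the names are entirely ours) could express a gradual change between two rates:

    <rate start="150" end="220">
      The train pulled out of the station and slowly gathered speed.
    </rate>

Here the rate would rise smoothly from roughly 150 to roughly 220 words per minute over the enclosed text.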
<audio>
- Multimodal systems (e.g. animations) are likely to require precise synchronisation of audio, images and other resources. This may be beyond the scope of the proposed standard, but could be included in the <lowlevel> tag.
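As a sketch of what such an extension might look like (both the <lowlevel> content model and the mark mechanism shown here are hypothetical):

    <lowlevel>
      <mark name="frame-120"/>  <!-- synchronisation point for an animation frame -->
      <audio src="fanfare.wav"/>
    </lowlevel>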
Suggested Modifications
- Distinguish clearly between tags intended for speech synthesis developers and tags intended for application designers. Perhaps two separate markup languages (high-level and low-level) should be specified. This would have the desirable side-effect of allowing a synthesiser to comply with only one level of markup, depending on the intended users.
- Clarify the intended resolution of conflicts between high-level and low-level markup, or explain the dangers of using both types in the same document. This would be simpler if there were two distinct levels of markup.
- Clarify the intended effect of tags on the default behaviour of synthesis systems. Should they be processed BEFORE the system performs its "non-markup behavior", or AFTER the default output has been calculated? Does this vary depending on the tag? Again, this may be resolved by introducing two distinct levels of markup.
- Revise the F0 tags to allow for theory-neutral interpretation: if this is not done, the goal of interoperability across synthesis platforms cannot be achieved.
- Provide a mechanism for extending the standard to include unforeseen cases, particularly language-specific or multilingual requirements.
<say-as>
- Add the categories mentioned above, plus an optional attribute to hold agreement information.
<voice>
- Allow the option of retaining relative prosodic attributes (pitch, rate, etc.) when the voice is changed.
<break>
- Add an optional "special" attribute to allow language-specific and application-specific extensions.
<prosody>
- Add an optional "tone" attribute.
<audio>
- Consider a <lowlevel> extension to allow synchronisation of speech with other resources.