A Response to the W3C Draft Proposal for a Speech Synthesis Markup Language

from COST 258, European Co-Operative Action on Improving the Naturalness of Speech Synthesis

Editor: Alex Monaghan, Aculab plc, UK (Alex.Monaghan@aculab.com)


Background

COST 258 is a consortium of European speech synthesis experts from 17 countries. Funded by the European Commission, its aim is to promote co-operative research to improve the naturalness of synthetic speech. Its members come from both academic and industrial R&D centres, including at least five providers of commercial speech synthesis systems.

The W3C proposal was discussed at a meeting of COST 258 in September 2000. The present document collates the subsequent reactions and responses from members. It makes both general and specific points about the proposal, and suggests several modifications. While we are broadly sympathetic to, and supportive of, the attempt to standardise speech synthesis markup and to increase consistency across different synthesisers, we feel that there are many obstacles to such an attempt and that some of these obstacles are currently insurmountable.

General Points

  1. It is not clear who the intended users of this markup language are. There are two obvious types of possible users: speech synthesis system developers, and application developers. The former may well be concerned with low-level details of timing, pitch and pronunciation, and be able to specify these details (F0 targets, phonetic transcriptions, pause durations, etc.). The latter group are much more likely to be concerned with specifying higher-level notions such as levels of boundary, degrees of emphasis, fast vs slow speech rate, and formal vs casual pronunciation. The proposal appears to be aimed at both groups, but no indication is given as to which aspects of the markup language are intended for which group.

  2. It is clear that the proposal includes two, and in some cases three, different levels of markup. For F0, for instance, there is the <emphasis> tag (which would be realised as a pitch excursion in most systems), the <prosody contour> tag which allows finer control, and the low-level <pitch> tag which is proposed as a future extension. There is very little indication of best practice in the use of these different levels (e.g. which type of user should use which level), and no explanation of what should happen if the different levels are combined (e.g. a <prosody contour> specification inside an <emphasis> environment; see the first sketch after this list).
    We strongly suggest that some distinction between high-level markup (specifying the function or structure of the input) and low-level markup (specifying the form of the output) be introduced, ideally by providing two explicit markup sublanguages. The users of these sublanguages are unlikely to overlap. Moreover, while most synthesisers might support one level of markup or the other, there are currently very few synthesisers which could support both.
  3. The notion of "non-markup behavior" is confusing. On the one hand, there seems to be an assumption that markup will not affect the behaviour of the system outside the tags, and that the markup therefore complements the system's unmarked performance; on the other hand, there are references to "over-riding" the system's default behaviour. In general, it is unclear whether markup is intended to be superimposed on the default behaviour or to provide information which modifies that behaviour. The use of the <break> element, for instance, is apparently intended "to override the typical automatic behavior", but the insertion of a <break> tag may have non-local repercussions which are very hard to predict. Take a system which assigns prosodic boundaries stochastically and attempts to balance the number and length of units at each prosodic level. The "non-markup behavior" of such a system might take the input "Big fat cigars, lots of money." and produce two balanced units; but will the input "Big fat <break/> cigars, lots of money." produce three unbalanced units (big fat, cigars, lots of money), three more balanced units (big fat, cigars lots, of money), four balanced units (big fat, cigars, lots of, money), six single-word units, or something else? Which would be the correct interpretation of the markup? The first sketch after this list illustrates the problem.
  4. Many of the tags related to F0 presuppose that pitch is represented as a linear sequence of targets. This is the case for some synthesisers, particularly those using theories of intonation based on the work of Bruce, Ladd or Pierrehumbert. However, the equally well-known Fujisaki approach is also commonly used in synthesis systems, as are techniques involving the concatenation of natural or stylised F0 contours: in these approaches, notions such as pitch targets, baselines and ranges have very different meanings and in some cases no meaning at all. The current proposal is thus far from theory-neutral, and is not implementable in many current synthesisers.
  5. There is no provision for local or language-specific additions, such as different classes of abbreviations (e.g. the distinction between a true acronym such as DEC, which is pronounced as a word, and an abbreviation such as NEC, which is spelled out letter by letter), different types of numbers (animate versus inanimate in many languages), or the prosodic systems of tone languages. Some specific examples are discussed below, and the second sketch after this list illustrates the acronym case, but provision for anything other than English is minimal in the current proposal. As compliant systems extend their language coverage, they should be able to add the required markup in a standard way, even if it has not been foreseen by the W3C.
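To make the interaction problem of points 2 and 3 concrete, consider the following fragment. This is a sketch only: the element names follow the draft as cited above, but the contour attribute syntax and its values are our assumption, and the intended rendering is precisely what the proposal leaves undefined.

    <speak>
      <!-- Point 2: low-level markup nested inside high-level markup.
           Should the pitch excursion implied by <emphasis> be added to,
           replaced by, or blended with the explicit contour below? -->
      <emphasis>
        <prosody contour="(0%,+20Hz) (50%,+40Hz) (100%,+10Hz)">
          big fat cigars
        </prosody>
      </emphasis>

      <!-- Point 3: an explicit break inserted into input which the
           system would otherwise render as two balanced units. All of
           the groupings listed in point 3 are consistent with this tag. -->
      Big fat <break/> cigars, lots of money.
    </speak>

Under the current wording, two compliant synthesisers could render either fragment quite differently, and both could claim conformance.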
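The acronym problem in point 5 can be sketched in the same way. The type attribute and the value "acronym" follow our reading of the draft; the value "spell-out" is a hypothetical addition and is not part of the proposal.

    <!-- The draft appears to offer a single value for both cases: -->
    <say-as type="acronym">DEC</say-as>  <!-- pronounced as a word: "deck" -->
    <say-as type="acronym">NEC</say-as>  <!-- spelled out: "N-E-C" -->

    <!-- A hypothetical distinction, NOT in the current proposal: -->
    <say-as type="acronym">DEC</say-as>
    <say-as type="spell-out">NEC</say-as>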


Specific Tags

<say-as> <voice> <break> <prosody> <rate> <audio>


Suggested Modifications

  1. Distinguish clearly between tags intended for speech synthesis developers and tags intended for application developers. Perhaps two separate markup languages (high-level and low-level) should be specified. This would have the desirable side-effect of allowing a synthesiser to comply with only one level of markup, depending on the intended users.
  2. Clarify the intended resolution of conflicts between high-level and low-level markup, or explain the dangers of using both types in the same document. This would be simpler if there were two distinct levels of markup.
  3. Clarify the intended effect of tags on the default behaviour of synthesis systems. Should they be processed BEFORE the system performs its "non-markup behavior", or AFTER the default output has been calculated? Does this vary depending on the tag? Again, this may be resolved by introducing two distinct levels of markup.
  4. Revise the F0 tags to allow for theory-neutral interpretation: if this is not done, the goal of interoperability across synthesis platforms cannot be achieved.
  5. Provide a mechanism for extending the standard to include unforeseen cases, particularly language-specific or multilingual requirements; one possible form is sketched below.
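One possible form for such an extension mechanism is standard XML namespaces, sketched below. The namespace URI and the element name are invented for illustration and are not part of the proposal.

    <speak xmlns:ext="http://www.example.org/synthesis-extensions">
      <!-- A language-specific extension for a tone language, declared
           in its own namespace so that a processor which does not
           recognise it can skip it and apply its default behaviour. -->
      <ext:tone value="3">ma</ext:tone>
    </speak>

Because such extensions live in a separate namespace, a conforming synthesiser which does not support them can still process the rest of the document, and the core standard remains stable.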