- From: Walker, Mark R <mark.r.walker@intel.com>
- Date: Fri, 19 Jan 2001 16:19:46 -0800
- To: "'Alex.Monaghan@Aculab.com'" <Alex.Monaghan@Aculab.com>
- Cc: "Walker, Mark R" <mark.r.walker@intel.com>, www-voice@w3.org
Alex-

****Attached below is a detailed response to your original remarks. Again, thank you and your COST 258 colleagues for taking the time to review and comment on the voice browser SSML. In responding, I have adopted some of the suggestions made by Richard Sproat. His comments and the COST 258 comments now form the bulk of the changes proposed for SSML during this last call phase.

A Response to the W3C Draft Proposal for a Speech Synthesis Mark-up Language from COST 258, European Co-Operative Action on Improving the Naturalness of Speech Synthesis (http://www.unil.ch/imm/docs/LAIP/COST_258/cost258.htm)

Editor: Alex Monaghan, Aculab plc, UK (Alex.Monaghan@aculab.com)

Background

COST 258 is a consortium of European speech synthesis experts from 17 countries. It is funded by the European Commission, and its aim is to promote co-operative research to improve the naturalness of synthetic speech. Its members come from both academic and industrial R&D centres, including at least five providers of commercial speech synthesis systems. For more information, see the website given above.

The W3C proposal was discussed at a meeting of COST 258 in September 2000. The present document collates the subsequent reactions and responses from members. It makes both general and specific points about the proposal, and suggests several modifications. While we are broadly sympathetic to, and supportive of, the attempt to standardise speech synthesis markup and to increase consistency across different synthesisers, we feel that there are many obstacles to such an attempt and that some of these obstacles are currently insurmountable.

General Points

1 - It is not clear who the intended users of this markup language are. There are two obvious types of possible users: speech synthesis system developers, and application developers. The former may well be concerned with low-level details of timing, pitch and pronunciation, and be able to specify these details (F0 targets, phonetic transcriptions, pause durations, etc.). The latter group are much more likely to be concerned with specifying higher-level notions such as levels of boundary, degrees of emphasis, fast vs slow speech rate, and formal vs casual pronunciation.

****The confusion regarding the intended users of SSML is on target. The specification of support for markup at different levels of the synthesis process has resulted in a markup language that, as Richard Sproat points out, contains element groups addressed at both expert and non-expert users. This was not by design, however, but was simply the result of the voice browser committee deliberations. Having said that, we believe it is ultimately advantageous to have a range of elements that enable high, mid, and low-level control over synthesized output.

****As I mentioned in my previous letter, I believe it is unlikely that the low-level elements will ever be employed directly by application developers, even those developers that possess synthesis expertise. Low-level element sequences are most likely to be the output generated by SSML authoring tools employed by both expert and non-expert content authors in the preparation of synthesize-able material. It is not unreasonable to imagine that the tool users would use the high-level markup in the user interface, while 'compiling' the resulting text into XML pitch/phoneme sequences that closely represented the wishes of the author for the rendered output. This is obviously highly speculative, since there are no tools of this sort currently available.
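****To make the scenario more concrete, here is a purely hypothetical sketch (the attribute values below are invented for illustration and are not drawn from the draft). An author might write high-level markup such as:

   Would you like <emphasis level="strong"> coffee </emphasis> or tea?

and an authoring tool might 'compile' it into lower-level markup along the lines of:

   Would you like <prosody contour="(0%,+5%)(50%,+40%)(100%,+5%)" rate="-10%"> coffee </prosody> or tea?

with the tool, rather than the author, taking responsibility for the low-level detail.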
I can report, however, that this usage scenario has been discussed within Intel and in other venues with other companies outside of W3C, and has generally been met with approval.

2 - It is clear that the proposal includes two, possibly three, different levels of markup. For F0, for instance, there is the <emphasis> tag (which would be realised as a pitch excursion in most systems), the <prosody contour> tag which allows finer control, and the low-level <pitch> tag which is a proposed extension. There is very little indication of best practice in the use of these different levels (e.g. which type of user should use which level), and no explanation of what should happen if the different levels are combined (e.g. a <pitch contour> specification inside an <emphasis> environment).

****Again, the current form of the specification was largely developed in a vacuum of information on potential usage models. As the specification moves into recommendation form and is slowly adopted within the speech application community, it is anticipated that best practice and common usage information will emerge and will be incorporated in future revisions of the specification. It might be anticipated, based on my response to remark #1, that the intermixing of both high and low-level markup in a given application would be unlikely. I might also remark that you have suggested that this would be difficult, but not impossible. Does the possibility of mixing high and low-level elements really represent an **insurmountable** barrier to supporting SSML? Please provide more detail.

3 - The notion of "non-markup behavior" is confusing. On the one hand, there seems to be an assumption that markup will not affect the behaviour of the system outside the tags, and that the markup therefore complements the system's unmarked performance, but on the other hand there are references to "over-riding" the system's default behaviour. In general, it is unclear whether markup is intended to be superimposed on the default behaviour or to provide information which modifies that behaviour.

****The effect of markup in supplanting the un-marked behavior of a given synthesizer obviously depends on the type and the particular markup level being specified. In some cases, default behavior will be completely supplanted, and in other cases markup will result in the superimposition of relative effects. The text-analysis guides like 'say-as' are relatively unambiguous in that they completely supplant the default synthesizer behavior. Specifying 'emphasis', however, will likely result in relative changes in the pitch, duration, etc. of the utterance that will manifest perceptually as a superimposition.

The use of the <break> element, for instance, is apparently intended "to override the typical automatic behavior", but the insertion of a <break> tag may have non-local repercussions which are very hard to predict. Take a system which assigns prosodic boundaries stochastically, and attempts to balance the number and length of units at each prosodic level. The "non-markup behavior" of such a system might take the input "Big fat cigars, lots of money." and produce two balanced units: but will the input "Big fat <break/> cigars, lots of money." produce three unbalanced units (big fat, cigars, lots of money), or three more balanced units (big fat, cigars lots, of money), or four balanced units (big fat, cigars, lots of, money), or six single-word units, or something else? Which would be the correct interpretation of the markup?

****In this instance, a synthesis text author would reasonably be expected to specify more precisely what he/she intended for the resulting prosodic units by adding a 'size' or a 'time' attribute to the 'break' markup element, as in the sketch below. Having said that, I must say I have no idea what the common default behavior for the above utterance would be, and for that reason it would definitely NOT be logical to add markup to alter it. It seems reasonable that even a non-expert synthesis text author would utilize common sense and realize that specifying markup in sequences where default synthesizer behavior is highly variable would be a dicey proposition.
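****For example (the particular attribute values here are only illustrative), an author who specifically wanted the three-unit reading (big fat | cigars | lots of money) could write something like:

   Big fat <break size="small"/> cigars, <break size="medium"/> lots of money.

or, with an explicit duration:

   Big fat <break time="200ms"/> cigars, lots of money.

leaving correspondingly less room for a given synthesizer's default boundary assignment to vary.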
****However, in addressing your concern, it seems highly likely that high-level elements like 'break' will result in highly variable output across different synthesizers, and that this will be seen as normal and perfectly compliant behavior. Systems like the one you cite above might elect to 'balance' the effects of the <break/> insertion over a context larger than the specified interval. Others may elect to completely localize the effect, even to the detriment of the perceptual quality. Both behaviors would be compliant, and both would essentially convey the coarsely specified intent of the markup text author. I will place verbiage within the spec that makes this point clear.

4 - Many of the tags related to F0 presuppose that pitch is represented as a linear sequence of targets. This is the case for some synthesisers, particularly those using theories of intonation based on the work of Bruce, Ladd or Pierrehumbert. However, the equally well-known Fujisaki approach is also commonly used in synthesis systems, as are techniques involving the concatenation of natural or stylised F0 contours: in these approaches, notions such as pitch targets, baselines and ranges have very different meanings and in some cases no meaning at all. The current proposal is thus far from theory-neutral, and is not implementable in many current synthesisers.

****This is a significant issue and one that I was completely unaware of until you raised it. Obviously, the early days of the SSML requirements phase were dominated (apparently) by firms possessing synthesizers modeling intonation with the former approach. I would welcome any proposal that expanded the ability of the low-level elements to specify intonation in a less theory-biased manner.

****In answering Richard Sproat's specific concern about long-unit synthesizers, I will propose that the decision by any synthesis engine provider to support the SSML specification is probably ultimately driven by economics, not technology. Long-unit synthesizers like AT&T NextGen, for example, are very large and are deployed in tightly confined application environments like voice portal providers. The synthesis text authors are likely to be employed by the portal itself. The text is authored specifically for the portal engine, and the authors are likely to be very familiar with the performance of the system. Finally, the enormous size of the concatenative database means that much of the ability to produce very specific and expressive speech sequences already resides in the system. The economic benefits of implementing SSML are therefore probably minimal for engine providers of this type.

5 - The current draft does not make clear what will be in "the standard" and what will be optional or future extensions.
The <lowlevel> tag is the most obvious example, but various other tags mentioned above are not universally implementable and would therefore prevent many systems from complying with the standard.

****The 'future study' elements are generally denoted as such, but I will increase their visibility as non-spec items. In any case, all future study elements are to be dropped before publication as a Candidate Recommendation.

****Please specify what tags, in addition to F0 elements, would not be universally implementable.

6 - There is no provision for local or language-specific additions, such as different classes of abbreviations (e.g. the distinction between a true acronym such as DEC and an abbreviation such as NEC), different types of numbers (animate versus inanimate in many languages), or the prosodic systems of tone languages. Some specific examples are discussed below, but provision for anything other than English is minimal in the current proposal.

****The 'acronym' attribute of 'say-as' will be modified to allow the explicit specification of true acronyms and abbreviations. See below. The notion of language-specific additions is intriguing and I would welcome a proposal for a general markup method for doing this. I would also welcome a concrete proposal to deal with tone languages.

Specific Tags

<sayas> - Several categories could be added to this tag, including telephone numbers, credit card numbers, and the distinction between acronyms (DEC, DARPA, NASA) and letter-by-letter abbreviations (USA, IBM, UK).

****'telephone' is a 'say-as' attribute value. I'm unable to envision an example where <say-as type="number:digits"> would not suffice for credit card numbers.

****I honestly don't know how 'acronym' got so mangled, but somehow it did. I think that at some point, it was pointed out that 'acronym' would be the expected default behavior of a synthesizer encountering something like 'DEC' - that is, the synthesizer would, in the absence of markup, attempt to pronounce it as a word. That's why 'literal' was kept as the overriding effect of markup - but the attribute value ended up being named 'acronym' instead of 'literal' or 'spell-out'!

****Accordingly, as Richard Sproat has proposed, <say-as type="spell-out"> USA </say-as> and <say-as type="acronym"> DEC </say-as> will replace the current 'acronym' attribute value.

In languages with well-developed morphology, such as Finnish or Spanish, the pronunciation of numbers and abbreviations depends not only on whether they are ordinal or cardinal but also on their gender, case and even semantic properties. These are often not explicit, or even predictable, from the text. It would be advisable to extend the <sayas> tag to include an optional "morph" value to hold such information.

****I am open to proposals in this area, but there would need to be substantially more data on its potential usefulness. I agree with Richard Sproat that successful utilization of such a tag might require linguistic expertise that would not likely be possessed by the portal and web text authors who I believe constitute the majority of potential users of this specification. I would also wonder why markup would be required to resolve ambiguity in the specification of a property that would likely already be embodied as part of the default knowledge base of a Finnish, Spanish, etc. synthesizer.
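****For concreteness, the sort of markup I understand to be proposed might look like the following (the 'morph' attribute and its values are purely hypothetical and are not part of the current draft):

   Quedan exactamente <say-as type="number" morph="gender:fem"> 31 </say-as>.

where the number refers back to a feminine noun (páginas, say) mentioned in an earlier sentence, and the hint would steer a Spanish synthesizer toward "treinta y una" rather than the default "treinta y uno".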
<voice> - It seems unnecessary to reset all prosodic aspects to their defaults when the voice changes. This prevents the natural-sounding incorporation of direct speech using a different voice, and also makes the reading of bilingual texts (common in Switzerland, Eastern Europe, the Southern USA, and other exotic places) very awkward. Although absolute values cannot be carried over from voice to voice, it should be possible to transfer relative values (slow/fast, high/medium/low, etc.) quite easily.

****Usage note #3 suggests a simple method, using a style sheet, to preserve prosodic properties across voice changes, both fundamental and relative. My own experience with XML confirms that this is quite straightforward. In generating the specification we had to choose a default behavior, and 'reset' was selected as the most commonly desired result of an embedded voice change. I will take a note to generate an example style sheet that demonstrates how prosody may be preserved and include it in the next rev.

<break> - Some languages have a need for more levels of prosodic boundary below a minor pause, and some applications may require boundaries above the paragraph level. It would be advisable to add an optional "special" value for these cases.

****I'm not sure if I fully understand the issue, but I get the (weak) impression that the problem may lie in the specification of the prescribed effect on perceptual quality when a <break/> element is inserted to override default behavior in an utterance where default synthesizer behavior might be expected to vary significantly across different synthesizer engines. If this is the case, then please refer to my remarks above. If not, please provide more detail and examples.

****In any case, the 'special' attribute is a rather vacuous term for the complex cases you cite above. Please provide more examples. Perhaps a suite of new attributes for these cases would be appropriate.

<prosody> - There is currently no provision for languages with lexical tone. These include many commercially important languages (e.g. Chinese, Swedish, Norwegian), as well as most of the other languages of the world.

****You are correct that tonal languages are not specifically supported, although there were informal proposals to do so. I re-iterate my willingness to seriously review any formal proposal to expand SSML to tonal languages.

<rate> - "Words per minute" values are not reliably implementable in any current synthesiser, although they may be a readily understandable measure of approximate speech rate. It is perhaps equally important to be able to specify the dynamics of speech rate - accelerations, decelerations, constancies.

****I think it is logical to assume that the 'rate' attribute of the 'prosody' element would only be employed by a markup-wielding author to alter speaking rate over long sequences. I think it is also logical to assume that, even though a given synthesizer may not possess an algorithmic representation of WPM, it would have some intrinsic notion of speech rate that could be modified to render the intent of the non-expert markup author. And not to sound like a broken record, but I would welcome a concrete proposal for adding accelerations, etc. to the 'prosody' element.

<audio> - Multimodal systems (e.g. animations) are likely to require precise synchronisation of audio, images and other resources. This may be beyond the scope of the proposed standard, but could be included in the <lowlevel> tag.

****Multimodal synchronization is definitely desirable, but it is indeed beyond the scope of SSML. That was the original intent of the 'mark' element, which I am now recommending be removed. The reason is that this capability has been superseded by the rapidly evolving SMIL markup standard. Aron Cohen (head of SMIL and an Intel colleague) informs me that SSML elements embedded in a SMIL sequence specification could form the basis of an animation/synthesis application. I will continue to monitor SMIL with an eye to eventually producing an example fragment demonstrating this capability.
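****As a very rough and entirely speculative illustration of the kind of arrangement that might become possible (the <par>, <animation> and <audio> elements are from SMIL 1.0; the file names are invented, and whether SSML content would be referenced as a separate document, as sketched here, or embedded inline is exactly the detail still to be worked out):

   <par>
     <animation src="talking-head.anim"/>
     <audio src="welcome-prompt.ssml"/>
   </par>

where the <par> element asks the SMIL player to render the animation and the synthesized audio in parallel.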
Suggested Modifications

1 - Distinguish clearly between tags intended for speech synthesis developers and tags intended for application designers. Perhaps two separate markup languages (high-level and low-level) should be specified.

****I support this proposal and will work with COST 258 to further refine it.

2 - Clarify the intended resolution of conflicts between high-level and low-level markup, or explain the dangers of using both types in the same document.

****I think my remarks on the typical user profile above partially addressed this. I also think partitioning the spec as suggested in #1 may obviate this need. It might also be noted that in the absence of real data on common usage, it may be a little presumptuous to assume that mixing high and low-level elements within the same document is impossible or even unadvisable. I stand by my contention, however, that it is probably unlikely to occur.

3 - Clarify the intended effect of tags on the default behaviour of synthesis systems. Should they be processed BEFORE the system performs its "non-markup behavior", or AFTER the default output has been calculated? Does this vary depending on the tag?

****I think this is absolutely unspecifiable and moreover largely misses the point of SSML. Any synthesizer engine developer motivated to support SSML will certainly implement each markup element in a way appropriate for that synthesizer. Which of the above routes is selected will depend solely on how the specified markup effect is most efficiently executed. Given that, the thing we should focus on is whether each of the element descriptions is sufficiently detailed to allow a knowledgeable engine developer to fully discern the intended effect on rendered output.

4 - Revise the F0 tags to allow for theory-neutral interpretation: if this is not done, the goal of interoperability across synthesis platforms cannot be achieved.

****I stand ready to receive a proposal on how best to go about this.

5 - Clarify what constitutes the minimum core standard, and what are optional or future extensions.

****Per W3C rules, the future study sections are due to be dropped as SSML enters the candidate phase. What remains will constitute the core specification.

6 - Provide a mechanism for extending the standard to include unforeseen cases, particularly language-specific or multilingual requirements.

****I stand ready to receive a proposal on how best to go about this.

<sayas> - Add the categories mentioned above, plus an optional "morph" value to hold agreement information.

****See my comments above on 'say-as'.

<voice> - Allow the option of retaining relative prosodic attributes (pitch, rate, etc.) when the voice is changed.

****See my comments above on the use of style sheets to preserve prosody.

<break> - Add an optional "special" value to allow language-specific and application-specific extensions.

****See comments above.

<prosody> - Add an optional "tone" value.
****Please provide more details on what, exactly, would be meant by <prosody tone="???">, which is what I believe you are attempting to propose.

<audio> - Consider a <lowlevel> extension to allow synchronisation of speech with other resources.

****Superseded by SMIL. See my remarks above.

-----Original Message-----
From: Alex.Monaghan@Aculab.com [mailto:Alex.Monaghan@Aculab.com]
Sent: Monday, January 08, 2001 7:42 AM
To: kuansanw@microsoft.com; mark.r.walker@intel.com
Cc: eric.keller@imm.unil.ch
Subject: RE: W3C speech synth mark-up - the COST258 response

> dear mark and kuansan,

> thanks for your emails. i do think that there are several unresolved issues in the "last call" document, and that these could make the proposal unimplementable on most current synthesis platforms. i also believe that the W3C standards have an obligation to be universal with respect to the language to be synthesised and the theoretical model to be used. unfortunately, the current proposal does not fulfil this obligation. the reasons for this are understandable: it derives from SABLE and JSML, which were quite narrow in their view of speech synthesis. i would support the suggestion that the "last call" status be rescinded.

> if there is a lack of enthusiasm for further work on this proposal (this has been indicated by several people, and is borne out by the email archive), then i can offer an injection of such enthusiasm. i don't fully understand the workings of the W3C, but i am happy to get involved in any way i can.

> if the "last call" status and deadlines remain, then i would suggest either:

> 1 - a "layered" standard such as the MPEG standard, where the level of conformity is variable, OR

> 2 - an explicit mechanism for language-specific and theory-specific variations, whereby standards could evolve for awkward cases (e.g. the agreement of numbers in spanish, or the specification of declination in a Fujisaki model of F0).

> personally, i prefer (2) since it is less proscriptive and allows mark-up to develop organically as needed: incorporation into the standard could come later.

> i look forward to further discussion of these points.

> regards,
> alex.