- From: Richard Sproat <rws@research.att.com>
- Date: Sat, 20 Jan 2001 12:24:32 -0500
- To: www-voice@w3.org
- Cc: mark.r.walker@intel.com, Alex.Monaghan@Aculab.com
While I agree with much of Mark's response to COST 258's, Alex's and my previous comments, a couple of points seem in need of further clarification. With respect to the following two points:

****This is a significant issue and one that I was completely unaware of until you raised it. Obviously, the early days of the SSML requirements phase were dominated (apparently) by firms possessing synthesizers modeling intonation with the former approach. I would welcome any proposal that expanded the ability of the low-level elements to specify intonation in a less theory-biased manner.

****In answering Richard Sproat's specific concern about long-unit synthesizers, I will propose that the decision by any synthesis engine provider to support the SSML specification is probably ultimately driven by economics, not technology. Long-unit synthesizers like AT&T NextGen, for example, are very large and are deployed in tightly confined application environments like voice portal providers. The synthesis text authors are likely to be employed by the portal itself. The text is authored specifically for the portal engine, and the authors are likely to be very familiar with the performance of the system. Finally, the enormous size of the concatenative database means that much of the ability to produce very specific and expressive speech sequences already resides in the system. The economic benefits of implementing SSML are therefore probably minimal for engine providers of this type.

The points are actually closely related. Let's start with the second point, and let's grant for the sake of argument that Mark is right when he says that "the ability to produce very specific and expressive speech sequences already resides" in large-unit synthesizers. So let's say that you want to synthesize a particular utterance with a particular prosody. Will the desired prosody be the one that comes out of the system? Chances are it will not. So what are you supposed to do about that? Obviously one thing you could do is simply accept the output of the system, assuming (again for the sake of argument) that it sounds "natural" and "expressive". But if you really don't want it said that way, then you have a problem. Presumably, in that case, you still want a markup scheme to be able to control the output.

Large-database methods provide a couple of possibilities here:

1) With luck the alternative you want may be in the database already, and you just have to squeeze it out somehow, presumably with the use of markup.

2) The system may allow runtime modification of the output, in which case the same kinds of controls that are already present in more traditional approaches to synthesis will be available.

At present, developers of large concatenative-unit systems seem to be putting their faith in (1), which is perhaps reasonable for restricted domains, but not for unrestricted domains (e.g. independently authored material). Option (2) is deprecated largely because once you start fiddling too much with the prosody, things tend to degrade: the AT&T NextGen system, with *complete* control of prosody using the old AT&T (Liberman-Pierrehumbert) intonation model, sounds only marginally better than the earlier AT&T system it supplanted. (I know this because I have heard it.) That is of course a worst-case scenario, but other attempts to impose a synthetic contour may also lead to output which is to some degree degraded.
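For concreteness, the kind of control an author will want to exert -- whatever synthesis technology lies underneath -- is along the following lines (the element names follow the draft's low-level markup; the particular attributes and values here are only illustrative):

  Your flight leaves at <emphasis>seven</emphasis>, <break/>
  not <prosody rate="slow" volume="loud">eleven</prosody>.

The question raised above is whether a large-unit system can honor such a request, not whether authors will want to make it.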
No doubt this situation will change as people figure out ways to improve prosody modification, but in the current situation what you have is a system that will not necessarily be able to implement what you want to hear. Once again, the system's rendition may well be perfectly natural (or not), but it may equally well not be what you want.

I don't understand Mark's claim about systems such as AT&T NextGen being deployed in "tightly constrained" environments such as voice portals. Even now the system is being used to read e-mail for AOL and Yahoo. E-mail can hardly be called a tightly constrained environment (and this fact shows in the renditions you get). I see no reason why systems such as NextGen should not be offered in services where one wants to produce custom output for text of the user's -- not the voice portal provider's -- choice. And in such situations, developers who don't work for the voice portal will want access to sensible markup possibilities, just as random individuals currently have access to HTML for customizing their own web pages.

Turning now to the first point: while it is technically accurate that the early design was proposed by people whose bent was strongly towards the Pierrehumbert camp, I don't think this changes the point that you want to have control over certain aspects of intonation independently of the theory you adopt. While terms like "topline" or "baseline" may have no meaning in some approaches, one would still like some way of implementing the idea that a particular passage of text should be rendered within a certain pitch range. Presumably most theories of intonation can accommodate such notions, and given that, it seems to me that it is largely a matter of implementational detail whether or not they actually have primitives such as "topline" or "baseline" in the system.

The reason this first point is related to the second is that if some implemented models of intonation do *not* allow this kind of modification, then this again becomes an issue of whether or not a markup scheme such as SSML should accommodate such models, by effectively weakening what one has control over.

I would also welcome a concrete proposal to deal with tone languages. Tone specifications are part of phonetic transcriptions: there are standards for transcribing tone as part of the phonetic transcription for any language for which this is relevant. The issue here, as Alex points out, is how to specify only the tone, without having to specify the rest of the phonetic details. One way to do this would be to allow a modifier attribute for a phonetic transcription that says that the transcription is only a tone transcription. So, to specify a 1-2 tone sequence for a Mandarin disyllabic word, one might have something like:

  <phoneme ph="1-2" type="tone">
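To flesh this out slightly -- the "type" attribute is of course exactly what is being proposed here, not something in the current draft, and the word chosen is just an illustration -- the element would wrap the text it applies to:

  <phoneme ph="1-2" type="tone">Zhongguo</phoneme>

Here "Zhongguo" (Mandarin, tones 1 and 2) gets its segmental pronunciation from the engine's own lexicon, while the markup supplies only the tones.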
In languages with well-developed morphology, such as Finnish or Spanish, the pronunciation of numbers and abbreviations depends not only on whether they are ordinal or cardinal but also on their gender, case and even semantic properties. These are often not explicit, or even predictable, from the text. It would be advisable to extend the <sayas> tag to include an optional "morph" value to hold such information.

****I am open to proposals in this area, but there would need to be substantially more data on its potential usefulness. I agree with Richard Sproat that successful utilization of such a tag might require linguistic expertise that would not likely be possessed by the portal and web text authors who, I believe, constitute the majority of potential users of this specification. I would also wonder why markup would be required to resolve ambiguity in the specification of a property that would likely already be embodied as part of the default knowledge base of a Finnish, Spanish, etc. synthesizer.

I think the only way to do this and have it be usable by non-specialists is going to be to allow users to spell out the way they want to say a particular, e.g., number if the system doesn't get it right. This is invariably going to happen: a system for Finnish or Spanish certainly would have this kind of information as part of its knowledge base, but it is going to make mistakes, and users need to have some way of correcting those mistakes.
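For instance -- and this is only a sketch of the kind of mechanism I have in mind, with a hypothetical "sub" attribute rather than anything in the current draft -- a Finnish author could simply write out the inflected form when the default reading comes out wrong:

  <sayas sub="kolmannessa">3.</sayas> kerroksessa

i.e. "3. kerroksessa" ("on the third floor"), where a system that fails to work out the case agreement would presumably fall back on some default form of the ordinal. The author needs no knowledge of gender, case or other morphological features -- only the ability to spell out what should be said.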
--
Richard Sproat
Human/Computer Interaction Research
AT&T Labs -- Research, Shannon Laboratory
180 Park Avenue, Room B207, P.O. Box 971
Florham Park, NJ 07932-0000
rws@research.att.com
Tel: +1-973-360-8490   Fax: +1-973-360-8809
----------------http://www.research.att.com/~rws/-----------------------

Received on Saturday, 20 January 2001 12:25:04 UTC