RE: mark's and richard's comments on SSML from Walker, Mark R on 2001-01-22 (www-voice@w3.org from January to March 2001)

From: Walker, Mark R <mark.r.walker@intel.com>
Date: Mon, 22 Jan 2001 13:20:31 -0800
To: "'Alex.Monaghan@Aculab.com'" <Alex.Monaghan@Aculab.com>
Cc: www-voice@w3.org
Message-ID: <638EC1B28663D211AC3E00A0C96B78A8042B39E9@orsmsx40.jf.intel.com>
Specific issue: Are SSML-compliant engines "...free to do nothing with the
tags as long as their parsers don't complain?"
This is not allowed under the SSML conformance criteria:
http://www.w3.org/TR/speech-synthesis#S5.3


-Regards,

Mark






-----Original Message-----
From: Alex.Monaghan@Aculab.com [mailto:Alex.Monaghan@Aculab.com]
Sent: Monday, January 22, 2001 2:23 AM
To: www-voice@w3.org
Subject: Re: mark's and richard's comments on SSML


(i've only sent this to the group, to avoid multiple copies.)

this is only a partial response to mark and richard, whom i thank for their
comments: i'll digest them and respond in due course.

one thing bothering me is the meaning of system compliance. i was assuming a
definition along the lines of "System X will accept input containing SSML,
and will do something sensible and predictable with it", which would serve
the goal of consistent cross-platform output. what mark and richard seem to
be saying (Mark: "it seems highly likely that
high-level elements like 'break' will result in highly variable output
across different synthesizers, and that will be seen as normal and perfectly
compliant behavior." Richard: "in the current situation what you have is a
system that will not necessarily be able to implement what you want to
hear.") is that compliance will be defined as "System X will accept input
containing SSML, but there are no guarantees as to what it will do with it" -
this would make many HTML parsers SSML-compliant immediately.

i realise that complete consistency of implementation cannot be guaranteed
across different synthesisers. are there any plans to ensure that compliant
systems actually respond appropriately to SSML tags, or will they be free to
do nothing with the tags as long as their parser doesn't complain?

	
alex.


> -----Original Message-----
> From:	Richard Sproat [SMTP:rws@research.att.com]
> Sent:	20 January 2001 17:25
> To:	www-voice@w3.org
> Cc:	mark.r.walker@intel.com; Alex.Monaghan@Aculab.com
> Subject:	Comments on Mark Walker's comments
> 
> 
> While I agree with much of Mark's response to COST 258's, Alex's and
> my previous comments, a couple of points seem in need of further
> clarification. 
> 
> With respect to the following two points:
> 
>   ****This is a significant issue and one that I was completely unaware of
>   until you raised it. Obviously, the early days of the SSML requirements
>   phase were dominated (apparently) by firms possessing synthesizers
>   modeling intonation with the former approach.  I would welcome any
>   proposal that expanded the ability of the low-level elements to specify
>   intonation in a less theory-biased manner.
> 
>   ****In answering Richard Sproat's specific concern about long-unit
>   synthesizers, I will propose that the decision by any synthesis engine
>   provider to support the SSML specification is probably ultimately driven
>   by economics, not technology.  Long-unit synthesizers like AT&T NextGen
>   for example, are very large and are deployed in tightly confined
>   application environments like voice portal providers.  The synthesis text
>   authors are likely to be employed by the portal itself.  The text is
>   authored specifically for the portal engine, and the authors are likely
>   to be very familiar with the performance of the system.  Finally, the
>   enormous size of the concatenative database means that much of the
>   ability to produce very specific and expressive speech sequences already
>   resides in the system.  The economic benefits of implementing SSML are
>   therefore probably minimal for engine providers of this type.
> 
> The points are actually closely related.
> 
> Let's start with the second point, and let's grant for the sake of
> argument that Mark is right when he says that "the ability to produce
> very specific and expressive speech sequences already resides" in
> large unit synthesizers.  
> 
> So let's say that you want to synthesize a particular utterance with a
> particular prosody. Will the particular desired prosody be the one
> that comes out of the system? Chances are it will not. So what are you
> supposed to do about that? Obviously one thing you could do is simply
> accept the output of the system, assuming (again for the sake of
> argument) that it sounds "natural" and "expressive". But if you really
> don't want it said that way, then you have a problem. Presumably, in
> that case, you still want a markup scheme to be able to control the
> output. 
> 
> Large database methods provide a couple of possibilities here:
> 
> 1) With luck the alternative you want may be in the database already,
>    and you just have to squeeze it out somehow, presumably with the
>    use of markup.
> 
> 2) The system may allow runtime modification of the output in which
>    case the same kinds of controls that are already present in more
>    traditional approaches to synthesis will be available. 
> 
> At present, developers of large concatenative unit systems
> seem to be putting their faith in (1), which is perhaps
> reasonable for restricted domains but not for unrestricted
> domains (e.g. independently authored material). Option (2) is
> deprecated largely because once you start fiddling too much with the
> prosody, things tend to degrade: the AT&T NextGen system, with
> *complete* control of prosody using the old AT&T
> (Liberman-Pierrehumbert) intonation model sounds only marginally
> better than the earlier AT&T system it supplanted. (I know this
> because I have heard it.) That's of course a worst-case scenario, but
> other attempts to impose a synthetic contour may lead to output which
> is to some degree degraded. No doubt this situation will change as
> people figure out ways to improve prosody modification, but in the
> current situation what you have is a system that will not necessarily
> be able to implement what you want to hear.
> 
> Once again, the system's rendition may well be perfectly natural (or
> not), but it may equally well not be what you want.
> 
> I don't understand Mark's claim about systems such as AT&T NextGen
> being deployed in "tightly constrained" environments such as voice
> portals. Even now the system is being used to read e-mail for AOL and
> Yahoo. E-mail can hardly be called a tightly constrained environment
> (and this fact shows in the renditions you get). I see no reason why
> systems such as NextGen should not be offered in services where one
> wants to produce custom output for text of the user's -- not the voice
> portal provider's -- choice. And in such situations, developers who
> don't work for the voice portal will want access to sensible markup
> possibilities, just as random individuals currently have access to
> HTML for customizing their own web pages.
> 
> Turning now to the first point, while it is technically accurate that
> the early design was proposed by people whose bent was strongly
> towards the Pierrehumbert camp, I don't think this changes the point
> that you want to have controls of certain aspects of intonation
> independently of the theory you adopt. While terms like "topline" or
> "baseline" may have no meaning in some approaches, one would still
> like some way of implementing the idea that a particular passage of
> text should be rendered within a certain pitch range. Presumably most
> theories of intonation can accommodate such notions, and given that,
> it seems to me that it is largely a matter of implementational detail
> whether or not they actually have primitives such as "topline" or
> "baseline" in the system.  The reason this first point is related to
> the second is that if some implemented models of intonation do *not*
> allow this kind of modification, then this again becomes an issue of
> whether or not a markup scheme such as SSML should accommodate such
> models, by effectively weakening what one has control over.
> 
>   I would also welcome a concrete proposal to deal with tone
>   languages.
> 
> Tone specifications are part of phonetic transcriptions: there are
> standards for transcribing tone as part of the phonetic transcription
> for any language for which this is relevant. The issue here, as Alex
> points out, is how to specify only the tone, without having to specify
> the rest of the phonetic details. One way to do this would be to allow
> a modifier attribute for a phonetic transcription that says that the
> transcription is only a tone transcription. So, to specify a 1-2 tone
> sequence for a Mandarin disyllabic word, one might have something
> like: 
> 
>  <phoneme ph="1-2" type="tone">
> 
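To make the tone-only proposal above concrete, here is a minimal Python sketch of how a processor might read such an element. It rests on several assumptions: the type="tone" attribute is only the suggestion made in this message and is not part of the SSML draft, the helper name tone_sequence is hypothetical, and a closing tag and placeholder text have been added so that the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET
from typing import List


def tone_sequence(fragment: str) -> List[int]:
    """Return the tone numbers from a tone-only <phoneme> element.

    Assumes the proposed (non-standard) type="tone" attribute and a
    hyphen-separated list of tone numbers in the ph attribute.
    """
    element = ET.fromstring(fragment)
    if element.tag != "phoneme" or element.get("type") != "tone":
        raise ValueError("not a tone-only phoneme element")
    # Split "1-2" into one integer per syllable.
    return [int(tone) for tone in element.get("ph", "").split("-") if tone]


# "word" is a stand-in for the Mandarin disyllabic word being marked up.
markup = '<phoneme ph="1-2" type="tone">word</phoneme>'
print(tone_sequence(markup))
```

A real engine would still need to align each tone with a syllable of the enclosed text; the sketch only shows that the tone information can be carried without the rest of the phonetic detail.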
> 
>   In languages with well-developed morphology, such as Finnish or Spanish,
>   the pronunciation of numbers and abbreviations depends not only on
>   whether they are ordinal or cardinal but also on their gender, case and
>   even semantic properties. These are often not explicit, or even
>   predictable, from the text. It would be advisable to extend the <sayas>
>   tag to include an optional "morph" value to hold such information.
> 
> 
>   ****I am open to proposals in this area, but there would need to be
>   substantially more data on its potential usefulness.  I agree with
>   Richard Sproat that successful utilization of such a tag might require
>   linguistic expertise that would not likely be possessed by portal and
>   web text authors, who I believe constitute the majority of potential
>   users of this specification.  I would also wonder why markup would be
>   required to resolve ambiguity in the specification of a property that
>   would likely be already embodied as a part of the default knowledge base
>   of a Finnish, Spanish, etc. synthesizer.
> 
> I think the only way to do this and have it be usable by
> non-specialists is going to be to allow users to spell out the way
> they want to say a particular, e.g., number if the system doesn't get
> it right. This is invariably going to happen: a system for Finnish or
> Spanish certainly would have this kind of info as part of the
> knowledge base, but it is going to make mistakes, and users need to
> have some way of correcting those mistakes.
> 
> --
> 
> Richard Sproat               Human/Computer Interaction Research
> rws@research.att.com         AT&T Labs -- Research, Shannon Laboratory
> Tel: +1-973-360-8490         180 Park Avenue, Room B207, P.O.Box 971
> Fax: +1-973-360-8809         Florham Park, NJ 07932-0000
> ----------------http://www.research.att.com/~rws/-----------------------
> 
>
Received on Monday, 22 January 2001 16:21:18 UTC