Comments on Mark Walker's comments from Richard Sproat on 2001-01-20 (www-voice@w3.org from January to March 2001)

From: Richard Sproat <rws@research.att.com>
Date: Sat, 20 Jan 2001 12:24:32 -0500
To: www-voice@w3.org
Cc: mark.r.walker@intel.com, Alex.Monaghan@Aculab.com
Message-Id: <200101201724.MAA21383@tabasco.research.att.com>

While I agree with much of Mark's response to COST 258's, Alex's and
my previous comments, a couple of points seem in need of further
clarification. 

With respect to the following two points:

  ****This is a significant issue and one that I was completely unaware of
  until you raised it. Obviously, the early days of the SSML requirements
  phase were dominated (apparently) by firms possessing synthesizers modeling
  intonation with the former approach.  I would welcome any proposal that
  expanded the ability of the low-level elements to specify intonation in a
  less theory-biased manner.  

  ****In answering Richard Sproat's specific concern about long-unit
  synthesizers, I will propose that the decision by any synthesis engine
  provider to support the SSML specification is probably ultimately driven by
  economics, not technology.  Long-unit synthesizers like AT&T NextGen for
  example, are very large and are deployed in tightly confined application
  environments like voice portal providers.  The synthesis text authors are
  likely to be employed by the portal itself.  The text is authored
  specifically for the portal engine, and the authors are likely to be very
  familiar with the performance of the system.  Finally, the enormous size of
  the concatenative database means that much of the ability to produce very
  specific and expressive speech sequences already resides in the system.  The
  economic benefits of implementing SSML are therefore probably minimal for
  engine providers of this type.

The points are actually closely related.

Let's start with the second point, and let's grant for the sake of
argument that Mark is right when he says that "the ability to produce
very specific and expressive speech sequences already resides" in
large unit synthesizers.  

So let's say that you want to synthesize a particular utterance with a
particular prosody. Will the particular desired prosody be the one
that comes out of the system? Chances are it will not. So what are you
supposed to do about that? Obviously one thing you could do is simply
accept the output of the system, assuming (again for the sake of
argument) that it sounds "natural" and "expressive". But if you really
don't want it said that way, then you have a problem. Presumably, in
that case, you still want a markup scheme to be able to control the
output. 

Large database methods provide a couple of possibilities here:

1) With luck the alternative you want may be in the database already,
   and you just have to squeeze it out somehow, presumably with the
   use of markup.

2) The system may allow runtime modification of the output in which
   case the same kinds of controls that are already present in more
   traditional approaches to synthesis will be available. 

At present, developers of large concatenative unit systems
seem to be putting their faith in (1), which is perhaps reasonable
for restricted domains but not for unrestricted domains (e.g.,
independently authored material). Option (2) is
deprecated largely because once you start fiddling too much with the
prosody, things tend to degrade: the AT&T NextGen system, with
*complete* control of prosody using the old AT&T
(Liberman-Pierrehumbert) intonation model sounds only marginally
better than the earlier AT&T system it supplanted. (I know this
because I have heard it.) That's of course a worst-case scenario, but
other attempts to impose a synthetic contour may lead to output which
is to some degree degraded. No doubt this situation will change as
people figure out ways to improve prosody modification, but in the
current situation what you have is a system that will not necessarily
be able to implement what you want to hear.

Once again, the system's rendition may well be perfectly natural (or
not), but it may equally well not be what you want.

I don't understand Mark's claim about systems such as AT&T NextGen
being deployed in "tightly constrained" environments such as voice
portals. Even now the system is being used to read e-mail for AOL and
Yahoo. E-mail can hardly be called a tightly constrained environment
(and this fact shows in the renditions you get). I see no reason why
systems such as NextGen should not be offered in services where one
wants to produce custom output for text of the user's -- not the voice
portal provider's -- choice. And in such situations, developers who
don't work for the voice portal will want access to sensible markup
possibilities, just as random individuals currently have access to
HTML for customizing their own web pages.

Turning now to the first point, while it is technically accurate that
the early design was proposed by people whose bent was strongly
towards the Pierrehumbert camp, I don't think this changes the point
that you want to have control over certain aspects of intonation
independently of the theory you adopt. While terms like "topline" or
"baseline" may have no meaning in some approaches, one would still
like some way of implementing the idea that a particular passage of
text should be rendered within a certain pitch range. Presumably most
theories of intonation can accommodate such notions, and given that,
it seems to me that it is largely a matter of implementational detail
whether or not they actually have primitives such as "topline" or
"baseline" in the system.  The reason this first point is related to
the second is that if some implemented models of intonation do *not*
allow this kind of modification, then this again becomes an issue of
whether or not a markup scheme such as SSML should accommodate such
models, by effectively weakening what one has control over.
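
As a rough illustration of a theory-neutral pitch-range control, here
is a minimal Python sketch. It assumes the intonation contour is
available as a plain list of F0 values in Hz; the function name
fit_to_pitch_range and its parameters are hypothetical and are not
part of SSML or of any synthesizer mentioned here. The sketch only
rescales a contour into a requested range and says nothing about how
the contour was generated, with or without "topline" and "baseline"
primitives.

```python
def fit_to_pitch_range(f0_hz, floor_hz, ceiling_hz):
    """Linearly rescale an F0 contour (Hz) so that it spans [floor_hz, ceiling_hz].

    Hypothetical helper: the contour's shape is kept, only its range changes.
    """
    if floor_hz > ceiling_hz:
        raise ValueError("floor_hz must not exceed ceiling_hz")
    if not f0_hz:
        return []
    lo, hi = min(f0_hz), max(f0_hz)
    if hi == lo:
        # A flat contour has no shape to preserve; place it mid-range.
        return [(floor_hz + ceiling_hz) / 2.0] * len(f0_hz)
    scale = (ceiling_hz - floor_hz) / (hi - lo)
    return [floor_hz + (f - lo) * scale for f in f0_hz]


if __name__ == "__main__":
    # A rising contour squeezed into a narrower range.
    print(fit_to_pitch_range([100.0, 150.0, 200.0], 120.0, 160.0))
```

A linear rescaling in Hz is only one possible choice; a real system
might well prefer a logarithmic or perceptual scale.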

  I would also welcome a concrete proposal to deal with tone
  languages.

Tone specifications are part of phonetic transcriptions: there are
standards for transcribing tone as part of the phonetic transcription
for any language for which this is relevant. The issue here, as Alex
points out, is how to specify only the tone, without having to specify
the rest of the phonetic details. One way to do this would be to allow
a modifier attribute for a phonetic transcription that says that the
transcription is only a tone transcription. So, to specify a 1-2 tone
sequence for a Mandarin disyllabic word, one might have something
like: 

 <phoneme ph="1-2" type="tone">
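
To show how a processor might consume such a tone-only transcription,
here is a small Python sketch. It assumes that the tones in the ph
value are separated by hyphens, one per syllable, and that Mandarin
tones are labelled 1 to 4 with 5 for the neutral tone; both the
labelling and the function names are assumptions made for
illustration, not part of the SSML draft. The syllables come from
wherever the system's own text analysis puts them; only the tones are
supplied by the markup.

```python
# Assumption: Mandarin tones 1-4, with 5 standing for the neutral tone.
VALID_TONES = {1, 2, 3, 4, 5}


def parse_tone_string(ph):
    """Turn a tone-only transcription such as "1-2" into a list of tone numbers."""
    tones = []
    for part in ph.split("-"):
        if not part.isdigit() or int(part) not in VALID_TONES:
            raise ValueError(f"not a tone label: {part!r}")
        tones.append(int(part))
    return tones


def attach_tones(syllables, ph):
    """Pair each syllable with its tone, leaving the segmental content untouched."""
    tones = parse_tone_string(ph)
    if len(tones) != len(syllables):
        raise ValueError("need exactly one tone per syllable")
    return [f"{syl}{tone}" for syl, tone in zip(syllables, tones)]


if __name__ == "__main__":
    # zhongguo ("China") is a disyllabic word with a 1-2 tone sequence.
    print(attach_tones(["zhong", "guo"], "1-2"))
```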


  In languages with well-developed morphology, such as Finnish or Spanish, the
  pronunciation of numbers and abbreviations depends not only on whether they
  are ordinal or cardinal but also on their gender, case and even semantic
  properties. These are often not explicit, or even predictable, from the
  text. It would be advisable to extend the <sayas> tag to include an
  optional "morph" value to hold such information.


  ****I am open to proposals in this area, but there would need to be
  substantially more data on its potential usefulness.  I agree with Richard
  Sproat that successful utilization of such a tag might require linguistic
  expertise that would not likely be possessed by portal and web text authors
  who I believe constitute the majority of potential users of this
  specification.  I would also wonder why markup would be required to resolve
  ambiguity in the specification of a property that would likely be already
  embodied as a part of the default knowledge base of a Finnish, Spanish, etc
  synthesizer.

I think the only way to do this and keep it usable by
non-specialists is to allow users to spell out how they want a
particular item (e.g., a number) to be said when the system doesn't
get it right. This will inevitably happen: a system for Finnish or
Spanish would certainly have this kind of information in its
knowledge base, but it is going to make mistakes, and users need
some way of correcting those mistakes.
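
The spell-it-out idea can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: DEFAULT_EXPANSIONS
is a toy stand-in for a synthesizer's built-in number expansion, and
the render function and the spoken_as field are hypothetical names,
not anything defined by SSML. The point is that the author needs no
morphological vocabulary, just the ability to write the form they
want to hear.

```python
# Toy stand-in for a Spanish synthesizer's default number expansion.
DEFAULT_EXPANSIONS = {"1": "uno", "2": "dos", "3": "tres"}


def render(tokens):
    """Render (text, spoken_as) pairs as the word string to be spoken.

    spoken_as is None unless the author has spelled out a correction.
    """
    words = []
    for text, spoken_as in tokens:
        if spoken_as is not None:
            # The author's spelled-out form always wins over the default.
            words.append(spoken_as)
        else:
            words.append(DEFAULT_EXPANSIONS.get(text, text))
    return " ".join(words)


if __name__ == "__main__":
    # The toy default ignores gender; the override supplies the feminine form.
    print(render([("1", None), ("casa", None)]))
    print(render([("1", "una"), ("casa", None)]))
```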

--

Richard Sproat               Human/Computer Interaction Research
rws@research.att.com         AT&T Labs -- Research, Shannon Laboratory
Tel: +1-973-360-8490         180 Park Avenue, Room B207, P.O.Box 971
Fax: +1-973-360-8809         Florham Park, NJ 07932-0000
----------------http://www.research.att.com/~rws/-----------------------
Received on Saturday, 20 January 2001 12:25:04 UTC