RE: W3C speech synth mark-up - the COST 258 response

Alex-

****Attached below is a detailed response to your original remarks.
Again, thank you and your COST 258 colleagues for taking the time
to review and comment on the voice browser SSML.  In responding,
I have adopted some of the suggestions made by Richard Sproat.  His
comments and the COST 258 comments form the bulk of the changes
now proposed for SSML during this last-call phase.




A Response to the W3C Draft Proposal for a Speech Synthesis Mark-up Language

from COST 258, European Co-Operative Action on Improving the Naturalness of
Speech Synthesis

(http://www.unil.ch/imm/docs/LAIP/COST_258/cost258.htm)

Editor: Alex Monaghan, Aculab plc, UK (Alex.Monaghan@aculab.com)


Background

COST 258 is a consortium of European speech synthesis experts from 17
countries.
It is funded by the European Commission, and its aim is to promote
co-operative research
to improve the naturalness of synthetic speech. Its members come from both
academic and
industrial R&D centres, including at least five providers of commercial
speech synthesis
systems. For more information, see the website given above.

The W3C proposal was discussed at a meeting of COST 258 in September 2000.
The present
document collates the subsequent reactions and responses from members.
It makes both general and specific points about the proposal, and suggests
several
modifications. While we are broadly sympathetic to, and supportive of, the
attempt to
standardise speech synthesis markup and to increase consistency across
different
synthesisers, we feel that there are many obstacles to such an attempt and
that some of
these obstacles are currently insurmountable.


General Points

1 - It is not clear who the intended users of this markup language are.
There are two
obvious types of possible users: speech synthesis system developers, and
application
developers. The former may well be concerned with low-level details of
timing, pitch and
pronunciation, and be able to specify these details (F0 targets, phonetic
transcriptions, pause durations, etc.). The latter group are much more
likely to be
concerned with specifying higher-level notions such as levels of boundary,
degrees of
emphasis, fast vs slow speech rate, and formal vs casual pronunciation.


****The point regarding confusion over the intended users of SSML is on
target.  The specification of support for markup at different levels of the
synthesis process has resulted in a markup language that, as Richard Sproat
points out, contains element groups addressed at both expert and non-expert
users.  This was not by design, however, but was simply the result of the
voice browser committee's deliberations.  Having said that, we believe it is
ultimately advantageous to have a range of elements that enable high-, mid-,
and low-level control over synthesized output.

****As I mentioned in my previous letter, I believe it is unlikely that the
low-level elements will ever be employed directly by application developers,
even those developers that possess synthesis expertise. Low-level element
sequences are most likely to be the output generated by SSML authoring tools
employed by both expert and non-expert content authors in the preparation of
synthesize-able material.  It is not unreasonable to imagine that the tool
users would use the high-level markup in the user interface, while
'compiling' the resulting text into XML pitch/phoneme sequences that closely
represent the wishes of the author for the rendered output.  This is
obviously highly speculative, since no tools of this sort are currently
available.  I can report, however, that this usage scenario has been
discussed within Intel and in other venues with companies outside
the W3C, and has generally been met with approval.
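
****To make that concrete (purely a sketch - the element names follow the
current draft, but the attributes and values shown are illustrative only),
an authoring tool might accept high-level input such as

<emphasis level="strong"> quarterly savings </emphasis>

and, under the author's review, 'compile' it into lower-level output like

<prosody pitch="+15%" rate="-10%"> quarterly savings </prosody>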



2 - It is clear that the proposal includes two, possibly three, different
levels of
markup. For F0, for instance, there is the <emphasis> tag (which would be
realised as a
pitch excursion in most systems), the <prosody contour> tag which allows
finer control,
and the low-level <pitch> tag which is a proposed extension. There is very
little
indication of best practice in the use of these different levels (e.g. which
type of
user should use which level), and no explanation of what should happen if
the different
levels are combined (e.g. a <pitch contour> specification inside an
<emphasis>
environment).


****Again, the current form of the specification was largely developed in a
vacuum of information on potential usage models.  As the specification moves
into recommendation form and is slowly adopted within the speech application
community, it is anticipated that best-practice and common-usage information
will emerge and will be incorporated in future revisions of the
specification.  It might be anticipated, based on my response to remark #1,
that the intermixing of high and low-level markup in a given
application would be unlikely.  I might also remark that you have suggested
that this would be difficult, but not impossible.  Does the possibility of
mixing high and low-level elements really represent an **insurmountable**
barrier to supporting SSML?  Please provide more detail.
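
****For concreteness, the combination in question would look something like
the following (a sketch only - both elements appear in the current draft,
but the contour value syntax shown here is illustrative):

<emphasis>
  <prosody contour="(0%,+10%)(50%,+30%)(100%,+5%)"> really </prosody>
</emphasis>

Whether the emphasis-induced pitch excursion and the explicit contour should
compose, or whether one should simply win outright, is exactly the kind of
guidance that will have to emerge from usage.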


3 - The notion of "non-markup behavior" is confusing. On the one hand, there
seems to be
an assumption that markup will not affect the behaviour of the system
outside the tags,
and that the markup therefore complements the system's unmarked performance,
but on the
other hand there are references to "over-riding" the system's default
behaviour. In
general, it is unclear whether markup is intended to be superimposed on the
default
behaviour or to provide information which modifies that behaviour.


****The effect of markup in supplanting the un-marked behavior of a given
synthesizer obviously depends on the type and the particular markup level
being specified.  In some cases, default behavior will be completely
supplanted, and in other cases markup will result in the superimposition of
relative effects.  The text-analysis guides like 'say-as' are relatively
unambiguous in that they completely supplant the default synthesizer
behavior.  Specifying 'emphasis', however, will likely result in relative
changes in the pitch, duration, etc. of the utterance that will manifest
perceptually as a superimposition.


The use of the <break> element, for instance, is apparently intended "to
override the
typical automatic behavior", but the insertion of a <break> tag may have
non-local
repercussions which are very hard to predict. Take a system which assigns
prosodic
boundaries stochastically, and attempts to balance the number and length of
units at
each prosodic level. The "non-markup behavior" of such a system might take
the input
"Big fat cigars, lots of money." and produce two balanced units: but will
the input
"Big fat <break/> cigars, lots of money." produce three unbalanced units
(big fat,
cigars, lots of money), or three more balanced units (big fat, cigars lots,
of money),
or four balanced units (big fat, cigars, lots of, money), or six single-word
units, or
something else? Which would be the correct interpretation of the markup?


****In this instance, a synthesis text author would reasonably be expected
to specify more precisely what he/she intended for the resulting
prosodic units by adding a 'size' or a 'time' attribute to the 'break'
markup element.  Having said that, I must say I have no idea what the common
default behavior for the above utterance would be, and for that reason it
would definitely NOT be logical to add markup to alter it.  It seems
reasonable that even a non-expert synthesis text author would utilize common
sense and realize that specifying markup in sequences where default
synthesizer behavior is highly variable would be a dicey proposition. 
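
****For example (the 'size' and 'time' attributes are those of the current
draft; the placements and values here are illustrative only):

Big fat cigars, <break size="large"/> lots of money.
Big fat cigars, <break time="500ms"/> lots of money.

Either form pins down the intended unit structure far more tightly than a
bare <break/>.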

****However, in addressing your concern, it seems highly likely that
high-level elements like 'break' will result in highly variable output
across different synthesizers, and that this will be seen as normal and
perfectly compliant behavior.  Systems like the one you cite above might
elect to 'balance' the effects of the <break/> insertion over a context
larger than the specified interval.  Others may elect to completely localize
the effect, even to the detriment of the perceptual quality.  Both behaviors
would be compliant, and both would essentially convey the coarsely specified
intent of the markup text author.  I will add language to the spec that
makes this point clear.


4 - Many of the tags related to F0 presuppose that pitch is represented as a
linear
sequence of targets. This is the case for some synthesisers, particularly
those using
theories of intonation based on the work of Bruce, Ladd or Pierrehumbert.
However, the
equally well-known Fujisaki approach is also commonly used in synthesis
systems, as are
techniques involving the concatenation of natural or stylised F0 contours:
in these
approaches, notions such as pitch targets, baselines and ranges have very
different
meanings and in some cases no meaning at all. The current proposal is thus
far from
theory-neutral, and is not implementable in many current synthesisers.


****This is a significant issue and one that I was completely unaware of
until you raised it. Obviously, the early days of the SSML requirements
phase were dominated (apparently) by firms possessing synthesizers modeling
intonation with the former approach.  I would welcome any proposal that
expanded the ability of the low-level elements to specify intonation in a
less theory-biased manner.  

****In answering Richard Sproat's specific concern about long-unit
synthesizers, I will propose that the decision by any synthesis engine
provider to support the SSML specification is probably ultimately driven by
economics, not technology.  Long-unit synthesizers like AT&T NextGen for
example, are very large and are deployed in tightly confined application
environments like voice portal providers.  The synthesis text authors are
likely to be employed by the portal itself.  The text is authored
specifically for the portal engine, and the authors are likely to be very
familiar with the performance of the system.  Finally, the enormous size of
the concatenative database means that much of the ability to produce very
specific and expressive speech sequences already resides in the system.  The
economic benefits of implementing SSML are therefore probably minimal for
engine providers of this type.




5 - The current draft does not make clear what will be in "the standard" and
what will
be optional or future extensions. The <lowlevel> tag is the most obvious
example, but
various other tags mentioned above are not universally implementable and
would therefore
prevent many systems from complying with the standard.

****The 'future study' elements are generally denoted as such, but I will
increase their visibility as non-spec items.  In any case, all future study
elements are to be dropped before publication as a Candidate Recommendation.

****Please specify what tags, in addition to F0 elements, would not be
universally implementable.



6 - There is no provision for local or language-specific additions, such as
different
classes of abbreviations (e.g. the distinction between a true acronym such
as DEC and an
abbreviation such as NEC), different types of numbers (animate versus
inanimate in many
languages), or the prosodic systems of tone languages. Some specific
examples are
discussed below, but provision for anything other than English is minimal in
the current
proposal.

****The 'acronym' attribute of 'say-as' will be modified to allow the
explicit specification of true acronyms and abbreviations. See below.  The
notion of language-specific additions is intriguing and I would welcome a
proposal for a general markup method for doing this.  I would also welcome a
concrete proposal to deal with tone languages.


Specific Tags

<sayas> - Several categories could be added to this tag, including telephone
numbers,
credit card numbers, and the distinction between acronyms (DEC, DARPA, NASA)
and
letter-by-letter abbreviations (USA, IBM, UK).


****'telephone' is a 'say-as' attribute value.  I'm unable to envision an
example where <say-as type="number:digits"> would not suffice for credit
card numbers.  
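
****For instance (the type value is the one cited in the previous sentence;
the digits are of course a placeholder):

<say-as type="number:digits"> 1234 5678 9012 3456 </say-as>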

****I honestly don't know how 'acronym' got so mangled, but somehow it did.
I think that at some point, it was pointed out that 'acronym' would be the
expected default behavior of a synthesizer encountering something like 'DEC'
- that is, the synthesizer would, in the absence of markup, attempt to
pronounce it as a word.  That's why 'literal' was kept as the overriding
effect of markup - but the attribute value ended up being named 'acronym'
instead of 'literal' or 'spell-out'!

****Accordingly, as Richard Sproat has proposed, 
<say-as type="spell-out"> USA </say-as> 
and 
<say-as type="acronym"> DEC </say-as> 
will replace the current 'acronym' attribute value.


In languages with well-developed morphology, such as Finnish or Spanish, the
pronunciation of numbers and abbreviations depends not only on whether they
are ordinal
or cardinal but also on their gender, case and even semantic properties.
These are often
not explicit, or even predictable, from the text. It would be advisable to
extend the
<sayas> tag to include an optional "morph" value to hold such information.


****I am open to proposals in this area, but there would need to be
substantially more data on its potential usefulness.  I agree with Richard
Sproat that successful utilization of such a tag might require linguistic
expertise that would not likely be possessed by the portal and web text
authors who I believe constitute the majority of potential users of this
specification.  I would also wonder why markup would be required to resolve
ambiguity in the specification of a property that would likely already be
embodied as part of the default knowledge base of a Finnish, Spanish, etc.
synthesizer.


<voice> - It seems unnecessary to reset all prosodic aspects to their
defaults when the
voice changes. This prevents the natural-sounding incorporation of direct
speech using a
different voice, and also makes the reading of bilingual texts (common in
Switzerland,
Eastern Europe, the Southern USA, and other exotic places) very awkward.
Although
absolute values cannot be carried over from voice to voice, it should be
possible to
transfer relative values (slow/fast, high/medium/low, etc.) quite easily.


****Usage note #3 suggests a simple method, using a style sheet, to preserve
prosodic properties across voice changes, both fundamental and relative.  My
own experience with XML confirms that this is quite straightforward.  In
generating the specification we had to choose a default behavior, and
'reset' was selected as the most commonly desired result of an embedded
voice change.  I will take a note to generate an example style sheet that
demonstrates how prosody may be preserved and include it in the next rev.
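
****As a sketch of the intended end result (the element and attribute names
follow the current draft; the values are illustrative; a style sheet would
simply automate the re-statement of the relative values inside the embedded
voice):

<prosody rate="slow">
  The announcer paused, and then she said,
  <voice gender="female">
    <prosody rate="slow"> "Not today." </prosody>
  </voice>
</prosody>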



<break> - Some languages have a need for more levels of prosodic boundary
below a minor
pause, and some applications may require boundaries above the paragraph
level. It would
be advisable to add an optional "special" value for these cases.

****I'm not sure I fully understand the issue, but I get the (weak)
impression that the problem may lie in specifying the prescribed
effect on perceptual quality when a <break/> element is inserted to override
default behavior in an utterance where that behavior might be
expected to vary significantly across different synthesizer engines.  If
this is the case, then please reference my remarks above.  If not, please
provide more detail and examples.

****In any case, the 'special' attribute is a rather vacuous term for the
complex cases you cite above.  Please provide more examples.  Perhaps a
suite of new attributes for these cases would be appropriate.



<prosody> - There is currently no provision for languages with lexical tone.
These
include many commercially important languages (e.g. Chinese, Swedish,
Norwegian), as
well as most of the other languages of the world.

****You are correct that tonal languages are not specifically supported,
although there were informal proposals to do so.  I reiterate my
willingness to seriously review any formal proposal to expand SSML to tonal
languages.



<rate> - "Words per minute" values are not reliably implementable in any
current
synthesiser, although they may be a readily understandable measure of
approximate speech
rate. It is perhaps equally important to be able to specify the dynamics of
speech rate
- accelerations, decelerations, constancies.

****I think it is logical to assume that the 'rate' attribute of the
'prosody' element would only be employed by a markup-wielding author to
alter speaking rate over long sequences.  I think it is also logical to
assume that, even though a given synthesizer may not possess an algorithmic
representation of WPM, it would have some intrinsic notion of speech rate
that could be modified to render the intent of the non-expert markup author.
And not to sound like a broken record, but I would welcome a concrete
proposal for adding accelerations, etc. to the 'prosody' element.
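
****For example, a non-expert author who simply wants a long passage
rendered noticeably slower might write (a sketch; the label value is
illustrative):

<prosody rate="slow"> ... entire passage ... </prosody>

The engine is then free to map 'slow' onto whatever intrinsic notion of
speech rate it actually possesses.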




<audio> - Multimodal systems (e.g. animations) are likely to require precise
synchronisation of audio, images and other resources. This may be beyond the
scope of
the proposed standard, but could be included in the <lowlevel> tag.

****Multimodal synchronization is definitely desirable, but it is indeed
beyond the scope of SSML.  That was the original intent of the 'mark'
element, which I am now recommending be removed.  The reason is that this
capability has been superseded by the rapidly evolving SMIL markup
standard.  Aron Cohen (head of SMIL and an Intel colleague) informs me that
SSML elements embedded in a SMIL sequence specification could form the basis
of an animation/synthesis application.  I will continue to monitor SMIL with
an eye to eventually producing an example fragment demonstrating this
capability.



Suggested Modifications


1 - Distinguish clearly between tags intended for speech synthesis
developers and tags
intended for application designers. Perhaps two separate markup languages
(high-level
and low-level) should be specified.

****I support this proposal and will work with COST 258 to further refine
it.

2 - Clarify the intended resolution of conflicts between high-level and
low-level
markup, or explain the dangers of using both types in the same document.

****I think my remarks on the typical user profile above partially address
this.  I also think partitioning the spec as suggested in #1 may obviate
this need.  It might also be noted that, in the absence of real data on
common usage, it may be a little presumptuous to assume that mixing high
and low-level elements within the same document is impossible or even
unadvisable.  I stand by my contention, however, that it is probably
unlikely to occur.

3 - Clarify the intended effect of tags on the default behaviour of
synthesis systems.
Should they be processed BEFORE the system performs its "non-markup
behavior", or AFTER
the default output has been calculated? Does this vary depending on the tag?

****I think this is absolutely unspecifiable and moreover largely misses the
point of SSML.  Any synthesizer engine developer motivated to support SSML
will certainly implement each markup element in a way appropriate for that
synthesizer.  Which of the above routes is selected will depend solely on
how the specified markup effect is most efficiently executed.  Given that, the
thing we should focus on is whether each of the element descriptions is
sufficiently detailed to allow a knowledgeable engine developer to fully
discern the intended effect on rendered output.



4 - Revise the F0 tags to allow for theory-neutral interpretation: if this
is not done,
the goal of interoperability across synthesis platforms cannot be achieved.

****I stand ready to receive a proposal on how best to go about this.




5 - Clarify what constitutes the minimum core standard, and what are
optional or future
extensions.

****Per W3C rules, the future study sections are due to be dropped as SSML
enters candidate phase.  What remains will constitute the core
specification.



6 - Provide a mechanism for extending the standard to include unforeseen
cases,
particularly language-specific or multilingual requirements.

****I stand ready to receive a proposal on how best to go about this.


<sayas> - Add the categories mentioned above, plus an optional "morph" value
to hold
agreement information.

****See my comments above on 'say-as'.


<voice> - Allow the option of retaining relative prosodic attributes (pitch,
rate, etc.)
when the voice is changed.

****See my comments above on the use of stylesheets to preserve prosody.



<break> - Add an optional "special" value to allow language-specific and
application-
specific extensions.

****See comments above.

<prosody> - add an optional "tone" value.

****Please provide more details on what, exactly, would be meant by <prosody
tone="???">, which is what I believe you are attempting to propose.

<audio> - Consider a <lowlevel> extension to allow synchronisation of speech
with other
resources.

****Superseded by SMIL.  See my remarks above.




-----Original Message-----
From: Alex.Monaghan@Aculab.com [mailto:Alex.Monaghan@Aculab.com]
Sent: Monday, January 08, 2001 7:42 AM
To: kuansanw@microsoft.com; mark.r.walker@intel.com
Cc: eric.keller@imm.unil.ch
Subject: RE: W3C speech synth mark-up - the COST258 response


dear mark and kuansan,

thanks for your emails.

i do think that there are several unresolved issues in the "last call"
document, and that these could make the proposal unimplementable on most
current synthesis platforms.

i also believe that the W3C standards have an obligation to be universal
with respect to the language to be synthesised and the theoretical model to
be used. unfortunately, the current proposal does not fulfil this
obligation. the reasons for this are understandable: it derives from SABLE
and JSML, which were quite narrow in their view of speech synthesis.

i would support the suggestion that the "last call" status be rescinded. if
there is a lack of enthusiasm for further work on this proposal (this has
been indicated by several people, and is borne out by the email archive),
then i can offer an injection of such enthusiasm. i don't fully understand
the workings of the W3C, but i am happy to get involved in any way i can.

if the "last call" status and deadlines remain, then i would suggest either:
1 - a "layered" standard such as the MPEG standard, where the level of
conformity is variable, OR
2 - an explicit mechanism for language-specific and theory-specific
variations, whereby standards could evolve for awkward cases (e.g. the
agreement of numbers in spanish, or the specification of declination in a
Fujisaki model of F0).

personally, i prefer (2) since it is less proscriptive and allows mark-up to
develop organically as needed: incorporation into the standard could come
later.

i look forward to further discussion of these points.

regards,
	alex.

Received on Friday, 19 January 2001 19:21:32 UTC