RE: W3C speech synth mark-up - the COST 258 response from Alex.Monaghan@Aculab.com on 2001-01-24 (www-voice@w3.org from January to March 2001)

From: <Alex.Monaghan@Aculab.com>
Date: Wed, 24 Jan 2001 21:22:37 -0000
To: mark.r.walker@intel.com
Cc: www-voice@w3.org
Message-ID: <0AEF0EB21F09D211AE4E0080C82733BF01799C66@mailhost.aculab.com>
	dear mark,
	thanks for your detailed comments. there are a few things i'd like
to respond to - i've put my responses IN CAPITALS below. i've tried to take
all the subsequent discussion into account.
	regards,
			alex.
	
----------------------------------------------------------------------------
-------------------------------------------

> A Response to the W3C Draft Proposal for a Speech Synthesis Mark-up
> Language
> 
> from COST 258, European Co-Operative Action on Improving the Naturalness
> of
> Speech Synthesis
> 
> (http://www.unil.ch/imm/docs/LAIP/COST_258/cost258.htm)
> 
> Editor: Alex Monaghan, Aculab plc, UK (Alex.Monaghan@aculab.com)
> 
> 
> Background
> 
> COST 258 is a consortium of European speech synthesis experts from 17
> countries.
> It is funded by the European Commission, and its aim is to promote
> co-operative research
> to improve the naturalness of synthetic speech. Its members come from both
> academic and
> industrial R&D centres, including at least five providers of comercial
> speech synthesis
> systems. For more information, see the website given above.
> 
> The W3C proposal was discussed at a meeting of COST 258 in September 2000.
> The present
> document collates the subsequent reactions and responses from members.
> It makes both general and specific points about the proposal, and suggests
> several
> modifications. While we are broadly sympathetic to, and supportive of, the
> attempt to
> standardise speech synthesis markup and to increase consistency across
> different
> synthesisers, we feel that there are many obstacles to such an attempt and
> that some of
> these obstacles are currently insurmountable.
> 
> 
> General Points
> 
> 1 - It is not clear who the intended users of this markup language are.
> There are two
> obvious types of possible users: speech synthesis system developers, and
> application
> developers. The former may well be concerned with low-level details of
> timing, pitch and
> pronunciation, and be able to specify these details (F0 targets, phonetic
> transcriptions, pause durations, etc.). The latter group are much more
> likely to be
> concerned with specifying higher-level notions such as levels of boundary,
> degrees of
> emphasis, fast vs slow speech rate, and formal vs casual pronunciation.
> 
> 
> ****The confusion regarding the intended users of SSML is on target.  The
> specification of support for markup at different levels of the synthesis
> process has resulted in a markup language that, as Richard Sproat points
> out, contains element groups addressed at both expert and non-expert
> users.
> This was not by design however, but was simply the result of the voice
> browser committee deliberations.  Having said that, we believe it is
> ultimately advantageous to have a range of elements that enable high, mid,
> and low-level control over synthesized output.
> 
> ****As I mentioned in my previous letter, I believe it is unlikely that
> the
> low-level elements will ever be employed directly by application
> developers,
> even those developers that possess synthesis expertise. Low-level element
> sequences are most likely to be the output generated by SSML authoring
> tools
> employed by both expert and non-expert content authors in the preparation
> of
> synthesize-able material.  It is not unreasonable to imagine that the tool
> users would use the high-level markup in the user interface, while
> 'compiling' the resulting text into XML pitch/phoneme sequences that
> closely
> represented the wishes of the author for the rendered output.  This is
> obviously highly speculative, since there are currently no tools of this
> sort currently available. I can report however, that this usage scenario
> has
> been discussed within Intel and in other venues with other companies
> outside
> of W3C, and has generally been met with approval.
> 
	I BELIEVE YOUR SUGGESTION OF SEPARATING HIGH-LEVEL AND LOW-LEVEL
ELEMENTS EXPLICITLY WILL ADDRESS THIS ISSUE AND MAKE SSML MUCH EASIER TO
USE. I HOPE THIW CAN INCLUDE THE POSSIBILITY OF A PARTICULAR SYSTEM BEING
"HIGH-LEVEL COMPLIANT" AND/OR "LOW-LEVEL COMPLIANT".

> 2 - It is clear that the proposal includes two, possibly three, different
> levels of
> markup. For F0, for instance, there is the <emphasis> tag (which would be
> realised as a
> pitch excursion in most systems), the <prosody contour> tag which allows
> finer control,
> and the low-level <pitch> tag which is a proposed extension. There is very
> little
> indication of best practice in the use of these different levels (e.g.
> which
> type of
> user should use which level), and no explanation of what should happen if
> the different
> levels are combined (e.g. a <pitch contour> specification inside an
> <emphasis>
> environment).
> 
> 
> ****Again, the current form of the specification was largely developed in
> a
> vacuum of information on potential usage models.  As the specification
> moves
> into recommendation form and is slowly adopted within the speech
> application
> community, it is anticipated that best practice and common usage
> information
> will emerge and will be in incorporated in future revisions of the
> specification.  It might be anticipated, based on my response to remark
> #1,
> that the inter-mxing of both high and low-level markup in a given
> application would be unlikely.  I might also remark that you have
> suggested
> that this would be difficult, but not impossible.  Does the possibility of
> mixing high and low elements really represent an **insurmoutable** barrier
> to supporting SSML?  Please provide more detail.
> 
	AS IT STANDS, IT WILL BE VERY DIFFICULT TO PREDICT THE EFFECT OF
MIXING HIGH-LEVEL AND LOW-LEVEL MARK-UP. THE EXPLICIT SEPARATION OF THESE
DIFFERENT LEVELS, AND PERHAPS A CAVEAT ADVISING USERS NOT TO MIX THE TWO,
WOULD SOLVE THIS PROBLEM.

> 3 - The notion of "non-markup behavior" is confusing. On the one hand,
> there
> seems to be
> an assumption that markup will not affect the behaviour of the system
> outside the tags,
> and that the markup therefore complements the system's unmarked
> performance,
> but on the
> other hand there are references to "over-riding" the system's default
> behaviour. In
> general, it is unclear whether markup is intende to be superimposed on the
> default
> behaviour or to provide information which modifies that behaviour.
> 
> 
> ****The effect of markup in supplanting the un-marked behavior of a given
> synthesizer obviously depends on the type and the particular markup level
> being specified.  In some cases, default behavior will be completely
> supplanted, and in other cases markup will result in the superimposition
> of
> relative effects.  The text-analysis guides like 'say-as' are relatively
> unambiguous in that they completely supplant the default synthesizer
> behavior.  Specifying 'emphasis' however, will likely result in relative
> changes in the pitch, duration, and etc of the utterance that will
> manifest
> perceptually as an imposition. 
> 
> 
> The use of the <break> element, for instance, is apparently intended "to
> override the
> typical automatic behavior", but the insertion of a <break> tag may have
> non-local
> repercussions which are very hard to predict. Take a system which assigns
> prosodic
> boundaries stochastically, and attempts to balance the number and length
> of
> units at
> each prosodic level. The "non-markup behavior" of such a system might take
> the input
> "Big fat cigars, lots of money." and produce two balanced units: but will
> the input
> "Big fat <break/> cigars, lots of money." produce three unbalanced units
> (big fat,
> cigars, lots of money), or three more balanced units (big fat, cigars
> lots,
> of money),
> or four balanced units (big fat, cigars, lots of, money), or six
> single-word
> units, or
> something else? Which would be the correct interpretation of the markup?
> 
> 
> ****In this instance, a synthesis text author would reasonably be expected
> to specify more precisely exactly what he/she intended for the resulting
> prosodic units by adding a 'size' or a 'time' attribute to the 'break'
> markup element.  Having said that, I must say I have no idea what the
> common
> default behavior for the above utterance would be, and for that reason it
> would definitely NOT be logical to add markup to alter it.  It seems
> reasonable that even a non-expert synthesis text author would utilize
> common
> sense and realize that specifying markup in sequences where default
> synthesizer behavior is highly variable would be a dicey proposition. 
> 
> ****However, in addressing your concern it seems highly likely that
> high-level elements like 'break' will result in highly variable output
> across different synthesizers, and that will seen be as normal and
> perfectly
> compliant behavior.  Systems like the one you cite above might elect to
> 'balance' the effects of the <break/> insertion over a context larger than
> the specified interval.  Others may elect to completely localize the
> effect,
> even to the detriment of the perceptual quality.  Both behaviors would be
> compliant, and both would essentially convey the coarsely specified intent
> of the markup text author.  I will place verbage within the spec that
> makes
> this point clear.
> 
	IF THE EFFECT OF HIGH-LEVEL TAGS SUCH AS <emphasis> IS TOO HIGHLY
VARIABLE, I IMAGINE THAT MUCH OF THE USEFULNESS OF SSML COULD BE LOST FOR
MANY USERS. ON THE OTHER HAND, I AGREE THAT SPECIFYING A UNIVERSALLY
APPROPRIATE REALISATION OF SUCH TAGS IS CURRENTLY IMPOSSIBLE - THIS IS ONE
REASON WHY NONE OF THE MARKUP SCHEMES PROPOSED PREVIOUSLY HAS BEEN WIDELY
ADOPTED. I STILL FEEL THERE IS A RISK THAT SSML WILL ONLY WORK WELL IF THE
ENTIRE INPUT IS EXTENSIVELY TAGGED (E.G. A <break> ELEMENT AFTER EVERY WORD)
TO ENSURE THAT THE TAGGING DOES NOT PRODUCE UNWANTED EFFECTS: THIS WOULD BE
A PITY.

> 4 - Many of the tags related to F0 presuppose that pitch is represented as
> a
> linear
> sequence of targets. This is the case for some synthesisers, particularly
> those using
> theories of intonation based on the work of Bruce, Ladd or Pierrehumbert.
> However, the
> equally well-known Fujisaki approach is also commonly used in synthesis
> systems, as are
> techniques involving the concetenation of natural or stylised F0 contours:
> in these
> approaches, notions such as pitch targets, baselines and ranges have very
> different
> meanings and in some cases no meaning at all. The current proposal is thus
> far from
> theory-neutral, and is not implementable in many current synthesisers.
> 
> 
> ****This is a significant issue and one that I was completely unaware of
> until you raised it. Obviously, the early days of the SSML requirements
> phase were dominated (apparently) by firms possessing synthesizers
> modeling
> intonation with the former approach.  I would welcome any proposal that
> expanded the ability of the low-level elements to specify intonation in a
> less theory-biased manner.  
> 
> ****In answering Richard Sproat's specific concern about long-unit
> synthesizers, I will propose that the decision by any synthesis engine
> provider to support the SSML specification is probably ultimately driven
> by
> economics, not technology.  Long-unit synthesizers like AT&T NextGen for
> example, are very large and are deployed in tightly confined application
> environments like voice portal providers.  The synthesis text authors are
> likely to be employed by the portal itself.  The text is authored
> specifically for the portal engine, and the authors are likely to be very
> familiar with the performance of the system.  Finally, the enormous size
> of
> the concatenative database means that much of the ability to produce very
> specific and expressive speech sequences already resides in the system.
> The
> economic benefits of implementing SSML are therefore probably minimal for
> engine providers of this type.
> 
> 
> 
> 
> 5 - The current draft does not make clear what will be in "the standard"
> and
> what will
> be optional or future extensions. The <lowlevel> tag is the most obvious
> example, but
> various other tags mentioned above are not universally implementable and
> would therefore
> prevent many systems from complying with the standard.
> 
> ****The 'future study' elements are generally denoted as such, but I will
> increase their visibility as non-spec items.  In any case, all future
> study
> elements are to be dropped before publication as a Candidate
> Recommendation.
> 
> ****Please specify what tags, in addition to F0 elements, would not be
> universally implementable.
> 
	THE DIVISION BETWEEN THE STANDARD AND FUTURE EXTENSIONS IS NOW
CLEAR. THE TAGS WHICH STILL CAUSE PROBLEMS ARE THE <break>, <emphasis> AND
VARIOUS F0 TAGS. I WILL DISCUSS THESE BELOW.

> 6 - There is no provision for local or language-specific additions, such
> as
> different
> classes of abbreviations (e.g. the distinction between a true acronym such
> as DEC and an
> abbreviation such as NEC), different types of numbers (animate versus
> inanimate in many
> languages), or the prosodic systems of tone languages. Some specific
> examples are
> discussed below, but provision for anything other than English is minimal
> in
> the current
> proposal.
> 
> ****The 'acronym' attribute of 'say-as' will be modified to allow the
> explicit specification of true acronyms and abbreviations. See below.  The
> notion of language-specific additions is intriguing and I would welcome a
> proposal for a general markup method for doing this.  I would also welcome
> a
> concrete proposal to deal with tone languages.
> 
	MOST OF THE THINGS I HAD IN MIND CAN BE HANDLED BY EXTENSIONS TO THE
<say-as> TAG, SUCH AS THOSE PROPOSED BY RICHARD SPROAT. THEY COULD ALSO BE
HANDLED BY <say-as sub>, BUT THIS IS NOT PARTICULARLY ELEGANT.

	I DON'T HAVE A CLEAR PROPOSAL FOR ALLOWING LANGUAGE-SPECIFIC
ADDITIONS, EXCEPT TO SUGGEST THAT THE SET OF ELEMENTS AND ATTRIBUTES SHOULD
NOT BE CLOSED AT THIS POINT. I IMAGINE THAT IF SSML IS WIDELY USED IT WILL
EVOLVE - I'D BE INTERESTED TO KNOW IF THERE IS ANY PLAN TO MANAGE SUCH
EVOLUTION. SOME SPECIFIC POINTS REGARDING LANGUAGE-SPECIFIC ADDITIONS ARE
MADE BELOW.

	ON TONE LANGUAGES, I THINK RICHARD'S SUGGESTION OF ALLOWING A
PARTIAL PHONETIC TRANSCRIPTION IS EXTREMELY USEFUL (<phoneme ph="1-2"
type="tone">) - IT COULD BE LEFT TO INDIVIDUAL SYSTEM DEVELOPERS OR LANGUAGE
COMMUNITIES TO DETERMINE WHICH OF THE AVAILABLE IPA SYMBOLS THEY WISHED TO
ADOPT, AND WHETHER SINGLE-INTEGER OR MULTI-INTEGER DESCRIPTIONS WERE
ALLOWED.

> Specific Tags
> 
> <sayas> - Several categories could be added to this tag, including
> telephone
> numbers,
> credit card numbers, and the distinction between acronyms (DEC, DARPA,
> NASA)
> and
> letter-by-letter abbreviations (USA, IBM, UK).
> 
> 
> ****'telephone' is a 'say-as' attribute value.  I'm unable to envision an
> example where <say-as type="number:digits"> would not suffice for credit
> card numbers.  
> 
	THE IDEA HERE WAS THAT CREDIT CARD (AND OTHER) NUMBERS ARE USUALLY
READ IN GROUPS. THE ABILITY TO SPECIFY THE GROUPINGS, USING WHITE SPACE,
WOULD PROBABLY SUFFICE. THE OTHER ISSUES HAVE BEEN ADDRESSED.

> ****I honestly don't know how 'acronym' got so mangled, but somehow it
> did.
> I think that at some point, it was pointed out that 'acronym' would be the
> expected default behavior of a synthesizer encountering something like
> 'DEC'
> - that is, the synthesizer would, in the absense of markup, attempt to
> prounce it as a word.  That's why 'literal' was kept as the overriding
> effect of markup - but the attribute value ended up being named 'acronym'
> instead of 'literal' or 'spell-out'!!!!!
> 
> ****Accordingly, as Richard Sproat has proposed, 
> <say-as type="spell-out"> USA </say-as> 
> and 
> <say-as type="acronym"> DEC </say-as> 
> will replace the current 'acronym' attribute value.
> 
> 
> In languages with well-developed morphology, such as Finnish or Spanish,
> the
> pronunciation of numbers and abbreviations depends not only on whether
> they
> are ordinal
> or cardinal but also on their gender, case and even semantic properties.
> These are often
> not explicit, or even predictable, from the text. It would be advisable to
> extend the
> <sayas> tag to include an optional "morph" value to hold such information.
> 
> 
> ****I am open to proposals in this area, but there would need to be
> substantially more data on it's potential usefulness.  I agree with
> Richard
> Sproat that successful utilization of such a tag might require linguistic
> expertise that would not likely be possessed by portal and web text
> authors
> who I believe constitute the majority of potential users of this
> specification.  I would also wonder why markup would be required to
> resolve
> ambiguity in the specification a property that would likely be already
> embodied as a part of the default knowledge base of a Finnish, Spanish,
> etc
> synthesizer.
> 
	THIS COULD BE A LANGUAGE-SPECIFIC EXTENSION, BUT I TAKE RICHARD'S
POINT THAT THE AVERAGE USER IS NOT ABLE TO SPECIFY, SAY A GROUP-3 ABLATIVE
PLURAL IN SWAHILI. THE <say-as sub> OPTION SEEMS TO BE THE BEST ONE FOR NOW,
UNLESS RICHARD'S PROPOSAL IS ACCEPTED.

> <voice> - It seems unnecessary to reset all prosodic aspects to their
> defaults when the
> voice changes. This prevents the natural-sounding incorporation of direct
> speech using a
> different voice, and also makes the reading of bilingual texts (common in
> Switzerland,
> Eastern Europe, the Southern USA, and other exotic places) very awkward.
> ALthough
> absolute values cannot be carried over from voice to voice, it should be
> possible to
> tranfer relative values (slow/fast, high/medium/low, etc.) quite easily.
> 
> 
> ****Usage note #3 suggests a simple method, using a style sheet, to
> preserve
> prosodic properties across voice changes, both fundamental and relative.
> My
> own experience with XML confirms that is quite straightforward.  In
> generating the specification we had to chosse a default behavior, and
> 'reset' was selected as the most commonly desired result of an imbedded
> voice change.  I will take a note to generate an example style sheet that
> demonstrates how prosody may be preserved and include it in the next rev.
> 
	THANK YOU.

> <break> - Some languages have a need for more levels of prosodic boundary
> below a minor
> pause, and some applications may require boundaries above the paragraph
> level. It would
> be advisable to add an optional "special" value for these cases.
> 
> ****I'm not sure if I fully understand the issue, but I get the (weak)
> impression that the problem may lie in the specification of the prescribed
> effect on perceptual quality when a <break/> element is inserted to
> override
> default behavior in an utterance where default synthesizer behavior might
> be
> expected to vary significantly across different synthesizer engines.  If
> this is the case, then please reference my remarks above.  If not, please
> provide more detail and examples.
> 
> ****In any case, the 'special' attribute is a rather vacuous term for the
> complex cases you cite above.  Please provide more examples.  Perhaps a
> suite of new attributes for these cases would be appropriate.
> 
	THE PROBLEM IS SIMPLY THAT 4 OR 5 LEVELS OF BREAK ARE NOT ENOUGH FOR
ALL LANGUAGES AND ALL APPLICATIONS. IT WOULD BE NICE IF THIS WERE MORE
OPEN-ENDED, TO ALLOW FOR LOWER AND HIGHER LEVELS OF BREAKS.

> <prosody> - There is currently no provision for languages with lexical
> tone.
> These
> include many commercially important languages (e.g. Chinese, Swedish,
> Norwegian), as
> well as most of the other languages of the world.
> 
> ****You are correct that tonal languages are not specifically supported,
> although there were informal proposals to do so.  I re-iterate my
> willingness to seriously review any formal proposal to exapnd SSML to
> tonal
> languages.
> 
> 
> 
> <rate> - "Words per minute" values are not reliably implementable in any
> current
> synthesiser, although they may be a readily understandable measure of
> approximate speech
> rate. It is perhaps equally important to be able to specify the dynamics
> of
> speech rate
> - accelerations, decelerations, constancies.
> 
> ****I think it is logical to assume that the 'rate' attribute of the
> 'prosody' element would only be employed by a markup-wielding author to
> alter speaking rate over long sequences.  I think it is also logical to
> assume that, even though a given synthesizer may not possess an
> algorithmic
> representation of WPM, it would have some intrinsic notion of speech rate
> that could be modified to render the intent of the non-expert markup
> author.
> And not to sound like a broken record, but I would welcome a concrete
> proposal for adding accelerations, etc to the 'prosody' element.
> 
	WE WILL THINK ABOUT THIS, BUT NOT BEFORE THE END OF THE MONTH!


> <audio> - Multimodal systems (e.g. animations) are likely to require
> precise
> synchronisation of audio, images and other resources. This may be beyond
> the
> scope of
> the proposed standard, but could be included in the <lowlevel> tag.
> 
> ****Multimodal synchronization is definitely desireable but it is indeed
> beyond the scope of SSML.  That was the original intent of the 'mark'
> element, which I am now recommending be removed.  The reason is that this
> capability has been superseded for the rapidly evolving SMIL markup
> standard. Aron Cohen (head of SMIL and an Intel colleague) informs me that
> SSML elements imbedded in a SMIL sequence specification could form the
> basis
> of animation/synthesis application.  I will continue to monitor SMIL with
> an
> eye to eventually producing an example fragment demonstrating this
> capability.
> 
	FAIR ENOUGH.

> Suggested Modifications
> 
> 
> 1 - Distinguish clearly between tags intended for speech synthesis
> developers and tags
> intended for application designers. Perhaps two separate markup languages
> (high-level
> and low-level) should be specified.
> 
> ****I support this proposal and will work with COST 258 to further refine
> it.
> 
	I LOOK FORWARD TO THIS.

> 2 - Clarify the intended resolution of conflicts between high-level and
> low-level
> markup, or explain the dangers of using both types in the same document.
> 
> ****I think my remarks on the typical user profile above partially
> addressed
> this.  I also think partitioning the spec as suggested in #1 may obviate
> this need.  It might also be noted that in the absence of real data on
> common useage, it may be a little presumptious to assume that mixing high
> and low-level elements within the same document is impossible or even
> unadviseable.  I stand by my contention however, that is probably unlikely
> to occur.
> 
	SOME CAVEAT WOULD BE USEFUL.

> 3 - Clarify the intended effect of tags on the default behaviour of
> synthesis systems.
> Should they be processed BEFORE the system performs its "non-markup
> behavior", or AFTER
> the default output has been calculated? Does this vary depending on the
> tag?
> 
> ****I think this is absolutely unspecifiable and moreover largely misses
> the
> point of SSML.  Any synthesizer engine developer motivated to support SSML
> will certainly implement each markup element in a way appropriate for that
> synthesizer.  Which of the above routes selected will depend solely on how
> the specified markup effect is most efficiently executed.  Given that, the
> thing we should focus on is whether each of the element descriptions is
> sufficiently detailed to allow a knowledgeable engine developer to fully
> discern the intended effect on rendered output.
> 
	THE CAVEAT ABOVE SHOULD HELP HERE TOO.

> 4 - Revise the F0 tags to allow for theory-neutral interpretation: if this
> is bnot done,
> the goal of interoperability across synthesis platforms cannot be
> achieved.
> 
> ****I stand ready to receive a proposal on how best to go about this.
> 
	AGAIN, WE WILL THINK ABOUT THIS AFTER JANUARY.

> 5 - Clarify what constitutes the minimum core standard, and what are
> optional or future
> extensions.
> 
> ****Per W3C rules, the future study sections are due to be dropped as SSML
> enters candidate phase.  What remains will constitute the core
> specification.
> 
	PERFECT.

> 6 - Provide a mechanism for extending the standard to include unforeseen
> cases,
> particularly language-specific or multilingual requirements.
> 
> ****I stand ready to receive a proposal on how best to go about this.
> 
	SEE ABOVE.

> <sayas> - Add the categories mentioned above, plus an optional "morph"
> value
> to hold
> agreement information.
> 
> ****See my comments above on 'say-as'
> 
	SEE ABOVE.

> <voice> - Allow the option of retaining relative prosodic attributes
> (pitch,
> rate, etc.)
> when the voice is changed.
> 
> ****See my comments above on the use of stylesheets to preserve prosody.
> 
	PERFECT.

> <break> - Add an optional "special" value to allow language-specific and
> application-
> specific extensions.
> 
> ****See comments above.
> 
	SEE ABOVE.

> <prosody> - add an optional "tone" value.
> 
> ****Please provide more details on what, exactly would be meant by
> <prosody
> tone="???">, which is what I believe your are attempting to propose.
> 
	SEE ABOVE.

> <audio> - Consider a <lowlevel> extension to allow synchronisation of
> speech
> with other
> resources.
> 
> ****Superseded by SMIL.  See my remarks above
> 
	PERFECT.


> -----Original Message-----
> From: Alex.Monaghan@Aculab.com [mailto:Alex.Monaghan@Aculab.com]
> Sent: Monday, January 08, 2001 7:42 AM
> To: kuansanw@microsoft.com; mark.r.walker@intel.com
> Cc: eric.keller@imm.unil.ch
> Subject: RE: W3C speech synth mark-up - the COST258 response
> 
> 
> > dear mark and kuansan,
> > 
> thanks for your emails.
> 
> i do think that there are several unresolved issues in the "last call"
> document, and that these could make the proposal unimplementable on most
> current synthesis platforms.
> 
> i also believe that the W3C standards have an obligation to be universal
> with respect to the language to be synthesised and the theoretical model
> to
> be used. unfortunately, the current proposal does not fulfil this
> obligation. the reasons for this are understandable: it derives from SABLE
> and JSML, which were quite narrow in their view of speech synthesis.
> 
> i would support the suggestion that the "last call" status be recinded. if
> there is a lack of enthusiasm for further work on this proposal (this has
> been indicated by several people, and is borne out by the email archive),
> then i can offer an injection of such enthusiasm. i don't fully understand
> the workings of the W3C, but i am happy to get involved in any way i can.
> 
> if the "last call" status and deadlines remain, then i would suggest
> either:
> 1 - a "layered" standard such as the MPEG standard, where the level of
> conformity is variable, OR
> 2 - an explicit mechanism for language-specific and theory-specific
> variations, whereby standards could evolve for awkward cases (e.g. the
> agreement of numbers in spanish, or the specification of declination in a
> Fujisaki model of F0).
> 
> personally, i prefer (2) since it is less proscriptive and alows mark-up
> to
> develop organically as needed: incorporation into the standard could come
> later.
> 
> i look forward to further discussion of these points.
> 
> regards,
> 	alex.
Received on Wednesday, 24 January 2001 16:22:51 UTC