RE: Consolidated comments on SSML from Daniel Burnett on 2003-08-12 (www-voice@w3.org from July to September 2003)

From: Daniel Burnett <burnett@nuance.com>
Date: Tue, 12 Aug 2003 11:20:20 -0700
To: <w3c-i18n-ig@w3.org>
Cc: <www-voice@w3.org>
Message-ID: <ED834EE1FDD6C3468AB0F5569206E6E91AF317@MPB1EXCH02.nuance.com>

Dear Martin (and the Internationalization Working Group),

We would like to thank you again for your very thorough review
of the SSML specification.

This email contains almost all of the remaining
responses to your original comments, as well as replies to your
review of items to which we responded earlier.

We have yet to send you initial responses on 28, 39, and 44.
We have yet to respond to your replies on 1 and 6.
We are still awaiting your reply to 54.

If you believe we have not adequately addressed your issues with our
responses, please let us know as soon as possible.  If we do not hear
from you within 14 days, we will take this as tacit acceptance.  If you
believe you will be unable to complete your review within this time
frame, we would appreciate receiving, within the next two weeks, an
estimate of when you might be able to complete it.

For any "second round" responses (where we are responding to your replies
to our earlier responses) with which you disagree, we would rather have
you just note them as such so that we can schedule a live (telecon)
discussion with you to get them resolved more quickly.

Once again, thank you for your thorough and considered input on
the specification.

-- Dan Burnett

Synthesis Team Leader, VBWG

[VBWG responses follow]

[2] We would like to discuss this one with you live. In particular,
 we would like to understand why you believe the xml:lang attribute
 provides insufficient information to address this.

[4] Thank you for your suggestion. However, we disagree with the
 addition of "with voice browsers" since it is too limiting to
 restrict the use of the Voice Browser Working Group's specifications
 only to voice browsers. There are already use cases for the SRGS
 and SSML specifications outside of VoiceXML (for example, handwriting
 recognition for SRGS and MRCP/speechsc for both). We currently intend
 to keep the first sentence as it is.

[8] Accepted.

[10] Accepted.  Thank you for your suggestions. We will apply some
 of them to this section.

[11] Accepted with changes.  Instead of this change, we will add
 "(which may be language dependent)" after the word "dictionary".

[12] Although it is clear that you wish to see examples in the
 specification of pronunciations that might not be clear to
 speakers of American English, it is unclear why you believe
 they should be included in section 1.2 which describes the
 process of synthesis itself. Can you please provide a proposal
 with detailed text changes? We would prefer that you include
 pronunciations as well (in IPA, as suggested by the specification)
 since you appear to have specific pronunciations in mind.

[20] Rejected.  
 As to why we require the <voice> element: Changing the language
 has strong implications for output voice change in SSML. We
 found in the end that because of the text normalization, prosody,
 etc. changes upon changing language in SSML we wanted clear
 author awareness that changing xml:lang was likely to change
 the voice and other speaking characteristics. We permitted
 xml:lang on the <p> and <s> elements only because those are
 the places where changes in such characteristics are both most
 common and least disruptive.

 Regarding the <desc> element and the use cases you've presented:
 The <desc> element is for a description, not for transcription.
 For your JFK example, this element would contain "President
 Kennedy speaking in Berlin" or "President Kennedy's famous
 German language gaffe", etc. depending on the purpose of the
 audio. The audio might even be music, for example!

 For your language teaching, etc. examples, such alternates should
 go in the content of the <audio> element itself.

[22] Accepted with changes.  The document is already in UTF-8,
 the default for both XML documents and W3C specifications. We
 will leave the Italian example in Latin-1. For everything else
 we will explicitly set the encoding to UTF-8. In the <phoneme>
 example, we will include the IPA characters in a comment so
 browsers that can display them will. Because the UTF-8
 representation of these symbols is multi-character, they're
 hard to modify, cut and paste, etc. For that reason we'll leave
 the entity escape versions in the code itself. We will also
 comment that one would normally use the UTF-8 representation
 of these symbols and explain why we put them in a comment.
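
 As a rough illustration of the point about entity escapes (a sketch
 only, not text from the specification): an XML parser resolves a
 character reference to the same character as the literal UTF-8 form,
 and that character occupies more than one byte in UTF-8. The "ph"
 value and element content below are hypothetical, chosen only to show
 the mechanics with Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical <phoneme> fragments; the "ph" value is illustrative only.
escaped = '<phoneme alphabet="ipa" ph="t&#x259;">tuh</phoneme>'
literal = '<phoneme alphabet="ipa" ph="t\u0259">tuh</phoneme>'

ph_escaped = ET.fromstring(escaped).get("ph")
ph_literal = ET.fromstring(literal).get("ph")

# The parser resolves the character reference, so both forms yield the same value.
same_value = ph_escaped == ph_literal

# U+0259 (schwa) takes two bytes in UTF-8, so this two-character value is three bytes.
utf8_length = len(ph_literal.encode("utf-8"))
```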

[23]  Accepted.  Thanks for the Japanese text. We will
 incorporate it into the example.

[42] Accepted.  Any arbitrary string value is permitted. The only
 one with a predefined meaning is "ipa". Others are vendor-specific
 and depend upon the underlying pronunciation model set used by
 the vendor. There are quality implications to requiring only IPA,
 so we permit other alphabets.
 We will add text clarifying this behavior.
 Also, the Working Group is considering the development of a
 standardized lexicon format that might address the issue of
 quality with portability.

[43] Rejected.  The term "alphabet" is commonly used in this
 context within the Speech Recognition/Synthesis community.
 We do not believe a change is appropriate.

[45] Accepted with changes.  We will change the example as described
 in the response to point 22.

[46] Rejected.  Ultimately this problem can be considered to be
 labelling an arbitrary chunk of SSML for uses other than audio
 speech production. We do consider this functionality to be useful
 enough to have in the specification in some form today (as <sub>
 or something else). The current approach meets the accessibility
 needs but does not permit markup of the spoken form. Since these
 elements are primarily intended to be used only for short phrases
 (such as W3C, Mr. Smyth, etc.), we have not in practice encountered
 any significant limitations in our use of the existing elements.
 Changing the current elements would likely result in other changes
 throughout the specification, something we are loath to do in this
 version of the specification without a stronger demonstration of
 practical need. We will revisit the topic of how best to achieve
 this labelling functionality in the next version of SSML.

[47] Rejected.  You have discussed SSML as if it is a module for XHTML,
 but it isn't. Arbitrary embeddings of other markup languages are
 ignored (see answer to item 76). As such, Ruby is essentially not
 permitted in SSML. Since Ruby is for visual rendering, it might make
 more sense to pre-process your XHTML document to generate valid SSML.
 We intend to consider modularity in future versions of SSML and would
 of course welcome your input.

[48] Rejected.  We have not seen enough general interest to warrant
 this change.

[51] The last sentence in this paragraph describes what happens: 

  It is an error if the processor decides it does not have a voice
  that sufficiently matches the above criteria.

 where error is defined in section 1.5.
 The short answer is that it is up to the processor to decide whether
 or not it has a realistic way of rendering the Japanese text. For
 example, it may be appropriate for the application to attempt to
 pronounce Japanese with an English accent by using its best mapping
 to English phonemes (similar to an example you gave in point 52).
 We will change the words "same language but different region" to be
 "a variant or dialect of the same language".

[52] Accepted.  We will describe this situation in the document and
 provide an example.

[58] We will correct the typo. Can you further explain your
 concern with the line break? We do not understand the problem.

[59] Accepted with changes.  We like the existing text and will keep
 it. However, we will also add (upfront) a description based on
 priorities as you suggest.

[60] Accepted.  Inserting any element adds a lexical boundary,
 so while it is acceptable to insert a break in the middle of a
 word or phrase, this will create a new lexical boundary,
 effectively splitting the one word or phrase into two. We will
 clarify the relationship between words and tokens in the
 Introduction and that breaking one token into multiple tokens
 will likely affect how the processor treats it. A simple English
 example is "cup<break/>board"; the processor will treat this as
 the two words "cup" and "board" rather than as one word with a
 pause in the middle.
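
 The effect can be sketched with a generic XML parser. This is a
 simplified model that assumes every element boundary ends a run of
 text; it is not how any particular synthesis processor tokenizes, and
 the helper name text_chunks is hypothetical.

```python
import xml.etree.ElementTree as ET

def text_chunks(elem):
    """Return the runs of character data under elem, in document order,
    split at every element boundary."""
    chunks = []
    if elem.text and elem.text.strip():
        chunks.append(elem.text.strip())
    for child in elem:
        chunks.extend(text_chunks(child))
        # Text after a child's end tag is stored on the child as its "tail".
        if child.tail and child.tail.strip():
            chunks.append(child.tail.strip())
    return chunks

whole = text_chunks(ET.fromstring("<s>cupboard</s>"))
split = text_chunks(ET.fromstring("<s>cup<break/>board</s>"))
```

 Under this model the first fragment yields one chunk ("cupboard") and
 the second yields two ("cup" and "board").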

[66] Accepted.
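
 For illustration only, the whitespace rule proposed in comment 66
 (whitespace permitted between ')' and '(' but nowhere else) could be
 checked along the following lines. The pattern is an assumption made
 for this sketch: it treats each pair as two comma-separated values
 without validating their units, and parse_contour is a hypothetical
 name.

```python
import re

# One "(position,target)" pair with no whitespace inside the parentheses.
PAIR = r"\(([^\s(),]+),([^\s(),]+)\)"
# A contour is one or more pairs, optionally separated by whitespace.
CONTOUR = re.compile(r"{0}(?:\s*{0})*".format(PAIR))

def parse_contour(value):
    """Return the (position, target) pairs of a contour string, or raise
    ValueError if whitespace appears anywhere except between pairs."""
    if not CONTOUR.fullmatch(value):
        raise ValueError("malformed contour: %r" % value)
    return re.findall(PAIR, value)

pairs = parse_contour("(0%,+20)\n  (10%,+30%) (40%,+10)")
```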

[68] Accepted with changes.  Because of the difficulty in accurately
 defining the meaning of words per minute, syllables per minute, or
 phonemes per minute across all possible languages, we have decided
 to replace such specification with a number that acts as a multiplier
 of the default rate. For example, a value of 1 means a speaking rate
 equal to the default rate, a value of 2 means a speaking rate twice
 the default rate, and a value of 0.5 means a speaking rate of half
 the default rate. The default rate is processor-specific and will
 usually vary across both languages and voices. Percentage changes
 relative to the current rate are still permitted. Note that the effect
 of setting a specific words per minute rate (for languages for which
 that makes sense) can be achieved by explicitly setting the duration
 for the contained text via the duration attribute of the <prosody>
 element. The duration attribute can be used in this way for all
 languages and is therefore the preferred way of precisely controlling
 the rate of speech when that is desired.
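
 The arithmetic behind both options can be sketched as follows. This
 is a minimal illustration under stated assumptions: words are counted
 by splitting on whitespace (which only suits languages that delimit
 words with spaces), the duration is written in milliseconds, and the
 function names are hypothetical.

```python
from xml.sax.saxutils import escape

def scaled_rate(default_rate, multiplier):
    """Speaking rate obtained by applying a multiplier to the processor default."""
    return default_rate * multiplier

def duration_ms(word_count, words_per_minute):
    """Milliseconds needed to speak word_count words at the target rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return round(word_count * 60000 / words_per_minute)

def prosody_for_wpm(text, words_per_minute):
    """Wrap text in a <prosody> element whose duration yields the target rate."""
    # Assumption: whitespace-separated chunks stand in for "words".
    words = len(text.split())
    return '<prosody duration="{}ms">{}</prosody>'.format(
        duration_ms(words, words_per_minute), escape(text))

markup = prosody_for_wpm("one two three four five six", 120)
```

 Here six words at 120 words per minute come to three seconds, so the
 generated element carries duration="3000ms".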

[69] Maximum volume does not equal shouting. Shouting is actually
 a factor of several prosodic changes, only one of which is volume.
 Our internal poll determined that maximum volume was the default
 for most synthesis processors. The assumption is that you can a)
 reduce the volume within SSML and b) set the final true volume to
 anything you want through whatever general audio controls your
 audio system (PC volume control, speaker knob) has available.

[72] Accepted.  This is a general problem that applies to all of
 the specifications from the Voice Browser Working Group. We will
 address it in a consistent manner across all of our specifications.

[74] It is occasionally the case that no media type is available;
 some examples are an HTTP request that does not return a media
 type and a local file access. The "type" attribute can be used
 in this case to indicate the type of the document. Also, some
 schemes provide for content negotiation when multiple valid
 documents (in different formats) are available, in which case
 the "type" attribute functions as a preferred type indicator.

[76] Accepted.  Good point about xml:lang. We will modify the
 text to indicate that everything in our schema (including
 xml:lang, xml:base, etc.) is to be retained in the conversion
 and that all other non-synthesis namespace elements and their
 contents should be removed.
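
 A minimal sketch of that conversion rule, assuming the synthesis
 namespace is http://www.w3.org/2001/10/synthesis and using Python's
 standard xml.etree.ElementTree module. The function name
 strip_foreign is hypothetical. Attributes such as xml:lang are left
 untouched; text that follows a removed element is kept, on the
 assumption that it belongs to the enclosing synthesis element rather
 than to the removed one.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # assumed namespace URI

def strip_foreign(elem):
    """Remove, in place, child elements outside the SSML namespace
    together with their contents."""
    previous = None
    for child in list(elem):
        if child.tag.startswith("{" + SSML_NS + "}"):
            strip_foreign(child)
            previous = child
        else:
            # Re-attach the text that followed the removed element.
            tail = child.tail or ""
            if previous is None:
                elem.text = (elem.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            elem.remove(child)

doc = ('<speak xmlns="http://www.w3.org/2001/10/synthesis" '
       'xmlns:h="http://www.w3.org/1999/xhtml" version="1.0" xml:lang="en-US">'
       'Hello <h:b>bold</h:b>world <p xml:lang="ja">x</p></speak>')
root = ET.fromstring(doc)
strip_foreign(root)
```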

[80] Accepted.

[82] Accepted.  Good suggestion. This is a general problem that
 applies to all of the specifications from the Voice Browser
 Working Group. We will address it in a consistent manner across
 all of our specifications by providing the most appropriate and
 relevant references at the time of publication.

[83] See the <phoneme> element (section 2.1.5). In short,
 it is a symbol set for representing the phonemic or phonetic
 units of a human language.

[85] No action requested for this document.

[87] Accepted.  As with item 86, this is old text. We will correct this.

[89] Rejected.  This is a new request well outside the timeframe for
 comments on this specification. We agree with the principle and will
 happily consider this request for a future version of SSML beyond 1.0.

-----Original Message-----
From: Martin Duerst [mailto:duerst@w3.org]
Sent: Friday, January 31, 2003 7:50 PM
To: www-voice@w3.org
Cc: w3c-i18n-ig@w3.org
Subject: Consolidated comments on SSML



Dear Voice Browser WG,

These are the Last Call comments on Speech Synthesis
Markup Language (http://www.w3.org/TR/speech-synthesis/)
from the Core Task Force of the Internationalization (I18N) WG.
Please make sure that you send all emails regarding these
comments to w3c-i18n-ig@w3.org, rather than to me personally
or just to www-voice@w3.org (to which we are not subscribed).

These comments are based on review by Richard Ishida and myself and
have been discussed and approved at the last I18N Core TF teleconference.
They are ordered by section and numbered for easy reference.
We have not classified these issues into editorial and substantial,
but we think that it should be clear from their description.

General:
[01]  For some languages, text-to-speech conversion is more difficult
       than for others. In particular, Arabic and Hebrew are usually
       written with none or only a few vowels indicated. Japanese
       often needs separate indications for pronunciation.
       It was not clear to us whether such cases were considered,
       and if they had been considered, what the appropriate
       solution was.
       SSML should be clear about how it is expected to handle these
       cases, and give examples. Potential solutions we came up with:
       a) require/recommend that text in SSML is written in an
       easily 'speakable' form (i.e. vowelized for Arabic/Hebrew,
       or with Kana (phonetic alphabet(s)) for Japanese). (Problem:
       displaying the text visually would not be satisfactory in this
       case); b) using <sub>; c) using <phoneme> (Problem: only
       having IPA available would be too tedious on authors);
       d) reusing some otherwise defined markup for this purpose
       (e.g. <ruby> from http://www.w3.org/TR/ruby/ for Japanese);
       e) creating some additional markup in SSML.

General: Tagging for bidirectional rendering is not needed
[02]  for text-to-speech conversion. But there is some provision
       for SSML content to be displayed visually (to cover WAI
       needs). This will not work without adequate support of bidi
       needs, with appropriate markup and/or hooks for styling.

General: Is there a tag that allows changing the language in
[03]  the middle of a sentence (such as <html:span>)? If not,
       why not? This functionality needs to be provided.


Abstract: 'is part of this set of new markup specifications': Which set?
[04]

Intro: 'The W3C Standard' -> 'This W3C Specification'
[05]

Intro: Please briefly describe the intended uses of SSML here,
[06]   rather than having the reader wait for Section 4.


Section 1, para 2: Please briefly describe how SSML and Sable are
[07]  related or different.


1.1, table: 'formatted text' -> 'marked-up text'
[08]

1.1, last bullet: add a comma before 'and' to make
[09]  the sentence more readable


1.2, bullet 4, para 1: It might be nice to contrast the 45 phonemes
[10] in English with some other language. This is just one case that
      shows that there are many opportunities for more internationally
      varied examples. Please take any such opportunities.

1.2, bullet 4, para 3: "pronunciation dictionary" ->
[11] "language-specific pronunciation dictionary"

1.2:  How is "Tlalpachicatl" pronounced? Other examples may be
[12]  St.John-Smyth (sinjen-smaithe) or Caius College
       (keys college), or President Tito (sutto) [president of the
       republic of Kiribati (kiribass)]


1.1 and 1.5: Having a 'vocabulary' table in 1.1 and then a
[13] terminology section is somewhat confusing.
      Make 1.1 e.g. more text-only, with a reference to 1.5,
      and have all terms listed in 1.5.

1.5: The definition of anyURI in XML Schema is considerably wider
[14] than RFC 2396/2732, in that anyURI allows non-ASCII characters.
      For internationalization, this is very important. The text
      must be changed to not give the wrong impression.

1.5 (and 2.1.2): This (in particular 'following the
[15]  XML specification') gives the wrong impression of where/how
      xml:lang is defined. xml:lang is *defined* in the XML spec,
      and *used* in SSML. Descriptions such as 'a language code is
      required by RFC 3066' are confusing. What kind of language code?
      Also, XML may be updated in the future to a new version of RFC
      3066, SSML should not restrict itself to RFC 3066
      (similar to the recent update from RFC 1766 to RFC 3066).
      Please check the latest text in the XML errata for this.


2., intro: xml:lang is an attribute, not an element.
[16]

2.1.1, para 1: Given the importance of knowing the language for
[17] speech synthesis, the xml:lang should be mandatory on the root
      speak element. If not, there should be a strong injunction to use it.

2.1.1: 'The version number for this specification is 1.0.': please
[18] say that this is what has to go into the value of the 'version'
      attribute.


2.1.2., for the first paragraph, reword: 'To indicate the natural
[19] language of an element and its attributes and subelements,
      SSML uses xml:lang as defined in XML 1.0.'

The following elements also should allow xml:lang:
[20] - <prosody> (language change may coincide with prosody change)
      - <audio> (audio may be used for foreign-language pieces)
      - <desc> (textual description may be different from audio,
           e.g. <desc xml:lang='en'>Song in Japanese</desc>
      - <say-as> (specific construct may be in different language)
      - <sub>
      - <phoneme>

2.1.2: 'text normalization' (also in 2.1.6): What does this mean?
[21] It needs to be clearly specified/explained, otherwise there may
      be confusion with things such as NFC (see Character Model).

2.1.2, example 1: Overall, it may be better to use utf-8 rather than
[22] iso-8859-1 for the specification and the examples.

2.1.2, example 1: To make the example more realistic, in the paragraph
[23] that uses lang="ja" you should have Japanese text - not an English
      transcription, which may not be usable as such on a Japanese text-to-speech
      processor. In order to make sure the example can be viewed even
      in situations where there are no Japanese fonts available, and
      can be understood by everybody, some explanatory text can provide
      the romanized form. (We can help with Japanese if necessary.)

2.1.2, 1st para after 1st example: Editorial.  We prefer "In the
[24] case that a document requires speech output in a language not
      supported by the processor, the speech processor largely determines
      the behavior."

2.1.2, 2nd para after 1st example: "There may be variation..."
[25] Is the 'may' a keyword as in RFC 2119? I.e., are you allowing
      conformant processors to vary in the implementation of xml:lang?
      If yes, what variations exactly would be allowed?


2.1.3: 'A paragraph element represents the paragraph structure'
[26] -> 'A paragraph element represents a paragraph'. (same for sentence)
      Please decide to either use <p> or <paragraph>, but not both
      (and same for sentence).


2.1.4: <say-as>: For interoperability, defining attributes
[27] and giving (convincingly useful) values for these attributes
      but saying that these will be specified in a separate document
      is very dangerous. Either remove all the details (and then
      maybe also the <say-as> element itself), or say that the
      values given here are defined here, but that future versions
      of this spec or separate specs may extend the list of values.
      [Please note that this is only about the attribute values,
       not the actual behavior, which is highly language-dependent
       and probably does not need to be specified in every detail.]

2.1.4, interpret-as and format, 6th paragraph: requirement that
[28] text processor has to render text in addition to the indicated
      content type is a recipe for bugwards compatibility (which
      should be avoided).

2.1.4, 'locale': change to 'language'.
[29]

2.1.4: How is format='telephone' spoken?
[30]

2.1.4: Why are there 'ordinal' and 'cardinal' values for both
[31]   interpret-as and format?

2.1.4 'The detail attribute can be used for all say-as content types.'
[32]   What's a content type in this context?

2.1.4 detail 'strict': 'speak letters with all detail': As opposed
[33]  to what (e.g. in that specific example)?

2.1.4, last table: There seem to be some fixed-width aspects in the
[34]   styling of this table. This should be corrected to allow complete
        viewing and printing at various overall widths.

2.1.4, 4th para (and several similar in other sections):
[35]  "The say-as element can only contain text." would be easier
       to understand; we had to look around to find out whether the
       current phrasing described an EMPTY element or not.

2.1.4. For many languages, there is a need for additional information.
[36]   For example, in German, ordinal numbers are denoted with a number
       followed by a period (e.g. '5.'). They are read depending on case
       and gender of the relevant noun (as well as depending on the use
       of definite or indefinite article).

2.1.4, 4th row of 2nd table: I've seen some weird phone formats, but
[37]  nothing quite like this! Maybe a more normal example would NOT
       pronounce the separators. (Except in the Japanese case, where the
       spaces are (sometimes) pronounced (as 'no').)


2.1.5, <phoneme>:
[38]  It is unclear to what extent this element is designed for
       strictly phonemic and phonetic notations, or also (potentially)
       for notations that are more phonetic-oriented than usual writing
       (e.g. Japanese kana-only, Arabic/Hebrew with full vowels,...)
       and where the boundaries are to other elements such as <say-as>
       and <sub>. This needs to be clarified.

2.1.5 There may be different flavors and variants of IPA (see e.g.
[39]  references in ISO 10646). Please make sure it is clear which
       one is used.

2.1.5 IPA is used both for phonetic and phonemic notations. Please
[40]  clarify which one is to be used.

2.1.5 This may need a note that not all characters used in IPA are
[41]  in the IPA block.

2.1.5 This seems to say that the only (currently) allowed value for
[42]  alphabet is 'ipa'. If this is the case, this needs to be said
       very clearly (and it may as well be defined as default, and
       in that case the alphabet attribute to be optional). If there
       are other values currently allowed, what are they? How are
       they defined?

2.1.5 'alphabet' may not be the best name. Alphabets are sets of
[43]  characters, usually with an ordering. The same set of characters
       could be used in totally different notations.

2.1.5 What are the interactions of <phoneme> for foreign language
[44]  segments? Do processors have to handle all of IPA, or only the
       phonemes that are used in a particular language? Please clarify.

2.1.5, 1st example:  Please try to avoid character entities, as it
[45] suggests strongly that this is the normal way to input this stuff.
      (see also issue about utf-8 vs. iso-8859-1)


2.1.5 and 2.1.6: The 'alias' and 'ph' attributes in some
[46]  cases will need additional markup (e.g. for fine-grained
       prosody, but also for additional emphasis, bidirectionality).
       This would also help tools for translation,...
       But markup is not possible for attributes. These attributes
       should be changed to subelements, e.g. similar to the <desc>
       element inside <audio>.

2.1.5 and 2.1.6: Can you specify a null string for the ph and alias
[47] attributes? This may be useful in mixed formats where the
      pronunciation is given by another means, e.g. with ruby annotation.


2.1.6 The <sub> element may easily clash or be confused with <sub>
[48]  in HTML (in particular because the specification seems to be
       designed to allow combinations with other markup vocabularies
       without using different namespaces). <sub> should be renamed,
       e.g. to <subst>.

2.1.6 For abbreviations,... there are various cases. Please check
[49]  that all the cases in
       http://lists.w3.org/Archives/Member/w3c-i18n-ig/2002Mar/0064.html
       are covered, and that the users of the spec know how to handle
       them.

2.1.6, 1st para: "the specified text" ->
[50]   "text in the alias attribute value".


2.2.1, between the tables: "If there is no voice available for the
[51]  requested language ... select a voice ... same language but different
       region..."  I'm not sure this makes sense.  I could understand that
       if there is no en-GB voice you'd maybe go for an en-US voice - this
       is a different DIALECT of English.  If there are no Japanese voices
       available for Japanese text, I'm not sure it makes sense to use an
       English voice. What happens in this situation?

2.2.1 It should be mentioned that in some cases, it may make sense to have
[52]  a short piece of e.g. 'fr' text in an 'en' text be spoken by
       an 'en' text-to-speech converter (the way it's often done by
       human readers) rather than to throw an error. This is quite
       different for longer texts, where it's useless to bother a
       user.

2.2.1: We wonder if there's a need for multiple voices (e.g. a group of kids)
[53]

2.2.1, 2nd example: You should include some text here.
[54]

2.2.1 The 'age' attribute should explicitly state that the integer
[55]  is years, not something else.

2.2.1 The variant attribute should say what its index origin is
[56]  (e.g. either starting at 0 or at 1)

2.2.1 attribute name: (in the long term,) it may be desirable to use
[57]  a URI for voices, and to have some well-defined format(s)
       for the necessary data.

2.2.1, first example (and many other places): The line break between
[58]  the <voice> start tag and the text "It's fleece was white as snow."
       will have negative effects on visual rendering.
       (also, "It's" -> "Its")

2.2.1, description of priorities of xml:lang, name, variant,...:
[59]  It would be better to describe this clearly as priorities,
       i.e. to say that for voice selection, xml:lang has highest
       priority,...


2.2.3 What about <break> inside a word (e.g. for long words such as
[60]  German)? What about <break> in cases where words cannot
       clearly be identified (no spaces, such as in Chinese, Japanese,
       Thai). <break> should be allowed in these cases.

2.2.3 and 2.2.4: "x-high" and "x-low": the 'x-' prefix is part of
[61]  colloquial English in many parts of the world, but may be
       difficult to understand for non-native English speakers.
       Please add an explanation.


2.2.4: Please add a note that customary pitch levels and
[62]  pitch ranges may differ quite a bit with natural language, and that
       "high",... may refer to different absolute pitch levels for different
       languages. Example: Japanese generally has a much lower pitch range than
       Chinese.

2.2.4, 'baseline pitch', 'pitch range': Please provide definition/
[63]   short explanation.

2.2.4 'as a percent' -> 'as a percentage'
[64]

2.2.4 What is a 'semitone'? Please provide a short explanation.
[65]

2.2.4 In pitch contour, are white spaces allowed? At what places
[66]  exactly? In "(0%,+20)(10%,+30%)(40%,+10)", I would propose
       to allow whitespace between ')' and '(', but not elsewhere.
       This has the benefit of minimizing syntactic differences
       while allowing long contours to be formatted with line breaks.

2.2.4, bullets: Editorial nit.  It may help the first time reader to
[67]   mention that 'relative change' is defined a little further down.

2.2.4, 4th bullet: the speaking rate is set in words per minute.
[68]  In many languages what constitutes a word is often difficult to
       determine, and varies considerably in average length.
       So there have to be more details to make this work interoperably
       in different languages. Also, it seems that 'words per minute'
       is a nominal rate, rather than exactly counting words, which
       should be stated clearly. A much preferable alternative is to use
       another metric, such as syllables per minute, which has less
       unclarity (not

2.2.4, 5th bullet: If the default is 100.0, how do you make it
[69]  louder given that the scale ranges from 0.0 to 100.0?
       (or, in other words, is the default to always shout?)

2.2.4, Please state whether units such as 'Hz' are case-sensitive
[70] or case-insensitive. They should be case-sensitive, because
      units in general are (e.g. mHz (milliHz) vs. MHz (MegaHz)).


2.3.3 Please provide some example of <desc>
[71]

3.1  Requiring an XML declaration for SSML when XML itself
[72] doesn't require an XML declaration leads to unnecessary
      discrepancies. It may be very difficult to check this
      with an off-the-shelf XML parser, and it is not reasonable
      to require SSML implementations to write their own XML
      parsers or modify an XML parser. So this requirement
      should be removed (e.g. by saying that SSML requires an XML
      declaration when XML requires it).


3.3, last paragraph before 'The lexicon element' subtitle:
[73] Please also say that the determination of
      what is a word may be language-specific.

3.3 'type' attribute on lexicon element: What's this attribute used
[74] for? The media type will be determined from the document that
      is found at the 'uri' URI, or not?


4.1 'synthesis document fragment' -> 'speech synthesis document fragment'
[75]

4.1  Conversion to stand-alone document: xml:lang should not
[76] be removed. It should also be clear whether content of
      non-synthesis elements should be removed, or only the
      markup.


4.4 'requirement for handling of languages': Maybe better to
[77] say 'natural languages', to avoid confusion with markup
      languages. Clarification is also needed in the following
      bullet points.


4.5  This should say that a user agent has to support at least
[78] one natural language.


App A: 'http://www.w3c.org/music.wav': W3C's Web site is www.w3.org.
[79]   But this example should use www.example.org or www.example.com.

App B: 'synthesis DTD' -> 'speech synthesis DTD'
[80]

App D: Why does this mention 'recording'? Please remove or explain.
[81]

App E: Please give a reference for the application to the IETF/IESG/IANA
[82]   for the content type 'application/ssml+xml'.

App F: 'Support for other phoneme alphabets.': What's a 'phoneme alphabet'?
[83]

App F, last paragraph: 'Unfortunately, ... no standard for designating
[84]   regions...': This should be worded differently. RFC 3066 provides
        for the registration of arbitrary extensions, so that e.g.
        en-gb-accent-scottish and en-gb-accent-welsh could be registered.

App F, bullet 3: I guess you already know that intonation
[85]   requirements can vary considerably across languages, so you'll
        need to cast your net fairly wide here.

App G: What is meant by 'input' and 'output' languages? This is the
[86]   first time this terminology is used. Please remove or clarify.

App G: 'overriding the SSML Processor default language': There should
[87]   be no such default language. An SSML Processor may only
        support a single language, but that's different from
        assuming a default language.



Regards,   Martin.
Received on Tuesday, 12 August 2003 14:20:33 UTC