- From: Richard Ishida <ishida@w3.org>
- Date: Fri, 4 Jul 2003 18:54:37 +0100
- To: "'Daniel Burnett'" <burnett@nuance.com>, <w3c-i18n-ig@w3.org>
- Cc: <www-voice@w3.org>
Dan,

Martin asked me to point you to our responses to your responses
regarding SSML at
http://www.w3.org/International/2003/ssml10/ssml-feedback.html

Thank you for the time you have dedicated to our comments. We hope
these additional responses prove helpful.

Best regards,
Richard.

============
Richard Ishida
W3C

tel: +44 1753 480 292
http://www.w3.org/International/
http://www.w3.org/People/Ishida/

> -----Original Message-----
> From: w3c-i18n-ig-request@w3.org
> [mailto:w3c-i18n-ig-request@w3.org] On Behalf Of Daniel Burnett
> Sent: 09 June 2003 19:16
> To: w3c-i18n-ig@w3.org
> Cc: www-voice@w3.org
> Subject: RE: Consolidated comments on SSML
>
> Dear Martin (and the Internationalization Working Group),
>
> Thank you again for your very thorough review of the SSML
> specification. This email contains the second big block of
> responses. Remaining points will be addressed in later emails.
>
> If you believe we have not adequately addressed your issues
> with our responses, please let us know as soon as possible.
> If we do not hear from you within 14 days, we will take this
> as tacit acceptance. Given the volume of responses in this
> email, we understand that a complete review by you may take
> longer than this amount of time; if so, we would appreciate
> an estimate as to when you might be able to complete your review.
>
> Once again, thank you for your thorough and considered input
> on the specification.
>
> -- Dan Burnett
>
> Synthesis Team Leader, VBWG
>
> [VBWG responses follow]
>
> [1] Rejected. We reject the notion that on principle this is
> more difficult for some languages. For all languages
> supported by synthesis vendors today this is not a problem.
> As long as there is a way to write the text, the engine can
> figure out how to speak it. Given the lack of broad support
> by vendors for Arabic and Hebrew, we prefer not to include
> examples for those languages.
>
> [2] Rejected.
> Special tagging for bidirectional rendering
> would only be needed if there were not already a means of
> clearly indicating the language, language changes, and the
> sequence of languages. In SSML it is always clear when a
> language shift occurs -- either when xml:lang is used or when
> the <voice> element is used. In any case, the encoding into
> text handles this itself. We believe that it is sufficient
> to require a text/Unicode representation for any language
> text. Visual or other non-audio rendering from that
> representation is outside the scope of SSML.
>
> [7] Accepted. We will describe the relationship.
>
> [13] Accepted. We agree that this is confusing. We will
> make section 1.1 more text-only and cross-reference as
> necessary. We will also remove "Vocabulary" from the title of
> section 1.1.
>
> [17] Accepted. xml:lang will now be mandatory on the root
> <speak> element.
>
> [20] Rejected/Question. For all but the <desc> element, this
> can be accomplished using the <voice> element. For the
> <desc> element, it's unclear why the description would be in
> a language different from that in which it is embedded; can
> you provide a better use case? In the <voice> element
> description we will point out that one of its common uses is
> to change the language. In 2.1.2, we will mention that
> xml:lang is permitted as a convenience on <p> and <s> only
> because it's common to change the language at those levels.
> We recommend that other changes in the language be done with
> the <voice> element.
>
> [25] Yes, the "may" is a keyword as in rfc2119, and
> conformant processors are permitted to vary in their
> implementation of xml:lang in SSML. Although processors are
> required to implement the standard xml:lang behavior defined
> by XML 1.0, in SSML the attribute also implies a change in
> voice which may or may not be observed by the processor. We
> will clarify this in the specification.
>
> [26] Accepted. We accept the editorial change.
> We will remove the <paragraph> and <sentence> elements.
>
> [27] Accepted. As you suggest, we will remove the examples
> from this section in order to reduce confusion.
>
> [29] Accepted.
>
> [32] This wording was accidentally left over from an earlier
> draft. We will correct it.
>
> [38] Accepted. We will clarify in the text that this element
> is designed for strictly phonemic and phonetic notations and
> that the example uses Unicode to represent IPA. We will also
> clarify that the phonemic/phonetic string does not undergo
> text normalization and is not treated as a token for lookup
> in the lexicon, while values in <say-as> and <sub> may undergo both.
>
> [40] Accepted. IPA is an alphabet of phonetic symbols. The
> only representation in IPA is phonetic, although it is common
> to select specific phones as representative examples of
> phonemic classes. Also, IPA is only one possible alphabet
> that can be used in this element. The <phoneme> element will
> accept both phonetic and phonemic alphabets, and both
> phonetic and phonemic string values for the ph attribute. We
> will clarify this and add or reference a description of the
> difference between phonemic and phonetic.
>
> [47] Rejected. There is no intention that pronunciations can
> be given by other means within an SSML document. Any use of
> SSML in this way is outside the scope of the language. Note
> that pronunciations can of course be given in an external
> lexicon; it is conceivable that other annotation formats
> could be used in such a document.
>
> [60] Rejected. This is a tokenization issue. Tokens in SSML
> are delimited both by white space and by SSML elements. You
> can write a word as two separate words and it will have a
> break, you can insert an SSML element, or you can use stress
> marks externally. For Asian languages with characters without
> spaces to delimit words, if you insert SSML elements it
> automatically creates a boundary between words.
> You can use a similar approach for German, e.g. with
> "Fussballweltmeisterschaft". If you insert a <break> in the
> middle it actually splits the word, but that's probably what
> you wanted: Fussball<break>weltmeisterschaft. If you wish to
> insert prosodic controls, that would be handled better via an
> external lexicon which can provide stress markers, etc.
>
> [70] Accepted. Although the units are already marked as
> case-sensitive in the Schema, we will clarify in the text
> that such units are case-sensitive.
>
> [78] Accepted. We will add this.
>
> [84] Accepted. We will revise the text appropriately.
>
> > -----Original Message-----
> > From: Martin Duerst [mailto:duerst@w3.org]
> > Sent: Friday, January 31, 2003 7:50 PM
> > To: www-voice@w3.org
> > Cc: w3c-i18n-ig@w3.org
> > Subject: Consolidated comments on SSML
> >
> > Dear Voice Browser WG,
> >
> > These are the Last Call comments on Speech Synthesis
> > Markup Language (http://www.w3.org/TR/speech-synthesis/)
> > from the Core Task Force of the Internationalization (I18N) WG.
> > Please make sure that you send all emails regarding these comments
> > to w3c-i18n-ig@w3.org, rather than to me personally or just to
> > www-voice@w3.org (to which we are not subscribed).
> >
> > These comments are based on review by Richard Ishida and myself and
> > have been discussed and approved at the last I18N Core TF
> > teleconference. They are ordered by section and numbered for easy
> > reference. We have not classified these issues into editorial and
> > substantial, but we think that it should be clear from their
> > description.
> >
> > General:
> > [01] For some languages, text-to-speech conversion is more difficult
> >      than for others. In particular, Arabic and Hebrew are usually
> >      written with none or only a few vowels indicated. Japanese
> >      often needs separate indications for pronunciation.
> >      It was not clear to us whether such cases were considered,
> >      and if they had been considered, what the appropriate
> >      solution was.
> >      SSML should be clear about how it is expected to handle these
> >      cases, and give examples. Potential solutions we came up with:
> >      a) require/recommend that text in SSML is written in an
> >      easily 'speakable' form (i.e. vowelized for Arabic/Hebrew,
> >      or with Kana (phonetic alphabet(s)) for Japanese. (Problem:
> >      displaying the text visually would not be satisfactory in this
> >      case); b) using <sub>; c) using <phoneme> (Problem: only
> >      having IPA available would be too tedious on authors);
> >      d) reusing some otherwise defined markup for this purpose
> >      (e.g. <ruby> from http://www.w3.org/TR/ruby/ for Japanese);
> >      e) creating some additional markup in SSML.
> >
> > General: Tagging for bidirectional rendering is not needed for
> > [02] text-to-speech conversion. But there is some provision
> >      for SSML content to be displayed visually (to cover WAI
> >      needs). This will not work without adequate support of bidi
> >      needs, with appropriate markup and/or hooks for styling.
> >
> > General: Is there a tag that allows to change the language in
> > [03] the middle of a sentence (such as <html:span>)? If not,
> >      why not? This functionality needs to be provided.
> >
> > Abstract: 'is part of this set of new markup specifications': Which set?
> > [04]
> >
> > Intro: 'The W3C Standard' -> 'This W3C Specification'
> > [05]
> >
> > Intro: Please shortly describe the intended uses of SSML here,
> > [06] rather than having the reader wait for Section 4.
> >
> > Section 1, para 2: Please shortly describe how SSML and Sable are
> > [07] related or different.
> >
> > 1.1, table: 'formatted text' -> 'marked-up text'
> > [08]
> >
> > 1.1, last bullet: add a comma before 'and' to make
> > [09] the sentence more readable
> >
> > 1.2, bullet 4, para 1: It might be nice to contrast the 45 phonemes
> > [10] in English with some other language. This is just one case that
> >      shows that there are many opportunities for more internationally
> >      varied examples. Please take any such opportunities.
> >
> > 1.2, bullet 4, para 3: "pronunciation dictionary" ->
> > [11] "language-specific pronunciation dictionary"
> >
> > 1.2: How is "Tlalpachicatl" pronounced? Other examples may be
> > [12] St.John-Smyth (sinjen-smaithe) or Caius College
> >      (keys college), or President Tito (sutto) [president of the
> >      republic of Kiribati (kiribass)]
> >
> > 1.1 and 1.5: Having a 'vocabulary' table in 1.1 and then a
> > [13] terminology section is somewhat confusing.
> >      Make 1.1 e.g. more text-only, with a reference to 1.5,
> >      and have all terms listed in 1.5.
> >
> > 1.5: The definition of anyURI in XML Schema is considerably wider
> > [14] than RFC 2396/2732, in that anyURI allows non-ASCII characters.
> >      For internationalization, this is very important. The text
> >      must be changed to not give the wrong impression.
> >
> > 1.5 (and 2.1.2): This (in particular 'following the
> > [15] XML specification') gives the wrong impression of where/how
> >      xml:lang is defined. xml:lang is *defined* in the XML spec,
> >      and *used* in SSML. Descriptions such as 'a language code is
> >      required by RFC 3066' are confusing. What kind of language code?
> >      Also, XML may be updated in the future to a new version of RFC
> >      3066, SSML should not restrict itself to RFC 3066
> >      (similar to the recent update from RFC 1766 to RFC 3066).
> >      Please check the latest text in the XML errata for this.
> >
> > 2., intro: xml:lang is an attribute, not an element.
> > [16]
> >
> > 2.1.1, para 1: Given the importance of knowing the language for
> > [17] speech synthesis, the xml:lang should be mandatory on the root
> >      speak element. If not, there should be a strong
> >      injunction to use it.
> >
> > 2.1.1: 'The version number for this specification is 1.0.': please
> > [18] say that this is what has to go into the value of the 'version'
> >      attribute.
> >
> > 2.1.2., for the first paragraph, reword: 'To indicate the natural
> > [19] language of an element and its attributes and subelements,
> >      SSML uses xml:lang as defined in XML 1.0.'
> >
> > The following elements also should allow xml:lang:
> > [20] - <prosody> (language change may coincide with prosody change)
> >      - <audio> (audio may be used for foreign-language pieces)
> >      - <desc> (textual description may be different from audio,
> >        e.g. <desc xml:lang='en'>Song in Japanese</desc>)
> >      - <say-as> (specific construct may be in different language)
> >      - <sub>
> >      - <phoneme>
> >
> > 2.1.2: 'text normalization' (also in 2.1.6): What does this mean?
> > [21] It needs to be clearly specified/explained, otherwise there may
> >      be confusion with things such as NFC (see Character Model).
> >
> > 2.1.2, example 1: Overall, it may be better to use utf-8 rather than
> > [22] iso-8859-1 for the specification and the examples.
> >
> > 2.1.2, example 1: To make the example more realistic, in the paragraph
> > [23] that uses lang="ja" you should have Japanese text - not an
> >      English transcription, which may not be usable as such on a
> >      Japanese text-to-speech processor. In order to make sure the
> >      example can be viewed even in situations where there are no
> >      Japanese fonts available, and can be understood by everybody,
> >      some explanatory text can provide the romanized form. (we can
> >      help with Japanese if necessary)
> >
> > 2.1.2, 1st para after 1st example: Editorial.
> > We prefer "In the
> > [24] case that a document requires speech output in a language not
> >      supported by the processor, the speech processor largely
> >      determines the behavior."
> >
> > 2.1.2, 2nd para after 1st example: "There may be variation..."
> > [25] Is the 'may' a keyword as in rfc2119? I.e., are you allowing
> >      conformant processors to vary in the implementation of xml:lang?
> >      If yes, what variations exactly would be allowed?
> >
> > 2.1.3: 'A paragraph element represents the paragraph structure'
> > [26] -> 'A paragraph element represents a paragraph'. (same for sentence)
> >      Please decide to either use <p> or <paragraph>, but not both
> >      (and same for sentence).
> >
> > 2.1.4: <say-as>: For interoperability, defining attributes and
> > [27] giving (convincingly useful) values for these attributes
> >      but saying that these will be specified in a separate document
> >      is very dangerous. Either remove all the details (and then
> >      maybe also the <say-as> element itself), or say that the
> >      values given here are defined here, but that future versions
> >      of this spec or separate specs may extend the list of values.
> >      [Please note that this is only about the attribute values,
> >      not the actual behavior, which is highly language-dependent
> >      and probably does not need to be specified in every detail.]
> >
> > 2.1.4, interpret-as and format, 6th paragraph: requirement that
> > [28] text processor has to render text in addition to the indicated
> >      content type is a recipe for bugwards compatibility (which
> >      should be avoided).
> >
> > 2.1.4, 'locale': change to 'language'.
> > [29]
> >
> > 2.1.4: How is format='telephone' spoken?
> > [30]
> >
> > 2.1.4: Why are there 'ordinal' and 'cardinal' values for both
> > [31] interpret-as and format?
> >
> > 2.1.4 'The detail attribute can be used for all say-as content types.'
> > [32] What's a content type in this context?
> >
> > 2.1.4 detail 'strict': 'speak letters with all detail': As opposed
> > [33] to what (e.g. in that specific example)?
> >
> > 2.1.4, last table: There seem to be some fixed-width aspects in the
> > [34] styling of this table. This should be corrected to allow complete
> >      viewing and printing at various overall widths.
> >
> > 2.1.4, 4th para (and several similar in other sections):
> > [35] "The say-as element can only contain text." would be easier
> >      to understand; we had to look around to find out whether the
> >      current phrasing described an EMPTY element or not.
> >
> > 2.1.4. For many languages, there is a need for additional information.
> > [36] For example, in German, ordinal numbers are denoted with a number
> >      followed by a period (e.g. '5.'). They are read depending on case
> >      and gender of the relevant noun (as well as depending on the use
> >      of definite or indefinite article).
> >
> > 2.1.4, 4th row of 2nd table: I've seen some weird phone formats, but
> > [37] nothing quite like this! Maybe a more normal example would NOT
> >      pronounce the separators. (Except in the Japanese case, where the
> >      spaces are (sometimes) pronounced (as 'no').)
> >
> > 2.1.5, <phoneme>:
> > [38] It is unclear to what extent this element is designed for
> >      strictly phonemic and phonetic notations, or also (potentially)
> >      for notations that are more phonetic-oriented than usual writing
> >      (e.g. Japanese kana-only, Arabic/Hebrew with full vowels,...)
> >      and where the boundaries are to other elements such as <say-as>
> >      and <sub>. This needs to be clarified.
> >
> > 2.1.5 There may be different flavors and variants of IPA (see e.g.
> > [39] references in ISO 10646). Please make sure it is clear which
> >      one is used.
> >
> > 2.1.5 IPA is used both for phonetic and phonemic notations. Please
> > [40] clarify which one is to be used.
> >
> > 2.1.5 This may need a note that not all characters used in IPA are
> > [41] in the IPA block.
> >
> > 2.1.5 This seems to say that the only (currently) allowed value for
> > [42] alphabet is 'ipa'. If this is the case, this needs to be said
> >      very clearly (and it may as well be defined as default, and
> >      in that case the alphabet attribute to be optional). If there
> >      are other values currently allowed, what are they? How are
> >      they defined?
> >
> > 2.1.5 'alphabet' may not be the best name. Alphabets are sets of
> > [43] characters, usually with an ordering. The same set of characters
> >      could be used in totally different notations.
> >
> > 2.1.5 What are the interactions of <phoneme> for foreign language
> > [44] segments? Do processors have to handle all of IPA, or only the
> >      phonemes that are used in a particular language?
> >      Please clarify.
> >
> > 2.1.5, 1st example: Please try to avoid character entities, as it
> > [45] suggests strongly that this is the normal way to input this
> >      stuff. (see also issue about utf-8 vs. iso-8859-1)
> >
> > 2.1.5 and 2.1.6: The 'alias' and 'ph' attributes in some
> > [46] cases will need additional markup (e.g. for fine-grained
> >      prosody, but also for additional emphasis, bidirectionality).
> >      This would also help tools for translation,...
> >      But markup is not possible for attributes. These attributes
> >      should be changed to subelements, e.g. similar to the <desc>
> >      element inside <audio>.
> >
> > 2.1.5 and 2.1.6: Can you specify a null string for the ph and alias
> > [47] attributes? This may be useful in mixed formats where the
> >      pronunciation is given by another means, e.g. with ruby
> >      annotation.
> >
> > 2.1.6 The <sub> element may easily clash or be confused with <sub>
> > [48] in HTML (in particular because the specification seems to be
> >      designed to allow combinations with other markup vocabularies
> >      without using different namespaces). <sub> should be renamed,
> >      e.g.
> >      to <subst>.
> >
> > 2.1.6 For abbreviations,... there are various cases. Please check
> > [49] that all the cases in
> >      http://lists.w3.org/Archives/Member/w3c-i18n-ig/2002Mar/0064.html
> >      are covered, and that the users of the spec know how to handle
> >      them.
> >
> > 2.1.6, 1st para: "the specified text" ->
> > [50] "text in the alias attribute value".
> >
> > 2.2.1, between the tables: "If there is no voice available for the
> > [51] requested language ... select a voice ... same language but
> >      different region..." I'm not sure this makes sense. I could
> >      understand that if there is no en-UK voice you'd maybe go for an
> >      en-US voice - this is a different DIALECT of English. If there
> >      are no Japanese voices available for Japanese text, I'm not sure
> >      it makes sense to use an English voice. What happens in this
> >      situation?
> >
> > 2.2.1 It should be mentioned that in some cases, it may make sense
> > [52] to have a short piece of e.g. 'fr' text in an 'en' text be spoken
> >      by an 'en' text-to-speech converter (the way it's often done by
> >      human readers) rather than to throw an error. This is quite
> >      different for longer texts, where it's useless to bother a
> >      user.
> >
> > 2.2.1: We wonder if there's a need for multiple voices (e.g. a group
> >      of kids)
> > [53]
> >
> > 2.2.1, 2nd example: You should include some text here.
> > [54]
> >
> > 2.2.1 The 'age' attribute should explicitly state that the integer
> > [55] is years, not something else.
> >
> > 2.2.1 The variant attribute should say what its index origin is
> > [56] (e.g. either starting at 0 or at 1)
> >
> > 2.2.1 attribute name: (in the long term,) it may be desirable to use
> > [57] a URI for voices, and to have some well-defined format(s)
> >      for the necessary data.
> >
> > 2.2.1, first example (and many other places): The line break between
> > [58] the <voice> start tag and the text "It's fleece was white as
> >      snow."
> >      will have negative effects on visual rendering.
> >      (also, "It's" -> "Its")
> >
> > 2.2.1, description of priorities of xml:lang, name, variant,...:
> > [59] It would be better to describe this clearly as priorities,
> >      i.e. to say that for voice selection, xml:lang has highest
> >      priority,...
> >
> > 2.2.3 What about <break> inside a word (e.g. for long words such as
> > [60] German)? What about <break> in cases where words cannot
> >      clearly be identified (no spaces, such as in Chinese, Japanese,
> >      Thai). <break> should be allowed in these cases.
> >
> > 2.2.3 and 2.2.4: "x-high" and "x-low": the 'x-' prefix is part of
> > [61] colloquial English in many parts of the world, but may be
> >      difficult to understand for non-native English speakers.
> >      Please add an explanation.
> >
> > 2.2.4: Please add a note that customary pitch levels and
> > [62] pitch ranges may differ quite a bit with natural language, and
> >      that "high",... may refer to different absolute pitch levels for
> >      different languages. Example: Japanese generally has a much lower
> >      pitch range than Chinese.
> >
> > 2.2.4, 'baseline pitch', 'pitch range': Please provide definition/
> > [63] short explanation.
> >
> > 2.2.4 'as a percent' -> 'as a percentage'
> > [64]
> >
> > 2.2.4 What is a 'semitone'? Please provide a short explanation.
> > [65]
> >
> > 2.2.4 In pitch contour, are white spaces allowed? At what places
> > [66] exactly? In "(0%,+20)(10%,+30%)(40%,+10)", I would propose
> >      to allow whitespace between ')' and '(', but not elsewhere.
> >      This has the benefit of minimizing syntactic differences
> >      while allowing long contours to be formatted with line breaks.
> >
> > 2.2.4, bullets: Editorial nit. It may help the first time reader to
> > [67] mention that 'relative change' is defined a little further down.
> >
> > 2.2.4, 4th bullet: the speaking rate is set in words per minute.
> > [68] In many languages what constitutes a word is often difficult to
> >      determine, and varies considerably in average length.
> >      So there have to be more details to make this work interoperably
> >      in different languages. Also, it seems that 'words per minute'
> >      is a nominal rate, rather than exactly counting words, which
> >      should be stated clearly. A much preferable alternative is to use
> >      another metric, such as syllables per minute, which has less
> >      unclarity (not
> >
> > 2.2.4, 5th bullet: If the default is 100.0, how do you make it
> > [69] louder given that the scale ranges from 0.0 to 100.0?
> >      (or, in other words, is the default to always shout?)
> >
> > 2.2.4, Please state whether units such as 'Hz' are case-sensitive
> > [70] or case-insensitive. They should be case-sensitive, because
> >      units in general are (e.g. mHz (milliHz) vs. MHz (MegaHz)).
> >
> > 2.3.3 Please provide some example of <desc>
> > [71]
> >
> > 3.1 Requiring an XML declaration for SSML when XML itself
> > [72] doesn't require an XML declaration leads to unnecessary
> >      discrepancies. It may be very difficult to check this
> >      with an off-the-shelf XML parser, and it is not reasonable
> >      to require SSML implementations to write their own XML
> >      parsers or modify an XML parser. So this requirement
> >      should be removed (e.g. by saying that SSML requires an XML
> >      declaration when XML requires it).
> >
> > 3.3, last paragraph before 'The lexicon element' subtitle:
> > [73] Please also say that the determination of
> >      what is a word may be language-specific.
> >
> > 3.3 'type' attribute on lexicon element: What's this attribute used
> > [74] for? The media type will be determined from the document that
> >      is found at the 'uri' URI, or not?
> >
> > 4.1 'synthesis document fragment' -> 'speech synthesis document
> >      fragment'
> > [75]
> >
> > 4.1 Conversion to stand-alone document: xml:lang should not be
> > [76] removed.
> >      It should also be clear whether content of
> >      non-synthesis elements should be removed, or only the
> >      markup.
> >
> > 4.4 'requirement for handling of languages': Maybe better to say
> > [77] 'natural languages', to avoid confusion with markup
> >      languages. Clarification is also needed in the following
> >      bullet points.
> >
> > 4.5 This should say that a user agent has to support at least
> > [78] one natural language.
> >
> > App A: 'http://www.w3c.org/music.wav': W3C's Web site is www.w3.org.
> > [79] But this example should use www.example.org or www.example.com.
> >
> > App B: 'synthesis DTD' -> 'speech synthesis DTD'
> > [80]
> >
> > App D: Why does this mention 'recording'? Please remove or explain.
> > [81]
> >
> > App E: Please give a reference for the application to the IETF/IESG/IANA
> > [82] for the content type 'application/ssml+xml'.
> >
> > App F: 'Support for other phoneme alphabets.': What's a 'phoneme alphabet'?
> > [83]
> >
> > App F, last paragraph: 'Unfortunately, ... no standard for designating
> > [84] regions...': This should be worded differently. RFC 3066 provides
> >      for the registration of arbitrary extensions, so that e.g.
> >      en-gb-accent-scottish and en-gb-accent-welsh could be
> >      registered.
> >
> > App F, bullet 3: I guess you already know that intonation
> > [85] requirements can vary considerably across languages, so you'll
> >      need to cast your net fairly wide here.
> >
> > App G: What is meant by 'input' and 'output' languages? This is the
> > [86] first time this terminology is used. Please remove or clarify.
> >
> > App G: 'overriding the SSML Processor default language': There should
> > [87] be no such default language. An SSML Processor may only
> >      support a single language, but that's different from
> >      assuming a default language.
> >
> > Regards, Martin.
Received on Friday, 4 July 2003 13:54:57 UTC