SSML Last Call Review Issues

SSML last call

I18N WG comments

Response part 1, Response part 2, Response part 3, Joint telecons (18 Sep., 19 Sep., 3 Oct.), Sep/Oct email exchange (Martin, Dan)

Issue Discussion Type Status
1 For some languages, text-to-speech conversion is more difficult than for others. In particular, Arabic and Hebrew are usually written with none or only a few vowels indicated. Japanese often needs separate indications for pronunciation. It was not clear to us whether such cases were considered, and if they had been considered, what the appropriate solution was. SSML should be clear about how it is expected to handle these cases, and give examples. Potential solutions we came up with: a) require/recommend that text in SSML is written in an easily 'speakable' form (i.e. vowelized for Arabic/Hebrew, or with Kana (phonetic alphabet(s)) for Japanese). (Problem: displaying the text visually would not be satisfactory in this case); b) using <sub>; c) using <phoneme> (Problem: only having IPA available would be too tedious for authors); d) reusing some otherwise defined markup for this purpose (e.g. <ruby> from http://www.w3.org/TR/ruby/ for Japanese); e) creating some additional markup in SSML.
Rejected. We reject the notion that on principle this is more difficult for some languages. For all languages supported by synthesis vendors today this is not a problem. As long as there is a way to write the text, the engine can figure out how to speak it. Given the lack of broad support by vendors for Arabic and Hebrew, we prefer not to include examples for those languages.
I suspect from discussions with WAI on this topic and some research with experts in the field, that the lack of broad support by vendors for Arabic and Hebrew is actually a function of the fact that (unvowelled) text in these scripts is more difficult to support than other scripts. Of course, this issue can be circumvented by adding vowels to all text used in SSML - that would probably be feasible for text written specifically for synthesis, but would not be appropriate for text that is intended to be read visually.

I also worry that considering only languages "supported by synthesis vendors today" is running counter to the idea of ensuring universal access. It's like saying it's ok to design the web for English if the infrastructure only supports English. The i18n group is trying to ensure that we remove obstacles to adoption of technology by people from an ever growing circle of languages and cultures.

Agreed with Richard. This is really important, and goes to the core of the I18N activity. There may be a chicken-and-egg problem for Hebrew and Arabic, and the spec should clearly state what is allowed and what not. In addition, there are enough vendors for Japanese, I guess, so Japanese could be used as an example, and Arabic/Hebrew just explained in the text.

S
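Option (b) from the comment above can be sketched as follows. This is only an illustration, not agreed text: it assumes the <sub> element with its alias attribute as discussed elsewhere in this log, and the namespace URI is the one used by the published SSML 1.0 schema rather than anything stated in this discussion. Whether a given processor actually needs the kana reading is engine-dependent.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="ja"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- Content keeps the usual written form; alias carries a kana reading -->
  <sub alias="にほんご">日本語</sub>が分かりません。
</speak>
```

The written form stays suitable for visual display, which is the problem noted for option (a), while the alias gives the processor an unambiguous reading.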
2 General: Tagging for bidirectional rendering is not needed [02] for text-to-speech conversion. But there is some provision for SSML content to be displayed visually (to cover WAI needs). This will not work without adequate support of bidi needs, with appropriate markup and/or hooks for styling.
Rejected. Special tagging for bidirectional rendering would only be needed if there were not already a means of clearly indicating the language, language changes, and the sequence of languages. In SSML it is always clear when a language shift occurs -- either when xml:lang is used or when the <voice> element is used. In any case, the encoding into text handles this itself. We believe that it is sufficient to require a text/Unicode representation for any language text. Visual or other non-audio rendering from that representation is outside the scope of SSML.
Disagree - see the example at http://www.w3.org/International/questions/qa-bidi-controls.html (in the Background) - the bidi algorithm alone is not sufficient to produce the correct ordering of text for display in this case.

xml:lang is not sufficient or appropriate to resolve bidi issues because there are many minority languages that use RTL scripts. This is an important issue.


We would like to discuss this one with you live. In particular, we would like to understand why you believe the xml:lang attribute provides insufficient information to address this.
Why do you need something for SSML visually?

Martin: The problem is embedded text: For example, Latin text with Arabic text inside. The directions of the writing are different. This needs to be handled.

Richard: xml:lang only indicates what language, it has nothing to do with presentation. We need markup to say which text is part of what language, so it can be rendered left-to-right or right-to-left appropriately.

Dan: This has been answered in other languages, e.g., xhtml. Why can't people import namespace elements from xhtml to control this behavior?

Martin: This is one way to solve this issue.

Dan: Visual rendering is NOT the goal of SSML

Jim: Here is a draft statement of a solution: for visual display of bidirectional text, use xhtml tags.

Martin: that is good, the following is better: For control of bidirectional presentation for visual behavior, use xhtml tags such as ...

Dan: we need a general solution to this problem that can be used by any markup language.

Jim: We are confused about what should be done. Let's not say anything in SSML until we have a clear direction of how to proceed.

Dan: We could define something like xmlbase, and then all languages could refer to it. If all xml documents need to render portions of text visually with bidirectionality, then xml needs to support it.

Luc: It doesn't make sense that this should be part of several markups. We need a generic mechanism that crosses all xml languages.

Martin: This is an internationalization requirement on top of an accessibility requirement.

Dan: It is up to the accessibility group to make the decision about how to proceed.

Luc: Here is another idea: users can use Unicode controls and visual renderer

Dan: We have not come to a resolution on this point. I will write up a draft solution for us to begin with in our next telecon. We will schedule another telecon via e-mail for tomorrow.


Dan sent a bidi summary email and Martin responded.
Richard: Bidi is a small change: an attribute with the values left-to-right, right-to-left, right-left-override, and left-right-override. Only Mongolian currently uses top-to-bottom.

Jim: Martin should write a white paper for Philipp Hoschka's group outlining (1) the problem, (2) current preferred solution. Hoschka's group should review this, and figure out next steps.

Martin: And add this as an issue for V3.

Jim: We will add this as an issue for V3.

Martin: I will write the paper and forward it to Hoschka's group for a broader discussion.

S? action
3 General: Is there a tag that allows changing the language in [03] the middle of a sentence (such as <html:span>)? If not, why not? This functionality needs to be provided.
Yes, the <voice> tag. In section 3.1.2 (xml:lang), we will note that the <voice> element can be used to change just the language.
No obvious issue here
S Ok
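A minimal sketch of the answer to point 3, assuming the <voice> element accepts xml:lang as the response states. The sentence and the language codes are illustrative, and the namespace URI is taken from the published SSML 1.0 schema rather than from this discussion.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- voice is used here only to switch the language mid-sentence -->
  <s>The French word for cat is <voice xml:lang="fr">chat</voice>.</s>
</speak>
```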
4 Abstract: 'is part of this set of new markup specifications': Which set? [04]
"this set" refers to "standards to enable access to the Web using spoken interaction" from the previous sentence. If you believe this to be unclear, can you suggest an appropriately compact rewording (since this is text from the one-paragraph abstract)?
No. I suggest "The Voice Browser Working Group has sought to develop standards for markup to enable access to the Web using spoken interaction with voice browsers. The Speech Synthesis Markup Language Specification is one of these standards,..."
Thank you for your suggestion. However, we disagree with the addition of "with voice browsers" since it is too limiting to restrict the use of the Voice Browser Working Group's specifications only to voice browsers. There are already use cases for the SRGS and SSML specifications outside of VoiceXML (for example, handwriting recognition for SRGS and MRCP/ speechsc for both). We currently intend to keep the first sentence as it is.
Dan: Both SSML and SRGS are used outside of VoiceXML.

Dan: change to "SSML is one of these standards and is designed to . . ."

S? approved
5 Intro: 'The W3C Standard' -> 'This W3C Specification'
Accepted
E
6 Intro: Please shortly describe the intended uses of SSML here, [06] rather than having the reader wait for Section 4.
Rejected. We had already planned to rearrange sections such that section 2 now contains the Document Form (formerly section 3.1), Conformance (formerly section 4), Integration (formerly 3.5), and Fetching (formerly 3.6) sections straight off. If you believe this to be insufficient, can you propose a specific text change for section 1?
Think you should still have a short paragraph in the beginning of the intro to indicate intended use of SSML, who should use it, and how.

This will help people:

  • decide whether or not they need to read further
  • help people to understand the application of concepts better as they read the spec (for example, I was always confused about whether this was intended to be used on its own or with other markup such as xhtml, and whether that was untouched or modified existing xhtml. This made it difficult to really understand all the implications of what I was reading straight away.)
E
7 Section 1, para 2: Please shortly describe how SSML and Sable are [07] related or different.
Accepted. We will describe the relationship.
E ok
8 1.1, table: 'formatted text' -> 'marked-up text'
Accepted.
E ok
9 1.1, last bullet: add a comma before 'and' to make [09] the sentence more readable
Accepted
E ok
10 1.2, bullet 4, para 1: It might be nice to contrast the 45 phonemes [10] in English with some other language. This is just one case that shows that there are many opportunities for more internationally varied examples. Please take any such opportunities.
We would welcome a specific text proposal from your group. Any language example is fine with us.
http://pluto.fss.buffalo.edu/classes/psy/jsawusch/psy719/Articulation-2.pdf says Hawai'ian has 11 phonemes. Hawai'ian is indeed very low in phonemes, but 11 seems too low. http://www.ling.mq.edu.au/units/ling210-901/phonology/210_tutorials/tutorial1.html gives 12 with actual details, and may be correct. http://www.sciam.com/article.cfm?articleID=000396B3-70AD-1E6E-A98A809EC5880105 contains other numbers: 18 for Hawai'ian, and more than 100 for !Kung.

We could say something like Hawaiian includes fewer than 15 phonemes. Bernard Comrie's Major Languages of South Asia, The Middle East and Africa lists 29 phonemes for Persian. His book Major Languages of East & South East Asia lists 22 for Tagalog.

The Atlas of Languages, by Comrie et al, lists 14 phonemes for Hawaiian and says that Rotokas, a Papuan language of Bougainville in the North Solomons, is recorded in the Guinness Book of Records as the language with fewest phonemes: 5 vowels and 6 consonants.


Accepted. Thank you for your suggestions. We will apply some of them to this section.
E ok
11 1.2, bullet 4, para 3: "pronunciation dictionary" -> [11] "language-specific pronunciation dictionary"
Accepted with changes. Instead of this change, we will add "(which may be language dependent)" after the word "dictionary".
Dan: item 11: Pronunciation dictionary may be language specific

Martin: OK

E? approved
12 1.2: How is "Tlalpachicatl" pronounced? Other examples may be [12] St.John-Smyth (sinjen-smaithe) or Caius College (keys college), or President Tito (sutto) [president of the republic of Kiribati (kiribass)].
Although it is clear that you wish to see examples in the specification of pronunciations that might not be clear to speakers of American English, it is unclear why you believe they should be included in section 1.2 which describes the process of synthesis itself. Can you please provide a proposal with detailed text changes? We would prefer that you include pronunciations as well (in IPA, as suggested by the specification) since you appear to have specific pronunciations in mind.
Dan: In section 1.2 we describe the synthesis process. What changes do you want?

Martin: Our suggestions are alternative suggestions to the "Mexican" example. Conclusion: Replace the Mexican example with one or two of our examples.


VBWG/I18N folks accept resolution for SSML to replace the existing example with one of the ones given. Luc says he wrote down that VBWG would provide an informal pronunciation (e.g., "keys college") in the specification.
E? ok
13 1.1 and 1.5: Having a 'vocabulary' table in 1.1 and then a [13] terminology section is somewhat confusing. Make 1.1 e.g. more text-only, with a reference to 1.5, and have all terms listed in 1.5.
Accepted. We agree that this is confusing. We will make section 1.1 more text-only and cross-reference as necessary. We will also remove "Vocabulary" from the title of section 1.1.
E ok
14 1.5: The definition of anyURI in XML Schema is considerably wider [14] than RFC 2396/2732, in that anyURI allows non-ASCII characters. For internationalization, this is very important. The text must be changed to not give the wrong impression.
Accepted. We will amend the text to indicate that only the Schema reference is normative and not the references to RFC2396/2732.
S ok
15 1.5 (and 2.1.2): This (in particular 'following the [15] XML specification') gives the wrong impression of where/how xml:lang is defined. xml:lang is *defined* in the XML spec, and *used* in SSML. Descriptions such as 'a language code is required by RFC 3066' are confusing. What kind of language code? Also, XML may be updated in the future to a new version of RFC 3066, SSML should not restrict itself to RFC 3066 (similar to the recent update from RFC 1766 to RFC 3066). Please check the latest text in the XML errata for this.
Accepted. All that you say is correct. We will revise the text to clarify as you suggest.
S ok
16 2., intro: xml:lang is an attribute, not an element. [16]
Accepted. Thank you. We will correct this.
E ok
17 2.1.1, para 1: Given the importance of knowing the language for [17] speech synthesis, the xml:lang should be mandatory on the root speak element. If not, there should be a strong injunction to use it.
Accepted. xml:lang will now be mandatory on the root <speak> element.
S ok
18 2.1.1: 'The version number for this specification is 1.0.': please [18] say that this is what has to go into the value of the 'version' attribute.
Accepted
S ok
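Taken together, points 17 and 18 imply a root element of roughly the following shape. This is a sketch only: the namespace URI is the one used by the published SSML 1.0 schema and is not stated in this discussion, and the language code and text are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- version carries the specification version number, "1.0" (point 18);
     xml:lang is mandatory on the root speak element (point 17) -->
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  Hello, world.
</speak>
```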
19 2.1.2., for the first paragraph, reword: 'To indicate the natural [19] language of an element and its attributes and subelements, SSML uses xml:lang as defined in XML 1.0.'
Accepted with changes. This is related to point 15. We will reword this to correct the problems you mention in that point, but the rewording may vary some from the text you suggest.
E ok
20 The following elements also should allow xml:lang: [20] - <prosody> (language change may coincide with prosody change) - <audio> (audio may be used for foreign-language pieces) - <desc> (textual description may be different from audio, e.g. <desc xml:lang='en'>Song in Japanese</desc> - <say-as> (specific construct may be in different language) - <sub> - <phoneme>
Rejected/Question. For all but the <desc> element, this can be accomplished using the <voice> element. For the <desc> element, it's unclear why the description would be in a language different from that in which it is embedded; can you provide a better use case? In the <voice> element description we will point out that one of its common uses is to change the language. In 2.1.2, we will mention that xml:lang is permitted as a convenience on <p> and <s> only because it's common to change the language at those levels. We recommend that other changes in the language be done with the <voice> element.
Not sure why you should need to use the voice element in addition to these. First, seems like a lot of redundant work.

It is also counter to the general usage of xml:lang in XHTML/HTML, XML, etc. (E.g. you don't usually use a span element if another element already surrounds the text you want to specify.)

Allowing xml:lang on other tags also integrates the language information better into the structure of the document. For example, suppose you wanted to style or extract all descriptions in a particular language - this would be much easier if the xml:lang was associated directly with that content.

It would also help reduce the likelihood of errors where the voice element becomes separated from the element it is qualifying.

Re. "why the description would be in a language different from that in which it is embedded": If the author had embedded, e.g., a sound bite in another language (such as JFK saying "Ich bin ein Berliner"), the desc element could be used to transcribe the text for those who cannot or do not want to play the audio. A similar approach could be used for sites that teach language or multilingual dictionaries to provide a fallback in case the audio cannot be played.


Rejected. As to why we require the <voice> element: Changing the language has strong implications for output voice change in SSML. We found in the end that because of the text normalization, prosody, etc. changes upon changing language in SSML we wanted clear author awareness that changing xml:lang was likely to change the voice and other speaking characteristics. We permitted xml:lang on the <p> and <s> elements only because those are the places where changes in such characteristics are both most common and least disruptive. Regarding the <desc> element and the use cases you've presented: The <desc> element is for a description, not for transcription. For your JFK example, this element would contain "President Kennedy speaking in Berlin" or "President Kennedy's famous German language gaffe", etc. depending on the purpose of the audio. The audio might even be music, for example! For your language teaching, etc. examples, such alternates should go in the content of the <audio> element itself.
Richard: xml:lang should change the visual display of the language.

Dan: We want to be clear that changing xml:lang will change other characteristics for voice applications. We permit xml:lang on <p> and <sentence> elements.

Dan: We can provide a description (if not a transcription) of audio. The content of the <audio> element is text that is rendered if the audio file is not available. xml:lang is not available on the <audio> element.

Martin: <audio> contains content which may use ssml elements.

Dan: correct

Luc: desc is designed to tell you what the audio is if the audio can't be played.

Luc: When you change a language for vocal rendering, that will have an effect on various other parameters (including gender, speed, age, pitch, etc.) which will be disruptive to the listener. There may be breaks between language shifts. Thus we discourage frequent use of xml:lang.

Dan: Think of it as a voice change, not language change.

Martin: This should be documented.

Luc: we have, in section 2.2.1, a description of what happens. After the second example, there is additional description.

Martin: We have made progress in understanding. We will want to discuss this in the group. You have made a strong case for an exception to using xml:lang everywhere.


Martin believes this requires more careful review. Martin to write up an email with general structure of the audio element and desc subelement and present all combinations of language changes at the different levels (outside audio element, inside audio element content, within recorded audio (which might be music), within desc). Would show which combinations make sense and which don't and therefore which combinations still need to be addressed.
S? action
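The distinction drawn in the response and the telecon, between <desc> as a description and the element content as fallback, might look like the sketch below. It is not agreed text: the audio URL is hypothetical, the namespace URI comes from the published SSML 1.0 schema, and using <voice xml:lang="de"> inside the fallback content is one possible reading of the recommendation to change language with <voice>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- Hypothetical recording location -->
  <audio src="http://www.example.com/kennedy-berlin.wav">
    <!-- desc describes the recording; it is not a transcription -->
    <desc>President Kennedy speaking in Berlin</desc>
    <!-- Fallback content, rendered if the audio cannot be played -->
    <voice xml:lang="de">Ich bin ein Berliner.</voice>
  </audio>
</speak>
```

This covers only one of the combinations Martin's action item is meant to enumerate: a language change inside the fallback content, with the description left in the surrounding language.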
21 2.1.2: 'text normalization' (also in 2.1.6): What does this mean? [21] It needs to be clearly specified/explained, otherwise there may be confusion with things such as NFC (see Character Model).
Accepted. We will add a reference, both here and in section 2.1.6, to section 1.2, step 3, where this is described.
E ok
22 2.1.2, example 1: Overall, it may be better to use utf-8 rather than [22] iso-8859-1 for the specification and the examples.
Accepted with changes. The document is already in UTF-8, the default for both XML documents and W3C specifications. We will leave the Italian example in Latin-1. For everything else we will explicitly set the encoding to UTF-8. In the <phoneme> example, we will include the IPA characters in a comment so browsers that can display them will. Because the UTF-8 representation of these symbols is multi-character, they're hard to modify, cut and paste, etc. For that reason we'll leave the entity escape versions in the code itself. We will also comment that one would normally use the UTF-8 representation of these symbols and explain why we put them in a comment.
Looks okay (except that the reply says UTF-8 is multicharacter where it should say multibyte)
Luc thinks multi-character, and Martin thinks multi-byte. Martin suggests VBWG check internally. Luc will check on this one.
E?
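The approach described in the response to point 22 (entity escapes in the markup, literal IPA characters in a comment) could be sketched like this. The transcription is illustrative only, and the namespace URI is the one used by the published SSML 1.0 schema, not something stated in this log.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- The same IPA string as literal UTF-8 characters, for viewers
       that can display them: təmei̥ɾou̥ -->
  <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;">tomato</phoneme>
</speak>
```

An XML parser treats the escaped and the literal forms identically; the choice only affects how easily the source can be edited and displayed.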
23 2.1.2, example 1: To make the example more realistic, in the paragraph [23] that uses lang="ja" you should have Japanese text - not an English transcription, which may not be usable as such on a Japanese text-to-speech processor. In order to make sure the example can be viewed even in situations where there are no Japanese fonts available, and can be understood by everybody, some explanatory text can provide the romanized form. (We can help with Japanese if necessary.)
[23] We would be happy to accept your offer to rewrite our example using appropriate Japanese text.
Nihongo-ga wakarimasen. -> 日本語が分かりません。
Accepted. Thanks for the Japanese text. We will incorporate it into the example.
E ok
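With the Japanese text supplied above, the revised example might read as follows. The English paragraph and the surrounding document are assumptions added for illustration, as is the namespace URI (taken from the published SSML 1.0 schema); only the Japanese sentence and its romanization come from this exchange.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <p>I don't speak Japanese.</p>
  <!-- Romanized form, for readers without Japanese fonts:
       Nihongo-ga wakarimasen. -->
  <p xml:lang="ja">日本語が分かりません。</p>
</speak>
```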
24 2.1.2, 1st para after 1st example: Editorial. We prefer "In the [24] case that a document requires speech output in a language not supported by the processor, the speech processor largely determines the behavior."
Accepted
E ok
25 2.1.2, 2nd para after 1st example: "There may be variation..." [25] Is the 'may' a keyword as in rfc2119? Ie. Are you allowing conformant processors to vary in the implementation of xml:lang? If yes, what variations exactly would be allowed?
Yes, the "may" is a keyword as in rfc2119, and conformant processors are permitted to vary in their implementation of xml:lang in SSML. Although processors are required to implement the standard xml:lang behavior defined by XML 1.0, in SSML the attribute also implies a change in voice which may or may not be observed by the processor. We will clarify this in the specification.
E ok
26 2.1.3: 'A paragraph element represents the paragraph structure' [26] -> 'A paragraph element represents a paragraph'. (same for sentence) Please decide to either use <p> or <paragraph>, but not both (and same for sentence).
Accepted. We accept the editorial change. We will remove the <paragraph> and <sentence> elements.
S ok
27 2.1.4: <say-as>: For interoperability, defining attributes [27] and giving (convincingly useful) values for these attributes but saying that these will be specified in a separate document is very dangerous. Either remove all the details (and then maybe also the <say-as> element itself), or say that the values given here are defined here, but that future versions of this spec or separate specs may extend the list of values. [Please note that this is only about the attribute values, not the actual behavior, which is highly language-dependent and probably does not need to be specified in every detail.]
Accepted. As you suggest, we will remove the examples from this section in order to reduce confusion.
S
28 2.1.4, interpret-as and format, 6th paragraph: requirement that [28] text processor has to render text in addition to the indicated content type is a recipe for bugwards compatibility (which should be avoided). S ??
29 2.1.4, 'locale': change to 'language'.
Accepted
E
30 2.1.4: How is format='telephone' spoken?
How it would be spoken is processor-dependent. The <say-as> element only provides information on how to interpret (or normalize) a set of input tokens, not on how it is to be spoken. Also, as you pointed out in point 27, "format='telephone'" is merely an example and not a specified value, at least not at this time.
no comment
Q
31 2.1.4: Why are there 'ordinal' and 'cardinal' values for both [31] interpret-as and format?
Both are shown as examples to indicate two possible ways it could be done. Neither is actually a specified way to use the element, as you pointed out in point 27.
no comment
Q
32 2.1.4 'The detail attribute can be used for all say-as content types.' [32] What's a content type in this context?
This wording was accidentally left over from an earlier draft. We will correct it.
Q ok
33 2.1.4 detail 'strict': 'speak letters with all detail': As opposed [33] to what (e.g. in that specific example)?
[33] In this example, without the detail attribute a processor might leave out the colon or the dash, or it might not distinguish between lower case and capital letters. However, this is not actually a specified way to use the attribute, as you pointed out in point 27.
no comment
E ok
34 2.1.4, last table: There seem to be some fixed-width aspects in the [34] styling of this table. This should be corrected to allow complete viewing and printing at various overall widths.
Rejected. As you suggested in point 27, we will be removing all of the tables of examples in this section. If and when we reintroduce this table, we will correct any styling errors that remain.
no comment
E ok
35 2.1.4, 4th para (and several similar in other sections): [35] "The say-as element can only contain text." would be easier to understand; we had to look around to find out whether the current phrasing described an EMPTY element or not.
Accepted with changes. This statement you refer to that is present in all of the element descriptions will be modified to more fully describe the content model for the element, although it may not be worded exactly as you suggest.
E ok
36 2.1.4. For many languages, there is a need for additional information. [36] For example, in German, ordinal numbers are denoted with a number followed by a period (e.g. '5.'). They are read depending on case and gender of the relevant noun (as well as depending on the use of definite or indefinite article).
Rejected. We have had considerable discussion on this point. There are two parts to our response: (1) It is assumed that the synthesis processor will use all contextual information already at its disposal in order to render the text and markup it is given. For example, any relevant case or gender information that can be determined from text surrounding the <say-as> element is expected to be used. (2) The ways and contexts in which information other than the specific number value can be encoded via human language are many and varied. For example, the way you count in Japanese varies based on the type of object that you are counting. That level of complexity is well outside the intended use of the <say-as> element. It is expected in such cases that either the necessary contextual information is available, in normal surrounding text, as described in part 1 above, or the text is normalized by the application writer (e.g. "2" -> "zweiten"). We welcome any complete, multilingual proposals for consideration for a future version of SSML.
S
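The second part of the response, normalization by the application writer, can be sketched with the German ordinal from the comment. The sentence is made up for illustration, the namespace URI is the one from the published SSML 1.0 schema, and using <sub> is only one way to do it: the author could equally write "zweiten" directly in the text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="de"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- The author supplies the inflected ordinal instead of relying on
       say-as to derive case and gender -->
  <s>Wir treffen uns am <sub alias="zweiten">2.</sub> Mai.</s>
</speak>
```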
37 2.1.4, 4th row of 2nd table: I've seen some weird phone formats, but [37] nothing quite like this! Maybe a more normal example would NOT pronounce the separators. (Except in the Japanese case, where the spaces are (sometimes) pronounced (as 'no').)
Rejected. As you suggested in point 27, we will be removing these examples altogether. If we should decide to reintroduce them at some point, we would be happy to incorporate a revised or extended example from you.
E ok
38 2.1.5, <phoneme>: [38] It is unclear to what extent this element is designed for strictly phonemic and phonetic notations, or also (potentially) for notations that are more phonetic-oriented than usual writing (e.g. Japanese kana-only, Arabic/Hebrew with full vowels,...) and where the boundaries are to other elements such as <say-as> and <sub>. This needs to be clarified.
Accepted. We will clarify in the text that this element is designed for strictly phonemic and phonetic notations and that the example uses Unicode to represent IPA. We will also clarify that the phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon, while values in <say-as> and <sub> may undergo both.
S ok
39 2.1.5 There may be different flavors and variants of IPA (see e.g. [39] references in ISO 10646). Please make sure it is clear which one is used. S ??
40 2.1.5 IPA is used both for phonetic and phonemic notations. Please [40] clarify which one is to be used.
Accepted. IPA is an alphabet of phonetic symbols. The only representation in IPA is phonetic, although it is common to select specific phones as representative examples of phonemic classes. Also, IPA is only one possible alphabet that can be used in this element. The <phoneme> element will accept both phonetic and phonemic alphabets, and both phonetic and phonemic string values for the ph attribute. We will clarify this and add or reference a description of the difference between phonemic and phonetic.
S ok
41 2.1.5 This may need a note that not all characters used in IPA are [41] in the IPA block.
Accepted
E ok
42 2.1.5 This seems to say that the only (currently) allowed value for [42] alphabet is 'ipa'. If this is the case, this needs to be said very clearly (and it may as well be defined as default, and in that case the alphabet attribute to be optional). If there are other values currently allowed, what are they? How are they defined?
Accepted. Any arbitrary string value is permitted. The only one with a predefined meaning is "ipa". Others are vendor-specific and depend upon the underlying pronunciation model set used by the vendor. There are quality implications to requiring only IPA, so we permit other alphabets. We will add text clarifying this behavior. Also, the Working Group is considering the development of a standardized lexicon format that might address the issue of quality with portability.
Allowing arbitrary values can lead to conflicts, and other interoperability problems. Needs some more discussion.
Although there is a clear desire on both sides for increasing portability, the more immediate concern is about conflicts among implementation-chosen names. VBWG will take this for discussion.
E?
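The naming concern in point 42 is easier to see side by side. In the sketch below, "ipa" is the one value with a predefined meaning; the second alphabet name and its symbols are hypothetical and not defined by any vendor or specification. The namespace URI is the one used by the published SSML 1.0 schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- Predefined alphabet -->
  <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;">tomato</phoneme>
  <!-- Hypothetical vendor-specific alphabet name and symbol set -->
  <phoneme alphabet="x-example-phones" ph="t ah m ey t ow">tomato</phoneme>
</speak>
```

If two vendors independently picked the same arbitrary name for different symbol sets, a document like this would not be portable between them, which is the conflict raised above.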
43 2.1.5 'alphabet' may not be the best name. Alphabets are sets of [43] characters, usually with an ordering. The same set of characters could be used in totally different notations.
Rejected. The term "alphabet" is commonly used in this context within the Speech Recognition/Synthesis community. We do not believe a change is appropriate.
There should be a note explaining the specific use of 'alphabet' in this context.
Ok. We can add such a note.
Agreed that VBWG will add a note.
E? ok
44 2.1.5 What are the interactions of <phoneme> for foreign language [44] segments? Do processors have to handle all of IPA, or only the phonemes that are used in a particular language? Please clarify. Q ??
45 2.1.5, 1st example: Please try to avoid character entities, as it [45] suggests strongly that this is the normal way to input this stuff. (see also issue about utf-8 vs. iso-8859-1)
[45] What would you suggest is the normal way?
Pure character data in utf-8. Perhaps we can help you with this example, if you need.
Accepted with changes. We will change the example as described in the response to point 22.
E ok
46 2.1.5 and 2.1.6: The 'alias' and 'ph' attributes in some [46] cases will need additional markup (e.g. for fine-grained prosody, but also for additional emphasis, bidirectionality). This would also help tools for translation,... But markup is not possible for attributes. These attributes should be changed to subelements, e.g. similar to the <desc> element inside <audio>.
Rejected. Ultimately this problem can be considered to be labelling an arbitrary chunk of SSML for uses other than audio speech production. We do consider this functionality to be useful enough to have in the specification in some form today (as <sub> or something else). The current approach meets the accessibility needs but does not permit markup of the spoken form. Since these elements are primarily intended to be used only for short phrases (such as W3C, Mr. Smyth, etc.), we have not in practice encountered any significant limitations in our use of the existing elements. Changing the current elements would likely result in other changes throughout the specification, something we are loath to do in this version of the specification without a stronger demonstration of practical need. We will revisit the topic of how best to achieve this labelling functionality in the next version of SSML.
S
47 2.1.5 and 2.1.6: Can you specify a null string for the ph and alias [47] attributes? This may be useful in mixed formats where the pronunciation is given by another means, e.g. with ruby annotation.
[47] Rejected. There is no intention that pronunciations can be given by other means within an SSML document. Any use of SSML in this way is outside the scope of the language. Note that pronunciations can of course be given in an external lexicon; it is conceivable that other annotation formats could be used in such a document.
If SSML will be grafted onto ordinary Japanese text written in, say, XHTML it is certain that at some point ruby text will be encountered. This is a visual device, but is character-based, involving a repetition of a portion of text in two different scripts - so the base text and the ruby text would be both read out by the synthesiser. This would not only sound strange, but be very distracting.

What we are asking is for the ability to nullify one of the runs of text.

It seems to me that this could happen in a number of ways:

  1. by removing the base in ruby text
  2. by allowing for the text in the base to be not spoken, either by application of a null string or a style assignment
  3. by the speech processor recognising ruby and dealing with it appropriately.

I would like to know what the SSML group thinks is the best approach, and think that you should add some note about expected behaviour in this case.


Rejected. You have discussed SSML as if it is a module for XHTML, but it isn't. Arbitrary embeddings of other markup languages are ignored (see answer to item 76). As such, Ruby is essentially not permitted in SSML. Since Ruby is for visual rendering, it might make more sense to pre-process your XHTML document to generate valid SSML. We intend to consider modularity in future versions of SSML and would of course welcome your input.
Q
48 2.1.6 The <sub> element may easily clash or be confused with <sub> [48] in HTML (in particular because the specification seems to be designed to allow combinations with other markup vocabularies without using different namespaces). <sub> should be renamed, e.g. to <subst>.
[48] Rejected. We have other elements such as <p> with the same potential conflict. Also, we have not particularly crafted element names to avoid conflicts with other markup vocabularies. We see no direct need to change this element name.
I still think, regardless of the potential for overlapping element names, that it would be more immediately apparent what the meaning of this element was (and therefore more user friendly) if it was called <subst>.
Rejected. We have not seen enough general interest to warrant this change.
S
49 2.1.6 For abbreviations,... there are various cases. Please check [49] that all the cases in http://lists.w3.org/Archives/Member/w3c-i18n-ig/2002Mar/0064.html are covered, and that the users of the spec know how to handle them.
Accepted. We will clarify within the text how application authors should handle the cases presented in the referenced email.
E
50 2.1.6, 1st para: "the specified text" -> [50] "text in the alias attribute value".
Accepted
E
51 2.2.1, between the tables: "If there is no voice available for the [51] requested language ... select a voice ... same language but different region..." I'm not sure this makes sense. I could understand that if there is no en-UK voice you'd maybe go for an en-US voice - this is a different DIALECT of English. If there are no Japanese voices available for Japanese text, I'm not sure it makes sense to use an English voice. What happens in this situation?
The last sentence in this paragraph describes what happens:

It is an error if the processor decides it does not have a voice that sufficiently matches the above criteria.

where error is defined in section 1.5. The short answer is that it is up to the processor to decide whether or not it has a realistic way of rendering the Japanese text. For example, it may be appropriate for the application to attempt to pronounce Japanese with an English accent by using its best mapping to English phonemes (similar to an example you gave in point 52). We will change the words "same language but different region" to be "a variant or dialect of the same language".
Luc points out that we were on purpose vague as to what "closest" means. The parenthesized text is then just an example of what closest might mean.

Martin and Luc point out that the first sentence in this section states that the algorithm is processor-specific, but then we go on to give some algorithm description. The key here is that the following sentences are normative but incomplete, and where they're incomplete the algorithm may be processor-specific. VBWG will take this for rewriting and review.

Richard also suggests changing "requested language" to "requested xml:lang". Agreed.

S
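The fallback discussed for item 51 can be sketched as follows. This is only an illustration of one possible policy, assuming a hypothetical processor that identifies its voices by xml:lang tag alone; the function name `select_voice` and the voice list are not from the specification, and what counts as "closest" remains processor-specific.

```python
def select_voice(requested_lang, available):
    """Pick a voice language tag for the requested xml:lang value.

    Assumed policy (not normative): exact match first, then any variant
    or dialect of the same primary language, otherwise signal an error.
    """
    wanted = requested_lang.lower()
    tags = [tag.lower() for tag in available]
    if wanted in tags:
        return available[tags.index(wanted)]
    primary = wanted.split("-")[0]
    for original, tag in zip(available, tags):
        if tag.split("-")[0] == primary:
            return original
    # The processor decides it has no sufficiently matching voice.
    raise LookupError(f"no voice sufficiently matches {requested_lang!r}")
```

A processor that prefers to attempt Japanese text with an English voice, as the response allows, would replace the final error with its own mapping.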
52 2.2.1 It should be mentioned that in some cases, it may make sense to have [52] a short piece of e.g. 'fr' text in an 'en' text be spoken by an 'en' text-to-speech converter (the way it's often done by human readers) rather than to throw an error. This is quite different for longer texts, where it's useless to bother a user.
[52] Rejected. This behavior is already permitted at processor discretion for arbitrary-length strings of text. Specific words or short phrases can be handled in a more predictable manner by creating custom pronunciations in an external lexicon. We do not believe this needs additional explanation in the document.
Even if this is already allowed at processor discretion, many implementers may forget that this may be a more reasonable behavior, so it should be mentioned.
Accepted. We will describe this situation in the document and provide an example.
E ok (pending final text review)
53 2.2.1: We wonder if there's a need for multiple voices (eg. A group of kids)
We have not had significant demand to standardize a value for this, e.g. <voice name="kids">. Individual processors are of course permitted to provide any voices they wish.
Q
54 2.2.1, 2nd example: You should include some text here.
Accepted. If you provided us with example text in Japanese here we would be more than happy to include it.
Conclusion: Martin will try to send example to Dan by the end of next week, or ignore the comment.
E? action
55 2.2.1 The 'age' attribute should explicitly state that the integer [55] is years, not something else.
Accepted
E ok
56 2.2.1 The variant attribute should say what its index origin is [56] (e.g. either starting at 0 or at 1)
Accepted. The text and schema will be adjusted to clarify that this attribute can only contain positive integers.
E
57 2.2.1 attribute name: (in the long term,) it may be desirable to use [57] an URI for voices, and to have some well-defined format(s) for the necessary data.
Rejected. This is an interesting suggestion that we will be happy to consider for the next version of SSML (after 1.0).
please consider this for the next version
S ok
58 2.2.1, first example (and many other places): The line break between [58] the <voice> start tag and the text "It's fleece was white as snow." will have negative effects on visual rendering. (also, "It's" -> "Its")
We will correct the typo. Can you further explain your concern with the line break? We do not understand the problem.
Dan: We often split text strings across several lines to improve readability of the code. We don't understand your comment that the line break will have negative effects on visual rendering.

Martin: There are lots of little potential trouble points here. The visual rendering doesn't need to be perfect.

Richard: We don't see any big problems.

Dan: I will officially reject this one

Richard: We formally withdraw item 58.

E? withdrawn
59 2.2.1, description of priorities of xml:lang, name, variant,...: [59] It would be better to describe this clearly as priorities, i.e. to say that for voice selection, xml:lang has highest priority,...
Accepted with changes. We like the existing text and will keep it. However, we will also add (upfront) a description based on priorities as you suggest.
E ok (pending final text review)
60 2.2.3 What about <break> inside a word (e.g. for long words such as [60] German)? What about <break> in cases where words cannot clearly be identified (no spaces, such as in Chinese, Japanese, Thai). <break> should be allowed in these cases.
[60] Rejected. This is a tokenization issue. Tokens in SSML are delimited both by white space and by SSML elements. You can write a word as two separate words and it will have a break, you can insert an SSML element, or you can use stress marks externally. For Asian languages with characters without spaces to delimit words, if you insert SSML elements it automatically creates a boundary between words. You can use a similar approach for German, e.g. with "Fussballweltmeisterschaft". If you insert a <break> in the middle it actually splits the word, but that's probably what you wanted: Fussball<break>weltmeisterschaft. If you wish to insert prosodic controls, that would be handled better via an external lexicon which can provide stress markers, etc.
I'm confused. The reply says rejected, but then goes on to show an example of what we asked for. If a <break> automatically creates a boundary, then just say that it can be used in the middle of a word (or phrase in languages without spaces) and that's what happens.
Accepted. Inserting any element adds a lexical boundary, so while it is acceptable to insert a break in the middle of a word or phrase, this will create a new lexical boundary, effectively splitting the one word or phrase into two. We will clarify the relationship between words and tokens in the Introduction and that breaking one token into multiple tokens will likely affect how the processor treats it. A simple English example is "cup<break/>board"; the processor will treat this as the two words "cup" and "board" rather than as one word with a pause in the middle.
Dan describes why the VBWG rejected, then accepted, the suggestion. The main point is that <break> inside a token will split the token into two tokens that may be treated differently by the processor than the one token would have been. VBWG's proposal for modifications to the Introduction is accepted. Martin wants to see the text before final approval.
Q ok (pending final text review)
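The token-splitting behavior agreed for item 60 can be illustrated with a small sketch. It assumes, for illustration only, that tokens are delimited by white space and by any element; the helper name `tokens` and the bare `<speak>` wrapper (without the SSML namespace) are assumptions, not part of the specification.

```python
import xml.etree.ElementTree as ET

def tokens(ssml_fragment):
    """Return text tokens, treating white space and elements as boundaries."""
    root = ET.fromstring(f"<speak>{ssml_fragment}</speak>")
    result = []
    # itertext() yields each run of character data in document order;
    # an element between two runs therefore always ends the current token.
    for run in root.itertext():
        result.extend(run.split())
    return result

print(tokens("cup<break/>board"))  # two tokens: "cup" and "board"
print(tokens("cupboard"))          # one token
```

A real processor would go on to treat the two tokens as separate words, which is why the pronunciation may differ from that of the unsplit word.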
61 2.2.3 and 2.2.4: "x-high" and "x-low": the 'x-' prefix is part of [61] colloquial English in many parts of the world, but may be difficult to understand for non-native English speakers. Please add an explanation.
Accepted. We will add such an explanation.
E
62 2.2.4: Please add a note that customary pitch levels and [62] pitch ranges may differ quite a bit with natural language, and that "high",... may refer to different absolute pitch levels for different languages. Example: Japanese generally has a much lower pitch range than Chinese.
Accepted
E ok
63 2.2.4, 'baseline pitch', 'pitch range': Please provide definition/ [63] short explanation.
Accepted. We will add this.
E ok
64 2.2.4 'as a percent' -> 'as a percentage'
Accepted
E ok
65 2.2.4 What is a 'semitone'? Please provide a short explanation.
Accepted. We will add this.
E ok
66 2.2.4 In pitch contour, are white spaces allowed? At what places [66] exactly? In "(0%,+20)(10%,+30%)(40%,+10)", I would propose to allow whitespace between ')' and '(', but not elsewhere. This has the benefit of minimizing syntactic differences while allowing long contours to be formatted with line breaks.
Accepted
Q ok
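The whitespace rule proposed in item 66 could be checked along these lines. This is a hedged sketch: the function name `parse_contour` is hypothetical, the target values are kept as opaque strings because their grammar is not given here, and trailing whitespace is rejected on the assumption that "not elsewhere" covers it.

```python
import re

# One "(position%,target)" pair with no internal whitespace.
_PAIR = re.compile(r"\((\d+(?:\.\d+)?)%,([^\s(),]+)\)")
# Whitespace is tolerated only when another pair follows it.
_GAP = re.compile(r"\s+(?=\()")

def parse_contour(value):
    """Parse a contour string into (position_percent, target) pairs."""
    pairs = []
    pos = 0
    while pos < len(value):
        match = _PAIR.match(value, pos)
        if match is None:
            raise ValueError(f"bad contour at offset {pos}: {value!r}")
        pairs.append((float(match.group(1)), match.group(2)))
        pos = match.end()
        gap = _GAP.match(value, pos)
        if gap:
            pos = gap.end()
    return pairs
```

With this rule a long contour can be wrapped across lines between pairs, while a value such as "(0%, +20)" is rejected.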
67 2.2.4, bullets: Editorial nit. It may help the first time reader to [67] mention that 'relative change' is defined a little further down.
Accepted
E ok
68 2.2.4, 4th bullet: the speaking rate is set in words per minute. [68] In many languages what constitutes a word is often difficult to determine, and varies considerably in average length. So there have to be more details to make this work interoperably in different languages. Also, it seems that 'words per minute' is a nominal rate, rather than exactly counting words, which should be stated clearly. A much preferable alternative is to use another metric, such as syllables per minute, which has less unclarity (not
Accepted with changes. Because of the difficulty in accurately defining the meaning of words per minute, syllables per minute, or phonemes per minute across all possible languages, we have decided to replace such specification with a number that acts as a multiplier of the default rate. For example, a value of 1 means a speaking rate equal to the default rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate is processor-specific and will usually vary across both languages and voices. Percentage changes relative to the current rate are still permitted. Note that the effect of setting a specific words per minute rate (for languages for which that makes sense) can be achieved by explicitly setting the duration for the contained text via the duration attribute of the <prosody> element. The duration attribute can be used in this way for all languages and is therefore the preferred way of precisely controlling the rate of speech when that is desired.
S ok (pending final text review)
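The arithmetic behind the multiplier in the response to item 68 is simple enough to sketch. The function names and the unit of the default rate are assumptions made for illustration; the default rate itself is processor-specific, and the requirement that the multiplier be positive is also an assumption.

```python
def effective_rate(default_rate, multiplier):
    """Scale a processor-specific default rate by a rate multiplier."""
    if multiplier <= 0:
        # Assumed restriction; the response does not state the valid range.
        raise ValueError("multiplier must be positive")
    return default_rate * multiplier

def multiplier_for_duration(default_duration_s, target_duration_s):
    """Multiplier at which text lasting default_duration_s fits the target."""
    return default_duration_s / target_duration_s

print(effective_rate(180, 2))          # twice the default rate
print(effective_rate(180, 0.5))        # half the default rate
print(multiplier_for_duration(10, 5))  # speak twice as fast to halve the time
```

The second helper shows why a fixed duration gives language-independent control: it needs no definition of a word, syllable, or phoneme.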
69 2.2.4, 5th bullet: If the default is 100.0, how do you make it [69] louder given that the scale ranges from 0.0 to 100.0? (or, in other words, is the default to always shout?)
Maximum volume does not equal shouting. Shouting is actually a factor of several prosodic changes, only one of which is volume. Our internal poll determined that maximum volume was the default for most synthesis processors. The assumption is that you can a) reduce the volume within SSML and b) set the final true volume to anything you want through whatever general audio controls your audio system (PC volume control, speaker knob) has available.
Response seems to explain some, but doesn't seem conclusive. Also, is this explained in the spec?
What is inconclusive? Also, you had asked us a question rather than suggesting or implying a change in the spec. Are you asking for a particular change to the specification as well?
S?
70 2.2.4, Please state whether units such as 'Hz' are case-sensitive [70] or case-insensitive. They should be case-sensitive, because units in general are (e.g. mHz (milliHz) vs. MHz (MegaHz))
Accepted. Although the units are already marked as case-sensitive in the Schema, we will clarify in the text that such units are case-sensitive.
E
71 2.3.3 Please provide some example of <desc>
Accepted. We will add an example.
E ok
72 3.1 Requiring an XML declaration for SSML when XML itself [72] doesn't require an XML declaration leads to unnecessary discrepancies. It may be very difficult to check this with an off-the-shelf XML parser, and it is not reasonable to require SSML implementations to write their own XML parsers or modify an XML parser. So this requirement should be removed (e.g. by saying that SSML requires an XML declaration when XML requires it).
Accepted. This is a general problem that applies to all of the specifications from the Voice Browser Working Group. We will address it in a consistent manner across all of our specifications.
Seems okay, but not completely clear what will be done.
Oops, you're right. We agree with and accept your suggestion to remove this requirement.
S
73 3.3, last paragraph before 'The lexicon element' subtitle: [73] Please also say that the determination of what is a word may be language-specific.
Accepted. We will clarify this.
E
74 3.3 'type' attribute on lexicon element: What's this attribute used [74] for? The media type will be determined from the document that is found at the 'uri' URI, or not?
It is occasionally the case that no media type is available; some examples are an HTTP request that does not return a media type and a local file access. The "type" attribute can be used in this case to indicate the type of the document. Also, some schemes provide for content negotiation when multiple valid documents (in different formats) are available, in which case the "type" attribute functions as a preferred type indicator.
Will this explanation be added to the spec?
Although you did not originally request this explanation be added to the specification, we will add such an explanation to the document.
Q
75 4.1 'synthesis document fragment' -> 'speech synthesis document fragment'
Accepted
E ok
76 4.1 Conversion to stand-alone document: xml:lang should not [76] be removed. It should also be clear whether content of non-synthesis elements should be removed, or only the markup.
Accepted. Good point about xml:lang. We will modify the text to indicate that everything in our schema (including xml:lang, xml:base, etc.) is to be retained in the conversion and that all other non-synthesis namespace elements and their contents should be removed.
E ok (pending final text review)
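The conversion rule agreed for item 76 might look like the following sketch. It assumes the SSML 1.0 namespace URI and a hypothetical helper named `strip_foreign`; it handles only foreign elements (not foreign attributes) and relies on the default parser, which discards comments.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def strip_foreign(element):
    """Remove child elements outside the SSML namespace, with their contents.

    Text that follows a removed element (its tail) belongs to the parent,
    so it is reattached to the preceding sibling or to the parent itself.
    Attributes such as xml:lang and xml:base are left untouched.
    """
    previous = None
    for child in list(element):
        if child.tag.startswith("{" + SSML_NS + "}"):
            strip_foreign(child)
            previous = child
            continue
        tail = child.tail or ""
        if previous is not None:
            previous.tail = (previous.tail or "") + tail
        else:
            element.text = (element.text or "") + tail
        element.remove(child)
    return element
```

Calling `strip_foreign(ET.fromstring(document))` on a fragment with embedded XHTML keeps the SSML elements and the xml:lang attribute while dropping the XHTML elements and their text.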
77 4.4 'requirement for handling of languages': Maybe better to [77] say 'natural languages', to avoid confusion with markup languages. Clarification is also needed in the following bullet points.
Accepted. We will make this change.
E ok
78 4.5 This should say that a user agent has to support at least [78] one natural language.
Accepted. We will add this.
E
79 App A: 'http://www.w3c.org/music.wav': W3C's Web site is www.w3.org. [79] But this example should use www.example.org or www.example.com.
Accepted. We will correct this.
E
80 App B: 'synthesis DTD' -> 'speech synthesis DTD'
Accepted
E ok
81 App D: Why does this mention 'recording'? Please remove or explain.
Accepted with changes. This was accidentally left in when originally copied from the VoiceXML specification. It will be corrected.
E ok
82 App E: Please give a reference for the application to the IETF/IESG/IANA [82] for the content type 'application/ssml+xml'.
Accepted. Good suggestion. This is a general problem that applies to all of the specifications from the Voice Browser Working Group. We will address it in a consistent manner across all of our specifications by providing the most appropriate and relevant references at the time of publication.
E ok
83 App F: 'Support for other phoneme alphabets.': What's a 'phoneme alphabet'?
See the <phoneme> element (section 2.1.5). In short, it is a symbol set for representing the phonemic or phonetic units of a human language.
Please make sure this is clearly defined, and linked
We will add a link to section 2.1.5.
E
84 App F, last paragraph: 'Unfortunately, ... no standard for designating [84] regions...': This should be worded differently. RFC 3066 provides for the registration of arbitrary extensions, so that e.g. en-gb-accent-scottish and en-gb-accent-welsh could be registered.
Accepted. We will revise the text appropriately.
E ok
85 App F, bullet 3: I guess you already know that intonation [85] requirements can vary considerably across languages, so you'll need to cast your net fairly wide here.
No action requested for this document.
Do you have a list of all the deferred issues?
We maintain a list of all issues -- addressed, deferred, or otherwise.
E
86 App G: What is meant by 'input' and 'output' languages? This is the [86] first time this terminology is used. Please remove or clarify.
Accepted. This is old text. We will clarify.
E ok
87 App G: 'overriding the SSML Processor default language': There should [87] be no such default language. An SSML Processor may only support a single language, but that's different from assuming a default language.
Accepted. As with item 86, this is old text. We will correct this.
E ok
88 The appendices should be ordered so that the normative ones appear before the informative ones.
Accepted
E
89 This is an important topic that has been discussed with other groups since we did the review.

There are a number of elements that allow only PCDATA content and attributes containing text to be spoken (eg. the alias attribute of the <sub> element, and the <desc> element).

Use of PCDATA precludes the possibility of language change or bidi markup for a part of the text.

Proposed changes:

  1. Elements should always allow for bidi markup and language change to be applied.
  2. Attributes containing text that will be spoken should be converted to elements.

[Note: we have recently discussed this with the HTML WG wrt XHTML2.0 and they have agreed to take similar action as we are recommending here.]


Rejected. This is a new request well outside the timeframe for comments on this specification. We agree with the principle and will happily consider this request for a future version of SSML beyond 1.0.
Was this a late request? If yes, can we move this to the issues list for the next version?
Yes, this was a late request. We will happily consider it for versions of SSML beyond 1.0.
S