- From: Daniel C. Burnett <Daniel.Burnett@nuance.com>
- Date: Tue, 14 Aug 2007 07:09:22 -0400
- To: "Richard Ishida" <ishida@w3.org>
- Cc: <shuangzw@cn.ibm.com>, "Kazuyuki Ashimura" <ashimura@w3.org>, <public-i18n-core@w3.org>, "w3c-voice-wg" <w3c-voice-wg@w3.org>
Hi Richard, I18N-core,

Thank you so much for your comments, and sorry for the late reply. Your comments suggested several useful things to us and, most importantly, made us aware that our explanations of voice selection and the relationship with xml:lang were sorely inadequate.

My detailed replies are embedded below, preceded by [DB], and approximately represent the current views of the subgroup.

-- dan

-----Original Message-----
From: Richard Ishida [mailto:ishida@w3.org]
Sent: Tuesday, July 03, 2007 6:17 AM
To: Daniel C. Burnett
Cc: shuangzw@cn.ibm.com; 'Kazuyuki Ashimura'; public-i18n-core@w3.org
Subject: RE: [ssml11] Second WD of SSML 1.1 and updated Requirements doc are published

http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html

Lots of useful i18n-related changes to this doc. Thanks.

Here are some comments. I hope they help. I included some nit-like editorial points with the more substantive ones.

===============
Status section

"This document enhances SSML 1.0 [SSML] to provide better support for a broader set of languages."

Presumably that is natural languages rather than markup languages?

[DB] Yes. We will clarify this.

===============
1.5 URI
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S1.5

I think it would be better to define URI directly in terms of RFC 3987 or its successor than referring to the XML Schema definition. I suggest that you adopt a definition like that of XQuery. The XQuery definition reads:

"Within this specification, the term URI refers to a Universal Resource Identifier as defined in [RFC3986] and extended in [RFC3987] with the new name IRI. The term URI has been retained in preference to IRI to avoid introducing new names for concepts such as "Base URI" that are defined or referenced across the whole family of XML specifications."
[DB] When the Voice Browser Working Group was creating the first versions of its specifications, we were encouraged to reference XML Schema, or XML, etc. rather than the RFCs themselves because those W3C documents were considered more stable, or at least more forwards-compatible. We did not want to create our own definitions, but rather refer to definitions created by others whose expertise in the area was likely to be greater than our own. Is the current approach within W3C changing to encourage direct references?

============
3.1.2 xml:lang attribute
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S3.1.2

I suggest: s/to indicate the natural language of the content of the element/to indicate the natural language of the written content of the element/

[DB] Yes, we agree.

I'm thinking it would be useful to say, specifically, that values must conform to BCP 47, rather than the, to me, slightly weak-sounding "BCP 47 can help in understanding how to use this attribute".

[DB] See my reply to your URI comment above.

================
3.1.8.2 w element
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S3.1.8.2

We recently sent a comment to the XQuery and XPath Full Text folks recommending that they drop the word 'word' in favour of 'token', since 'word' is such a complicated thing to define in many languages. I think the same probably applies here, e.g. "to eliminate word segmentation ambiguities" should at least be word/token.

[DB] We are currently leaning in this direction as well, but there is not yet complete agreement.

The i18n WG will probably suggest also replacing the w element with a t element.

[DB] This is a touchy subject. We have spent many hours over the past year discussing the name of this element. The name "w" aligns well with <p> and <s>, and it also suggests the common use for this element of marking words.
However, we are considering adding <token> as a synonym or, more appropriately, rewording our document as you suggest to discuss tokens, defining a <token> element, and then defining <w> to be a synonym for <token>.

I suggest: s/that do not use white-space as a boundary identifier/that do not use white-space as a token boundary identifier/

[DB] We agree.

Note also that Thai does use space as a boundary identifier, but those boundaries are phrasal rather than token level.

[DB] Agreed.

Spec says: [[Thus, "<w><emphasis>hap</emphasis>py</w>" and "<w><emphasis> hap </emphasis> py</w>" would refer to the words "happy" and " hap py", respectively.]]

I think the second example would be written more correctly as <w><emphasis>hap</emphasis> py</w>, with an initial space before the <w>. I'm not sure why the whitespace rules need to be different for <w>. Note, also, that including space before closing markup in some circumstances can cause problems for bidi text (see http://www.w3.org/International/questions/qa-bidi-space).

[DB] Actually, the second example is what we intended, except that the result should have two spaces between the two p's: " hap  py". Our example is intended to make clear that the non-markup contents of the <w> element are, all together, taken as the token to be looked up in the lexicon. This allows tokens containing white space to be defined even for languages that use white space as a token boundary. Outside of the <w> element, tokenization behavior, including white space collapsing or removal, depends upon the natural language being spoken (and perhaps the processor itself, in some circumstances).

The white space issue you mention with bidi text is a visual rendering issue, as we understand it, and therefore not directly relevant to SSML. However, we expect authors to pay close attention to the behavior of white space within <w> and believe that authors taking such care will also use bidi text appropriately.
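The rule that the non-markup contents of <w> are taken together, with white space preserved, as the token can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper name token_text is hypothetical, and the standard-library XML parser stands in for whatever an SSML processor actually does.

```python
import xml.etree.ElementTree as ET


def token_text(w_markup: str) -> str:
    """Return all character data inside a <w> element, white space preserved.

    Hypothetical helper: child markup such as <emphasis> is ignored and only
    the text is concatenated, in document order.
    """
    w = ET.fromstring(w_markup)
    return "".join(w.itertext())


# The two examples quoted from the spec.
print(repr(token_text("<w><emphasis>hap</emphasis>py</w>")))
print(repr(token_text("<w><emphasis> hap </emphasis> py</w>")))
```

In the second call, the space that closes the <emphasis> content and the space that follows </emphasis> are both kept, which is where the two spaces between the two p's come from.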
We will likely change the wording from "white space is significant" to "white space is preserved" to clarify our intent.

Suggestion: s/xml:lang is a defined attribute on the w element to identify the language of the content./xml:lang is a defined attribute on the w element to identify the written language of the content./

[DB] Agreed. We will change this.

Chinese is a little unusual wrt language tags. The first example on purple background includes xml:lang="zh-CN" - I think that if the examples were of Mandarin (Putonghua) Chinese that should be either zh-cmn or zh-Hans, or zh-cmn-Hans. (see http://people.w3.org/rishida/utils/subtags/index.php?searchtext=mandarin&submit=Search&searchtype=2 )

If you are describing the spoken language, I would go for zh-cmn, but I think xml:lang is used to describe the written content, for which zh-Hans is usually more appropriate. If the implementation will derive from xml:lang information about which language to set the voice in, then it would probably be necessary to say that this is, say, Putonghua (Mandarin), in which case you'd probably want to use zh-cmn-Hans.

Of course the examples that follow seem to indicate that this would actually need to be Shanghainese, for which the subtag is zh-wuu. Unfortunately, there is no provision at the moment for zh-wuu-Hans, although that is coming in the next version of BCP 47.

[DB] We believe that using zh-Hans only may be sufficient for visual rendering but is not truly a description of the written content, since it is insufficient for even a human reader unambiguously to determine the intended language. As you suggest above, the processor will derive from xml:lang information about which language the voice will speak, but only in the same way a speaker of a language who could read the language would do so. Thus, it is appropriate to give both the script and the intended dialect or region if an author expects the written text to be interpreted as being from that dialect or region.
In the current draft this has now properly been separated from the accent used to speak the language.

=============
3.2.1 voice element
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S3.2.1

"where both language and accent can be values like you would find in xml:lang"

I think you should specify that values MUST be composed using BCP 47 - otherwise you leave the way open to interoperability problems.

[DB] We agree that this wording needs to be more precise. We will likely use a matching algorithm from RFC4647 as suggested by Addison; see my next email. We will note that certain subtag values may be safely ignored by the processor. For example, the script subtag is irrelevant for accent indication.

"optional attribute indicating the list of languages the voice can speak, with optional accent indication per language, or the empty string "

After reading this through several times, I concluded that the empty string is an alternative to the accent indication (rather than allowing languages="") - i.e. that the languages attribute has to contain something, but it could just be language tag(s). Is that correct?

[DB] No. The languages attribute may have the empty string as a value, meaning that any voice that can read a language (any language!) with some accent (any accent!) is acceptable. The languages attribute may also contain one or more "language:accent" pairs where the ":accent" is optional. We will improve the wording in this section to make this clearer.

If we have <voice languages="fr:zh"> and there is no voice that supports French with a Chinese accent, then presumably a voice that supports French will be a suitable fallback? If so, you should probably say that in the onvoicefailure section.

[DB] We do not permit the fallback as you describe. If there is no voice that can read French with a Chinese accent, then an onvoicefailure will occur.
If the author still wants limited control over language, he can use "priorityselect", which will allow language indication that an intelligent processor can use intelligently.

The example on purple background says
<voice gender="female" languages="en-US" ...
rather than
<voice gender="female" languages="en:en-US" ...
Is this a mistake, or does it mean that accent should be specified with a single language tag where possible, and that the colon separator is only needed for accents that are not expressible in that way, e.g. en:zh?

[DB] This is not a mistake. It means that the author has no accent preference. In the example you reference, the voice may speak US English with a Chinese, Swahili, Urdu, etc. accent. If the author requires a particular accent, he must indicate it.

In the required attribute "The default value for this attribute is "languages"." But if no languages attribute is defined, what is the default language? Is this the language specified by the xml:lang attribute?

[DB] The default value for the languages attribute is the empty string, which means any language. Thus, in the default case, a voice may be selected without any consideration of the languages it can speak.

I think it may be worth repeating in this section that the voice setting for language can be taken from the xml:lang information. I think it would also be useful to have a paragraph and example describing and illustrating the effects of the xml:lang and voice languages settings respectively, and how they cross over.

[DB] The voice setting for language is not taken from the xml:lang information. The author specifically requests a voice that can read and speak a particular language, and this request is independent of the current value of xml:lang. I think what we should explain here is that a processor knows, for any given voice, which values of xml:lang that voice is intended to work with.
The author is now able to indicate that he wants a voice that can work with/read a particular language. What the voice does with that language is then up to the voice, but vendors will likely do the obvious thing and have the voice speak the language that's written.

It may be necessary to clarify what happens if only a fr voice is available but xml:lang says fr-CA and there is no <voice languages="fr"...

[DB] I answered this above, but I agree that we should explain and give examples of what happens in this case.

===============
3.1.12 lang Element
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S3.1.12

I'd vote for <span> as the name. Apart from anything else, that would allow for other uses that may arise in the future, not related to language. You never know...

[DB] We asked for input on this point, so thank you. At this point we believe that it would be too confusing for developers used to SSML 1.0 because of the former convoluted and vague linkage between the voice element and xml:lang. By creating a new element, <lang>, we believe it will help authors to understand that language setting is separate from voice selection (except in the onlangfailure described a few points ago), and we believe it will make them more aware of language changes. In future versions of SSML it may be reasonable to add <span> to the language and use it for a variety of attributes as you suggest.

============
Other

It may be worthwhile specifying expected behaviour when content is non-linguistic or undetermined. See http://www.w3.org/International/questions/qa-no-language

[DB] Good suggestion. We will likely disallow both of these in our languages attribute because they have no meaning for us - we are not defining the language of the content, but which language(s) must be supported by a voice.
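The voice-selection semantics described in the [DB] replies above (an empty languages value accepts any voice, each entry is "language" or "language:accent", and there is no fallback from "fr:zh" to plain French) can be sketched in Python. This is a rough illustration, not the draft's algorithm. The following are assumptions: entries are separated by white space, a voice must satisfy every entry, matching uses RFC 4647 basic filtering (the thread only says a matching algorithm from RFC 4647 will "likely" be used), and the Voice record and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    # Hypothetical voice record: a name plus (language, accent) pairs it supports.
    name: str
    speaks: frozenset


def basic_filter(lang_range: str, tag: str) -> bool:
    """RFC 4647 basic filtering: equal, or a prefix ending at a '-' boundary."""
    r, t = lang_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def parse_languages(value: str):
    """Split a languages value into (language, accent or None) pairs.

    Assumption: entries are separated by white space.
    """
    pairs = []
    for entry in value.split():
        language, _, accent = entry.partition(":")
        pairs.append((language, accent or None))
    return pairs


def satisfies(voice: Voice, pairs) -> bool:
    """Assumption: the voice must satisfy every requested pair."""
    for language, accent in pairs:
        if not any(
            basic_filter(language, v_lang)
            and (accent is None or basic_filter(accent, v_accent))
            for v_lang, v_accent in voice.speaks
        ):
            return False
    return True


def select_voice(voices, languages_value: str):
    """Return the first acceptable voice, or None (where onvoicefailure would apply)."""
    pairs = parse_languages(languages_value)  # empty string gives no constraints
    for voice in voices:
        if satisfies(voice, pairs):
            return voice
    return None


voices = [
    Voice("marie", frozenset({("fr-FR", "fr-FR")})),
    Voice("amy", frozenset({("en-US", "zh")})),
]
print(select_voice(voices, "fr:zh"))       # no French voice with a Chinese accent
print(select_voice(voices, "en-US").name)  # no accent preference stated
```

With these hypothetical voices, "fr:zh" selects nothing even though a French voice exists, which mirrors the no-fallback rule; "en-US" accepts a US English voice regardless of its accent.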
RI

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/People/Ishida/
http://www.w3.org/International/
http://people.w3.org/rishida/blog/
http://www.flickr.com/photos/ishida/

> -----Original Message-----
> From: Daniel C. Burnett [mailto:Daniel.Burnett@nuance.com]
> Sent: 02 July 2007 15:08
> To: Richard Ishida
> Cc: shuangzw@cn.ibm.com; Kazuyuki Ashimura
> Subject: RE: [ssml11] Second WD of SSML 1.1 and updated
> Requirements doc are published
>
> Richard,
>
> Have you had a chance to look at the specification yet? Our
> subgroup meeting in China begins on Wednesday, 4 July (in two
> days), and I would appreciate any early feedback you have
> that we might be able to discuss.
>
> Thanks,
>
> Dan
Received on Tuesday, 14 August 2007 11:09:36 UTC