RE: [ssml11] Second WD of SSML 1.1 and updated Requirements doc are published from Daniel C. Burnett on 2007-08-14 (public-i18n-core@w3.org from July to September 2007)

From: Daniel C. Burnett <Daniel.Burnett@nuance.com>
Date: Tue, 14 Aug 2007 07:09:22 -0400
To: "Richard Ishida" <ishida@w3.org>
Cc: <shuangzw@cn.ibm.com>, "Kazuyuki Ashimura" <ashimura@w3.org>, <public-i18n-core@w3.org>, "w3c-voice-wg" <w3c-voice-wg@w3.org>
Message-ID: <2AB5541EB33172459EE430FFB66B1EE908731DFD@BN-EXCH01.nuance.com>
Hi Richard, I18N-core,

Thank you so much for your comments, and sorry for the late reply.
Your comments suggested several useful things to us and, most
importantly, made us aware that our explanations of voice selection and
the relationship with xml:lang were sorely inadequate.

My detailed replies are embedded below, preceded by [DB], and
approximately represent the current views of the subgroup.

-- dan

-----Original Message-----
From: Richard Ishida [mailto:ishida@w3.org] 
Sent: Tuesday, July 03, 2007 6:17 AM
To: Daniel C. Burnett
Cc: shuangzw@cn.ibm.com; 'Kazuyuki Ashimura'; public-i18n-core@w3.org
Subject: RE: [ssml11] Second WD of SSML 1.1 and updated Requirements doc
are published

http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html
Lots of useful i18n-related changes to this doc. Thanks. Here are some
comments. I hope they help. I included some nit-like editorial points
with the more substantive ones.


===============
Status section
"This document enhances SSML 1.0 [SSML] to provide better support for a
broader set of languages."

Presumably that is natural languages rather than markup languages?

[DB] Yes.  We will clarify this.

===============
1.5 URI
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S1.5

I think it would be better to define URI directly in terms of RFC 3987
or its successor rather than referring to the XML Schema definition.

I suggest that you adopt a definition like that of XQuery. The XQuery
definition reads:

"Within this specification, the term URI refers to a Universal Resource
Identifier as defined in [RFC3986] and extended in [RFC3987] with the
new name IRI. The term URI has been retained in preference to IRI to
avoid introducing new names for concepts such as "Base URI" that are
defined or referenced across the whole family of XML specifications."

[DB] When the Voice Browser Working Group was creating the first
versions of its specifications, we were encouraged to reference XML
Schema, or XML, etc. rather than the RFCs themselves because those W3C
documents were considered more stable, or at least more
forwards-compatible.  We did not want to create our own definitions, but
rather refer to definitions created by others whose expertise in the
area was likely to be greater than our own.
Is the current approach within W3C changing to encourage direct
references?

============
3.1.2 xml:lang attribute
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S3.1.2

I suggest: s/to indicate the natural language of the content of the
element/to indicate the natural language of the written content of the
element/

[DB] Yes, we agree.

I think it would be useful to say, specifically, that values must
conform to BCP 47, rather than the, to me, slightly weak-sounding "BCP
47 can help in understanding how to use this attribute".

[DB] See my reply to your URI comment above.

================
3.1.8.2 w element
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S3.1.8.2

We recently sent a comment to the XQuery and XPath Full Text folks
recommending that they drop the word 'word' in favour of 'token', since
'word' is such a complicated thing to define in many languages.  I think
the same probably applies here, e.g. "to eliminate word segmentation
ambiguities" should at least be word/token.

[DB] We are currently leaning in this direction as well, but there is
not yet complete agreement.

The i18n WG will probably suggest also replacing the w element with a t
element.

[DB] This is a touchy subject.  We have spent many hours over the past
year discussing the name of this element.  The name "w" aligns well with
<p> and <s>, and it also suggests the common use for this element of
marking words.  However, we are considering adding <token> as a synonym
or, more appropriately, rewording our document as you suggest to discuss
tokens, defining a <token> element, and then defining <w> to be a
synonym for <token>.

I suggest: s/that do not use white-space as a boundary identifier/that
do not use white-space as a token boundary identifier/

[DB] We agree.

Note also that Thai does use space as a boundary identifier, but those
boundaries are phrasal rather than token level.

[DB] Agreed.

Spec says: [[Thus, "<w><emphasis>hap</emphasis>py</w>" and
"<w><emphasis> hap </emphasis> py</w>" would refer to the words "happy"
and " hap py", respectively.]]

I think the second example would be written more correctly as
<w><emphasis>hap</emphasis> py</w>, with an initial space before the
<w>. I'm not sure why the whitespace rules need to be different for <w>.
Note, also, that including space before closing markup in some
circumstances can cause problems for bidi text (see
http://www.w3.org/International/questions/qa-bidi-space).

[DB] Actually, the second example is what we intended, except that the
result should have two spaces between the two p's: " hap  py".  Our
example is intended to make clear that the non-markup contents of the
<w> element are, all together, taken as the token to be looked up in the
lexicon.  This allows tokens containing white space to be defined even
for languages that use white space as a token boundary.  Outside of the
<w> element, tokenization behavior, including white space collapsing or
removal, depends upon the natural language being spoken (and perhaps the
processor itself, in some circumstances).  The white space issue you
mention with bidi text is a visual rendering issue, as we understand it,
and therefore not directly relevant to SSML.  However, we expect authors
to pay close attention to the behavior of white space within <w> and
believe that authors taking such care will also use bidi text
appropriately.
We will likely change the wording from "white space is significant" to
"white space is preserved" to clarify our intent.
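
The "white space is preserved" reading described above can be sketched
in a few lines of Python. This is only an illustration, not SSML
processor behaviour: the helper name token_text is hypothetical, and it
assumes the token is simply the concatenation of all character data
inside <w>, with markup dropped and no white space collapsed.

```python
import xml.etree.ElementTree as ET


def token_text(w_markup: str) -> str:
    """Concatenate the non-markup content of a <w> element.

    White space is kept exactly as written; child markup such as
    <emphasis> contributes its text but not its tags.
    """
    element = ET.fromstring(w_markup)
    return "".join(element.itertext())


# First example from the draft: no white space inside <w>.
print(repr(token_text("<w><emphasis>hap</emphasis>py</w>")))
# 'happy'

# Second example: the spaces inside and after <emphasis> all survive,
# so the token has two spaces between the two p's.
print(repr(token_text("<w><emphasis> hap </emphasis> py</w>")))
# ' hap  py'
```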

Suggestion: s/xml:lang is a defined attribute on the w element to
identify the language of the content./xml:lang is a defined attribute on
the w element to identify the written language of the content./

[DB] Agreed.  We will change this.

Chinese is a little unusual wrt language tags.

The first example on purple background includes xml:lang="zh-CN" - I
think that if the examples were of Mandarin (Putonghua) Chinese that
should be either zh-cmn or zh-Hans, or zh-cmn-Hans. (see
http://people.w3.org/rishida/utils/subtags/index.php?searchtext=mandarin&submit=Search&searchtype=2 )

If you are describing the spoken language, I would go for zh-cmn, but I
think xml:lang is used to describe the written content, for which
zh-Hans is usually more appropriate. If the implementation will derive
from xml:lang information about which language to set the voice in, then
it would probably be necessary to say that this is, say, Putonghua
(Mandarin), in which case you'd probably want to use zh-cmn-Hans.

Of course the examples that follow seem to indicate that this would
actually need to be Shanghainese, for which the subtag is zh-wuu.
Unfortunately, there is no provision at the moment for zh-wuu-Hans,
although that is coming in the next version of BCP 47.

[DB] We believe that using zh-Hans only may be sufficient for visual
rendering but is not truly a description of the written content, since
it is insufficient for even a human reader unambiguously to determine
the intended language.  As you suggest above, the processor will derive
from xml:lang information about which language the voice will speak, but
only in the same way a speaker of a language who could read the language
would do so.  Thus, it is appropriate to give both the script and the
intended dialect or region if an author expects the written text to be
interpreted as being from that dialect or region.  In the current draft
this has now properly been separated from the accent used to speak the
language.


=============
3.2.1 voice element
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S3.2.1

"where both language and accent can be values like you would find in
xml:lang"
I think you should specify that values MUST be composed using BCP 47 -
otherwise you leave the way open to interoperability problems.

[DB] We agree that this wording needs to be more precise.  We will
likely use a matching algorithm from RFC4647 as suggested by Addison;
see my next email.
We will note that certain subtag values may be safely ignored by the
processor.  For example, the script subtag is irrelevant for accent
indication.
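
As a rough illustration of what RFC 4647 matching could look like here,
the Python sketch below implements that RFC's "basic filtering" scheme
(a range matches a tag if they are equal, or if the tag starts with the
range followed by "-", compared case-insensitively). Which RFC 4647
scheme SSML would adopt was not yet decided, so this choice is an
assumption. The helper without_script is hypothetical and is a
simplification: it treats any four-letter alphabetic subtag after the
first position, and before any singleton, as the script subtag.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """RFC 4647 basic filtering: prefix match on whole subtags."""
    language_range = language_range.lower()
    tag = tag.lower()
    if language_range == "*":  # wildcard range matches any tag
        return True
    return tag == language_range or tag.startswith(language_range + "-")


def without_script(tag: str) -> str:
    """Drop the script subtag (hypothetical helper, simplified rules)."""
    subtags = tag.split("-")
    kept = [subtags[0]]
    for index, subtag in enumerate(subtags[1:], start=1):
        if len(subtag) == 1:
            # Singleton: extension or private-use subtags follow; keep as-is.
            kept.extend(subtags[index:])
            break
        if len(subtag) == 4 and subtag.isalpha():
            continue  # four letters in this position: script subtag
        kept.append(subtag)
    return "-".join(kept)


print(basic_filter_match("zh", "zh-cmn-Hans"))   # True
print(basic_filter_match("fr-CA", "fr"))         # False
print(without_script("zh-cmn-Hans"))             # zh-cmn
```

Note that basic filtering only matches from a shorter range to a longer
tag: a request for "fr-CA" does not match a voice tagged just "fr".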

"optional attribute indicating the list of languages the voice can
speak,
with optional accent indication per language, or the empty string " 
After reading this through several times, I concluded that the empty
string is an alternative to the accent indication (rather than allowing
languages="") - i.e. that the languages attribute has to contain
something, but it could just be language tag(s).  Is that correct?

[DB] No.  The languages attribute may have the empty string as a value,
meaning that any voice that can read a language (any language!) with
some accent (any accent!) is acceptable.  The languages attribute may
also contain one or more "language:accent" pairs where the ":accent" is
optional.
We will improve the wording in this section to make this clearer.
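
To make the intended value space concrete, here is a small Python
sketch of parsing a languages attribute value as described above. The
function name parse_languages is hypothetical, and the sketch assumes
that the pairs are separated by white space, which this message does
not state. An empty result stands for "any language, any accent".

```python
from typing import List, Optional, Tuple


def parse_languages(value: str) -> List[Tuple[str, Optional[str]]]:
    """Split a languages value into (language, accent) pairs.

    The accent is None when no ":accent" part is given. An empty
    (or all-white-space) value yields an empty list, meaning that no
    language constraint is placed on the voice.
    """
    pairs = []
    for item in value.split():  # assumption: white-space-separated list
        language, _, accent = item.partition(":")
        pairs.append((language, accent or None))
    return pairs


print(parse_languages(""))             # []
print(parse_languages("en-US"))        # [('en-US', None)]
print(parse_languages("en-US fr:zh"))  # [('en-US', None), ('fr', 'zh')]
```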

If we have <voice languages="fr:zh"> and there is no voice that supports
French with a Chinese accent, then presumably a voice that supports
French will be a suitable fallback?  If so, you should probably say that
in the onvoicefailure section.

[DB] We do not permit the fallback as you describe. If there is no voice
that can read French with a Chinese accent, then an onvoicefailure will
occur.  If the author still wants limited control over language, he can
use "priorityselect", which will allow language indication that an
intelligent processor can use intelligently.

The example on purple background says <voice gender="female"
languages="en-US" ... rather than <voice gender="female"
languages="en:en-US" ...

Is this a mistake, or does it mean that accent should be specified with
a single language tag where possible, and that the colon separator is
only needed for accents that are not expressible in that way, e.g. en:zh?

[DB] This is not a mistake.  It means that the author has no accent
preference.  In the example you reference, the voice may speak US
English with a Chinese, Swahili, Urdu, etc. accent.  If the author
requires a particular accent, he must indicate it.
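
The two rules above (no fallback when a requested accent is missing,
and no accent constraint when none is given) can be sketched in Python
as follows. The Voice record, the voice names and the select function
are all hypothetical, and the exact, case-insensitive tag comparison is
an assumption made for brevity; the real comparison would follow
whatever matching algorithm the specification ends up adopting.

```python
from typing import List, NamedTuple, Optional


class Voice(NamedTuple):
    name: str       # hypothetical voice identifier
    language: str   # language the voice can read
    accent: str     # accent the voice speaks that language with


def select(voices: List[Voice], language: str,
           accent: Optional[str] = None) -> Optional[Voice]:
    """Return the first acceptable voice, or None on voice failure."""
    for voice in voices:
        if voice.language.lower() != language.lower():
            continue
        # No accent requested: any accent is acceptable.
        if accent is None or voice.accent.lower() == accent.lower():
            return voice
    return None  # no fallback; the caller would apply onvoicefailure


voices = [Voice("claire", "fr", "fr"), Voice("amy", "en-US", "zh")]

print(select(voices, "fr", "zh"))  # None: French exists, but not with a Chinese accent
print(select(voices, "en-US"))     # Voice(name='amy', language='en-US', accent='zh')
```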


In the required attribute "The default value for this attribute is
"languages"."  But if no languages attribute is defined, what is the
default language?  Is this the language specified by the xml:lang
attribute?

[DB] The default value for the languages attribute is the empty string,
which means any language.  Thus, in the default case, a voice may be
selected without any consideration of the languages it can speak.

I think it may be worth repeating in this section that the voice setting
for language can be taken from the xml:lang information. I think it
would also be useful to have a paragraph and example describing and
illustrating the effects of the xml:lang and voice languages settings
respectively, and how they cross over.

[DB] The voice setting for language is not taken from the xml:lang
information.  The author specifically requests a voice that can read and
speak a particular language, and this request is independent of the
current value of xml:lang.  I think what we should explain here is that
a processor knows, for any given voice, which values of xml:lang that
voice is intended to work with.  The author is now able to indicate that
he wants a voice that can work with/read a particular language.  What
the voice does with that language is then up to the voice, but vendors
will likely do the obvious thing and have the voice speak the language
that's written.

It may be necessary to clarify what happens if only a fr voice is
available but xml:lang says fr-CA and there is no <voice
languages="fr"...

[DB] I answered this above, but I agree that we should explain and give
examples of what happens in this case.

===============
3.1.12 lang Element
http://www.w3.org/Voice/2007/speech-synthesis11/WD-speech-synthesis11-20070611diff.html#S3.1.12

I'd vote for <span> as the name. Apart from anything else, that would
allow for other uses that may arise in the future, not related to
language. You never know...

[DB] We asked for input on this point, so thank you.  At this point we
believe that it would be too confusing for developers used to SSML 1.0
because of the former convoluted and vague linkage between the voice
element and xml:lang.  By creating a new element, <lang>, we believe it
will help authors to understand that language setting is separate from
voice selection (except in the onlangfailure described a few points
ago), and we believe it will make them more aware of language changes.
In future versions of SSML it may be reasonable to add <span> to the
language and use it for a variety of attributes as you suggest.

============
Other

It may be worthwhile specifying expected behaviour when content is
non-linguistic or undetermined.  See
http://www.w3.org/International/questions/qa-no-language

[DB] Good suggestion.  We will likely disallow both of these in our
languages attribute because they have no meaning for us - we are not
defining the language of the content, but which language(s) must be
supported by a voice.

RI


============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)
 
http://www.w3.org/People/Ishida/
http://www.w3.org/International/
http://people.w3.org/rishida/blog/
http://www.flickr.com/photos/ishida/
 
 

> -----Original Message-----
> From: Daniel C. Burnett [mailto:Daniel.Burnett@nuance.com] 
> Sent: 02 July 2007 15:08
> To: Richard Ishida
> Cc: shuangzw@cn.ibm.com; Kazuyuki Ashimura
> Subject: RE: [ssml11] Second WD of SSML 1.1 and updated 
> Requirements doc are published
> 
> Richard,
> 
> Have you had a chance to look at the specification yet?  Our 
> subgroup meeting in China begins on Wednesday, 4 July (in two 
> days), and I would appreciate any early feedback you have 
> that we might be able to discuss.
> 
> Thanks,
> 
> Dan
Received on Tuesday, 14 August 2007 11:09:36 UTC