WSArch WG review of Charmod LC #2 from Yin Leng Husband on 2002-05-31 (www-i18n-comments@w3.org from May 2002)

From: Yin Leng Husband <Yin-Leng.Husband@hp.com>
Date: Fri, 31 May 2002 15:27:20 +1000
To: www-i18n-comments@w3.org
Cc: w3c-ws-arch@w3.org
Message-ID: <E74B412A1B5FD211AD6C0000F87C38AD045EC7@ozyexc1.itg.qvar.cpqcorp.net>
Re: Last Call # 2 for the Character Model for the World Wide Web
Comments are pertinent to WD 30 April 2002 at:
http://www.w3.org/TR/2002/WD-charmod-20020430/
 
We have found this specification a good reference source on the character
model, so a high proportion of the review comments are editorial in nature
coming from the perspective of a reader learning about character encoding
and normalization issues.
 
The comments are categorized as requested
(Substantive, Editorial, Typo, Question, Other)
--
 

	

1.	Type: E 


*	1.1 Goals and Scope, last paragraph

	o        "Since other W3C specifications will be based on some of the
provisions of this document, without repeating them, software developers
implementing W3C specifications must conform to these provisions."

	o        Unclear what "these provisions" (end of sentence) are since
the first part of the sentence refers to only "some of the provisions".
That is, should software developers implementing W3C specifications conform
to some or all of these provisions?

2.	Type: Q 


*	1.2 Background, 3rd paragraph, 2nd bullet

	o        "covers the widest possible range,"

	o        Unicode covers the widest possible range of what?
Characters? Languages? Scripts? Writing notations?

3.	Type: E 


*	1.2 Background, 3rd paragraph, 3rd bullet

	o        "provides a way of referencing characters independent of
the encoding of a resource,"

	o        Unclear what the "resource" is.   What is the relationship
between the characters being referenced and the "resource"?

	o        Is this the intent? - "provides a way to reference
characters independent of the encoding of the characters," 

4.	Type: E 


*	1.2 Background, 4th paragraph, last sentence

	o        "Unicode now serves as a common reference for W3C
specifications and applications."

	o        Unclear what sort of "reference" is meant.

	o        Is this the intent? - "Unicode now serves as a common
reference character set for W3C specifications and applications" 

5.	Type: E 


*	1.2 Background, 8th paragraph, last bullet

	o        "Use of control codes for various purposes (e.g.
bidirectionality control, symmetric swapping, etc.)."

	o        It would be useful to have links to reference material that
explains the issues.

	o        E.g. "Use of control codes for various purposes (e.g.
bidirectionality control [Unicode Standard 13.2], symmetric swapping
[Unicode Standard 13.3], etc.)."

6.	Type: E 


*	1.2 Background, 9th paragraph, 1st sentence

	o        "It should be noted that such properties also exist in
legacy encodings (where legacy encoding is taken to mean any character
encoding not based on Unicode), and in many cases have been inherited by
Unicode in one way or another from such legacy encodings."

	o        Unclear what "such properties" are.  The previous sentence
talks about "aspects of Unicode" with no mention of "properties".

	o        Is this the intent? - "It should be noted that such aspects
also exist in legacy encodings (where legacy encoding is taken to mean any
character encoding not based on Unicode), and in many cases have been
inherited by Unicode in one way or another from such legacy encodings."

7.	Type: E 


*	2 Conformance, 1st  NOTE, 1st sentence

	o        "RFC 2119 makes it clear that requirements that use SHOULD
are not optional ..."

	o        Inconsistent usage of term "requirements".  The first
paragraph of this Conformance section makes a distinction between
"requirements" and "recommendations".  It says that "requirements are
expressed using the key words "MUST", ... etc.".  This NOTE talks of
"requirements that use SHOULD ..."

8.	Type: S 


*	2 Conformance, 3rd Paragraph, last sentence

	o        " [S] [I] [C] In order to conform to this document,
specifications MUST NOT violate any requirements preceded by [S], software
MUST NOT violate any requirements preceded by [I], and content MUST NOT
violate any requirements preceded by [C]."

	o        How will conformance be enforced?  Are the conformance
requirements in this document testable for violations?

9.	Type: S 


*	2 Conformance, 5th Paragraph, 1st sentence

	o        "[S] If an existing W3C specification does not conform to
the requirements in this document, then the next version of that
specification SHOULD be modified in order to conform"

	o        This lowered (to SHOULD) conformance requirement seems to
contradict that in the preceding paragraph which states that "[S] Every W3C
specification MUST conform to the requirements applicable to specifications,
..."

10.	Type: E 


*	2 Conformance, 5th Paragraph, 1st sentence

	o        "[S] If an existing W3C specification does not conform to
the requirements in this document, then the next version of that
specification SHOULD be modified in order to conform"

	o        The current wording says that the next version is to be
modified in order to conform, without stating the nature of the modification.

	o        Is this the intent? - "[S] If an existing W3C specification
does not conform to the requirements in this document, then the next version
of that specification SHOULD be modified so that it then becomes
conformant."

11.	Type: E 


*	2 Conformance, 6th Paragraph, last sentence

	o        "[I] Where this specification contains a procedural
description, it MUST be understood as a way to specify the desired external
behavior. Implementations MAY use other ways of achieving the same results,
as long as observable behavior is not affected."

	o        "way" in the first sentence refers to "a way to specify"
whereas in the second sentence, the "other ways" are "ways of achieving"
what is specified.  Also current wording "as long as observable behavior is
not affected" is probably not the correct requirement.

	o        Is this the intent? - "[I] Where this specification
contains a procedural description, it MUST be understood as a way to specify
the desired external behavior. Implementations MAY use different means of
achieving the same results, as long as observable behavior is as described."

12.	Type: E 


*	3.1.1 Introduction, 2nd EXAMPLE, 1st  sentence

	o        "Korean Hangul is a featural syllabary ..."

	o        Would be helpful to define "featural syllabary" and explain
distinction between a "syllabary" and "featural syllabary".  The 1st and 2nd
examples give the impression that the distinction is in arranging "into
square syllabic blocks".

13.	Type: E 


*	3.1.1 Introduction, 2nd EXAMPLE, 1st  sentence

	o        "... that combines symbols for individual sounds of the
language ..."

	o        Are these "individual sounds of the language" phonemes or
syllables?

	o        Is this the intent? - "... that combines symbols for
individual phonemes [or syllables] of the language ..."

14.	Type: E 


*	3.1.1 Introduction, 3rd EXAMPLE, 1st  sentence

	o        "Indic scripts are abugidas."

	o        Would be helpful to indicate definition of "abugidas"
explicitly.  E.g. "Indic scripts are abugidas where each consonant letter
carries an inherent vowel that is eliminated or replaced using semi-regular
or irregular ways to combine consonants and vowels into clusters."

15.	Type: E 


*	3.1.1 Introduction, 4th EXAMPLE, 1st  sentence

	o        "Arabic script is an example of an abjad."

	o        Would be helpful to indicate definition of "abjad"
explicitly.  E.g. "Arabic script is an example of an abjad where short vowel
sounds are typically not written at all."

16.	Type: E 


*	3.1.1 Introduction, 2nd last paragraph, 1st  sentence

	o        "The developers of W3C specifications, and the developers
of software based on those specifications, are likely to be more familiar
with usages they have experienced and less familiar with the wide variety of
usages in an international context."

	o        In both instances, it is unclear what "usages" refers to
(usages of what?).

17.	Type: S 


*	3.1.3 Units of visual rendering, 3rd paragraph, 1st  sentence

	o        "[S] [I] Specifications and software MUST NOT assume a
one-to-one mapping between character codes and units of displayed text."

	o        Inconsistency issue?  This sentence speaks of mapping
between "character codes" whereas the third sentence of the first paragraph
of 3.1.3 (There is not a one-to-one correspondence between characters and
glyphs) speaks of mapping between "characters", not "character codes".
Also, in all the other 3.1.x sections, the [S][I] requirements are about non
one-to-one correspondence between "characters", not "character codes".

18.	Type: E 


*	3.1.3 Units of visual rendering, 5th paragraph, 3rd sentence

	o        "The Unicode Standard [Unicode] requires
that characters be stored and interchanged in logical order."

	o        Would be helpful to define "logical order" or to provide
link to reference material such as Unicode Standard, Section 2.2 where it is
defined.

19.	Type: E 


*	3.1.5 Units of collation, 5th EXAMPLE, 1st  sentence

	o         "In Thai the sequence U+0E44 U+0E01 must be sorted as if
it was written U+0E01 U+0E44."

	o        Would be helpful to show the actual glyphs for U+0E44 and
U+0E01.
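
As an illustration of what such a display could be based on, here is a
minimal Python sketch that prints the two code points with their glyphs and
Unicode names, and then applies the reordering that the example describes.
The helper name thai_sort_key and the ranges it checks (U+0E40 to U+0E44 for
preposed vowels, U+0E01 to U+0E2E for consonants) are assumptions made for
this sketch; it is not a real collation algorithm.

```python
import unicodedata

# The two code points quoted in the Thai collation example.
sequence = "\u0E44\u0E01"

# Show each code point with its glyph and its Unicode character name.
for ch in sequence:
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")


def thai_sort_key(text: str) -> str:
    """Naive sketch: swap a preposed vowel with the consonant after it.

    Assumption: preposed vowels are U+0E40..U+0E44 and consonants are
    U+0E01..U+0E2E. Real collation involves much more than this swap.
    """
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        is_prevowel = "\u0E40" <= chars[i] <= "\u0E44"
        is_consonant = "\u0E01" <= chars[i + 1] <= "\u0E2E"
        if is_prevowel and is_consonant:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just reordered
        else:
            i += 1
    return "".join(chars)


# Stored (logical) order is vowel then consonant; the key reverses the pair.
print([f"U+{ord(c):04X}" for c in thai_sort_key(sequence)])
```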

20.	Type: E 


*	3.1.7 Summary, 1st paragraph, 2nd and 3rd sentences

	o         "In the context of the digital representations of text, a
character can be defined informally as a small logical unit of text. Text is
then defined as sequences of characters."

	o        "Character" and "text" are defined circularly.

21.	Type: S 


*	3.6.2 Character encoding identification, 9th paragraph, 2nd sentence

	o        "[S] Specifications MAY define either UTF-8 or UTF-16 as a
default encoding form (or both if they define suitable means of
distinguishing them), but they MUST NOT use any other character encoding as
a default."

	o        Since specifications "MUST NOT use any other character
encoding as a default" other than "either UTF-8 or UTF-16" should the
beginning of the sentence be "[S] Specifications MUST define either UTF-8 or
UTF-16 as a default encoding form... " ?

22.	Type: S 


*	3.6.2 Character encoding identification, 9th paragraph, last
sentence

	o         "[S] Specifications MUST NOT propose the use of heuristics
to determine the encoding of data."

	o        It would be helpful to either give examples of the
undesirable "heuristics" or the reasons for banning "use of heuristics".
Would the absence of a BOM in UTF-8 encoding be considered use of heuristics
for identifying encoding?
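
To make the question concrete, identification that relies only on an
explicit signature might be sketched as below. The function name
encoding_from_bom is hypothetical, and reporting a missing BOM as "unknown"
instead of guessing is only one possible reading of "no heuristics"; the
draft does not say this.

```python
import codecs
from typing import Optional


def encoding_from_bom(data: bytes) -> Optional[str]:
    """Return an encoding name only when a byte order mark is present.

    Sketch only: UTF-32 signatures are deliberately ignored, and no
    content inspection (i.e. no heuristic) is attempted.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # No signature: the caller has to fall back on a declared or
    # default encoding rather than on a guess.
    return None


print(encoding_from_bom(codecs.BOM_UTF8 + b"abc"))
print(encoding_from_bom(codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")))
print(encoding_from_bom(b"abc"))
```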

23.	Type: E 


*	3.6.2 Character encoding identification, 12th paragraph, last
sentence

	o        "[I] On interfaces to other protocols, software SHOULD
support conversion ..."

	o        In the phrase "to other protocols", which is the base
protocol that the "other protocols" are being distinguished from?

	o        Is this the intent? - "[I] On interfaces to protocols,
software SHOULD support conversion ..."

24.	Type: S 


*	3.6.2 Character encoding identification, 12th paragraph, last
sentence

	o        "[I] On interfaces to other protocols, software SHOULD
support conversion between Unicode encoding forms as well as any other
necessary conversions."

	o        Should it be "between Unicode encoding forms" or "to Unicode
encoding forms" or "both between and to Unicode encoding forms"?
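
The distinction being asked about can be shown with a small Python sketch.
The helper name convert is hypothetical; it only illustrates the difference
between converting "between" two Unicode encoding forms and converting "to"
one from a legacy encoding (ISO-8859-1 is chosen here purely as an example).

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source).encode(target)


text = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

# "Between" Unicode encoding forms: UTF-8 to UTF-16 (big-endian, no BOM).
utf8_bytes = text.encode("utf-8")
print(convert(utf8_bytes, "utf-8", "utf-16-be").hex())

# "To" a Unicode encoding form: legacy ISO-8859-1 to UTF-8.
latin1_bytes = text.encode("iso-8859-1")
print(convert(latin1_bytes, "iso-8859-1", "utf-8").hex())
```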

25.	Type: Q 


*	3.7 Character Escaping, 1st  paragraph, 3rd sentence

	o        "There is also a need, often satisfied by the same or
similar mechanisms, to express characters not directly representable in the
character encoding of instances of the language."

	o        Why "instances of the language" and not just "the language"
?

26.	Type: Q 


*	3.7 Character Escaping, 1st  paragraph, last sentence

	o        " ... a language's syntax, which is itself expressed as
characters represented at the character encoding level."

	o        Why is a language's syntax expressed as characters
"represented at the character encoding level" and not just as characters in
the sense of abstract symbols?

27.	Type: Q 


*	3.7 Character Escaping, 4th [S] requirement, 2nd and last sentences

	o        "Escape syntaxes where the end is determined by a character
outside the set of characters admissible in the character escape itself
SHOULD be avoided. ... Forms like SPREAD's &UABCD; [SPREAD] or XML's
&#xhhhh;, where the character escape is explicitly terminated by a
semicolon, are much better."

	o        The examples of good forms ("where the character escape is
explicitly terminated by a semicolon") in the last sentence seem to exhibit
the characteristics ("where the end is determined by a character outside the
set of characters admissible in the character escape itself") of escape
syntaxes that SHOULD be avoided.
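
A small Python sketch may help pin down the question. In both forms below
the escape ends at a character outside the set of hexadecimal digits; the
difference is whether that character is a dedicated terminator or simply
the next character of the surrounding text. The backslash-x syntax used for
the implicit form is hypothetical and is not taken from the draft.

```python
import re

HEX = "[0-9A-Fa-f]+"

# XML-style escape: the semicolon is a dedicated terminator.
EXPLICIT = re.compile(r"&#x(" + HEX + r");")

# Hypothetical escape with no terminator: it ends at the first
# character that happens not to be a hexadecimal digit.
IMPLICIT = re.compile(r"\\x(" + HEX + r")")


def expand(pattern: re.Pattern, text: str) -> str:
    """Replace each escape matched by pattern with the character it names."""
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)


# Explicit terminator: U+00E9 is followed by the literal text "cole".
print(expand(EXPLICIT, "&#xE9;cole"))

# No terminator: the "c" of "cole" is swallowed as a third hex digit,
# so the escape is read as U+0E9C instead of U+00E9.
print(expand(IMPLICIT, r"\xE9cole"))
```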

28.	Type: E 


*	3.7 Character Escaping, 6th [S] requirement, 1st sentence

	o        "[S] Escaped characters SHOULD be acceptable wherever
unescaped characters are; ..."

	o        What are "unescaped characters"?  Any character not
expressed in the escaping mechanism?  Seems to say that escaped characters
SHOULD be acceptable wherever a character is acceptable (since a character
normally is not expressed in the escaping mechanism).

	o        Is this the intent? - "[S] Escaped characters SHOULD be
acceptable wherever their unescaped forms are; ..."

29.	Type: E 


*	3.7 Character Escaping, 6th [S] requirement, last sentence

	o        "In particular, escaped characters SHOULD be acceptable in
identifiers and comments..."

	o        What if the identifier syntax is defined to be of a set
that does not include the character which is escaped?

30.	Type: E 


*	4.2.3 Fully-normalized text, 5th paragraph, last sentence

	o        "Many languages will benefit from defining more
boundaries..."

	o        It would be helpful to give examples of the "more
boundaries".

31.	Type: E 


*	4.3.1 General Examples, 3rd paragraph, 1st  sentence

	o        "The string suc¸on (U+0073 U+0075 U+0063 U+0327 U+006F
U+006E), where U+0327 is the COMBINING CEDILLA, encoded in a Unicode
encoding form, is neither ..."

	o        Should read: The string ...  "is not ...", because "neither"
has no corresponding "nor" alternative.

32.	Type: S 


*	4.3.1 General Examples, 5th paragraph, 1st  sentence

	o        "...the string suc¸on (U+0073 U+0075 U+0063 U+0327 U+006F
U+006E) which is not include-normalized ('c¸' is replaceable by 'ç')."

	o        Should it be this? - "...the string suc¸on (U+0073 U+0075
U+0063 U+0327 U+006F U+006E) which is not Unicode-normalized ('c¸' is
replaceable by 'ç')."

	

 

Regards, 
Yin Leng Husband

on behalf of Web Services Architecture WG
Received on Friday, 31 May 2002 01:19:13 UTC