Comments on Character Model for the World Wide Web 1.0: Normalization from Jim Melton on 2005-12-20 (www-i18n-comments@w3.org from December 2005)

From: Jim Melton <jim.melton@acm.org>
Date: Tue, 20 Dec 2005 16:57:44 -0700
To: www-i18n-comments@w3.org
Cc: jim.melton@acm.org, w3c-xsl-query@w3.org
Message-Id: <6.2.0.14.2.20051220152420.02b31388@rgmstimap.oraclecorp.com>

Gentlepeople,

I have found a few cycles to review the Working Draft of the Character 
Model for the World Wide Web 1.0: Normalization, (hereinafter 
"Normalization") dated 27 October, 2005.  These comments are personal, and 
do not necessarily represent the opinions of the XML Query Working Group, 
the XSL Working Group, or Oracle Corp.  If some or all of these comments 
are endorsed by any of those organizations, then you will receive them 
separately as comments from the appropriate organization.

(1) In section 2, Conformance, the list of specification conformance 
criteria includes: "make it a conformance requirement for implementations to 
conform to this document", and "make it a conformance requirement for 
content to conform to this document".  Would you clarify (perhaps only as a 
response to this message) whether or not the XQuery 1.0, XPath 2.0, and 
XSLT 2.0 suite of specifications would be cited as non-conforming to this 
specification if (as I believe to be the case) they do not contain an 
explicit statement of those two criteria?

(2) In section 3.2.3, Include-normalized text, bullet 2 uses the phrase 
"clause 1 above".  I believe that most readers will better understand your 
meaning if you replace that with "bullet 1 above" or "list item 1 
above".  To many readers, the word "clause" refers either to a major 
subdivision of a document (e.g., a chapter) or to a relatively short phrase 
such as a portion of a sentence (e.g., the noun clause).

(3) In section 3.2.4, Fully-normalized text, first numbered list, bullet 1 
says that a composing character is "the second character in the canonical 
decomposition mapping of some character".  If there are characters in 
Unicode that are made of a "base character" plus two or more composing 
characters (I cannot claim to be positive that such characters exist, but I 
think that Hangul characters are often decomposed into three or more Jamo; 
there may be other examples), then surely "a composing character" would be 
"each character after the first in the canonical decomposition mapping of 
some character".
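
As a rough illustration of the question above, here is a minimal Python 
sketch that uses the standard unicodedata module to list the full canonical 
decomposition (NFD) of two characters.  The helper name 
full_canonical_decomposition and the two sample characters are chosen purely 
for illustration; neither comes from Normalization or from UTR #15.

```python
import unicodedata

def full_canonical_decomposition(ch: str) -> list[str]:
    """Return the code points of the NFD form of a single character."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

# Illustrative samples (assumptions, not taken from the draft):
hangul = full_canonical_decomposition("\uD55C")  # HANGUL SYLLABLE HAN
latin = full_canonical_decomposition("\u1E69")   # s with dot below and dot above

print(hangul)
print(latin)

# The decomposition mapping recorded for U+1E69 itself, for comparison
# with its full (recursive) decomposition printed above.
print(unicodedata.decomposition("\u1E69"))
```

Both samples decompose fully into three characters, although the mapping 
recorded for U+1E69 itself names only two (U+1E63 and U+0307), the first of 
which decomposes further.  Whether the draft's wording is meant to rely on 
that pairwise structure is exactly what the comment asks to have made explicit.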

(4) In section 3.2.4, Fully-normalized text, first numbered list, bullet 1 
refers to "some character that is not listed in the Composition Exclusion 
Table defined in [UTR #15]". However, following the link to the most recent 
version of UTR #15, the section of that document whose title is 
"Composition Exclusion Table" contains neither a table nor a list of 
characters.  While this is an apparent failure of UTR #15, the dependence 
on that section of UTR #15 cascades that failure into 
Normalization.  However, there is (in section 6 of UTR #15) a (not terribly 
obvious) reference to "the Composition Exclusion Table [Exclusions]".  The 
References entry with that name (Exclusions) contains pointers to several 
versions of such a table, the latest of which is available at 
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt. 
It would have seemed a Very Good Idea for Normalization to point directly 
to this file, perhaps in addition to the reference directly to UTR #15 
section 6.

(5) In section 3.2.4, Fully-normalized text, second numbered list, bullet 2 
uses the phrase "clause 1 above".  I believe that most readers will better 
understand your meaning if you replace that with "bullet 1 above" or "list 
item 1 above".  To many readers, the word "clause" refers either to a major 
subdivision of a document (e.g., a chapter) or to a relatively short phrase 
such as a portion of a sentence (e.g., the noun clause).

(6) In section 3.2.4, Fully-normalized text, the paragraph beginning 
"Identification of the constructs..." includes the statement that "it is 
the responsibility of the specification for a language to specify exactly 
what constitutes a relevant construct".  Could you please clarify whether 
or not the XQuery 1.0, XPath 2.0, and XSLT 2.0 suite of specifications 
would be cited as non-conforming to this specification if (as I believe to 
be the case) they do not contain any such explicit specification?

(7) In section 3.2.7, Certified and suspect text, the NOTE begins with the 
statement "To normalize text, it is in general sufficient to store the last 
seen character...".  Perhaps I've missed something important earlier in 
this specification, but I have no idea what that statement means.  One way 
of explaining it is to use the example of text "C combining-cedilla".  When 
processing that text, I store the last seen character 
(combining-cedilla).  And, voilà, the text is normalized.  But that 
obviously is not the case.  So what does that statement mean?  Could it be 
expressed in a less ambiguous manner?
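
One possible reading of the NOTE, offered only as a guess, is that a 
normalizer holds back the last seen character so that it can still be 
combined with whatever follows.  The Python sketch below illustrates that 
reading with the "C combining-cedilla" example.  The function name 
compose_streaming is hypothetical, and the approach is deliberately 
simplified: it ignores reordering of multiple combining marks and does not 
normalize a held character on its own, so it is not a complete NFC 
implementation and is not claimed to be what the draft intends.

```python
import unicodedata

def compose_streaming(text: str) -> str:
    """Simplified sketch: keep one pending character and try to compose
    each incoming character with it before emitting anything."""
    out = []
    pending = ""  # the "last seen character", held back and not yet emitted
    for ch in text:
        if pending:
            candidate = unicodedata.normalize("NFC", pending + ch)
            if len(candidate) == 1:
                # The pair composed into one character; keep holding it.
                pending = candidate
                continue
            # No composition happened; the held character is now final.
            out.append(pending)
        pending = ch
    out.append(pending)
    return "".join(out)

result = compose_streaming("C\u0327")  # "C" followed by COMBINING CEDILLA
print(result == "\u00C7")              # compare with LATIN CAPITAL LETTER C WITH CEDILLA
```

Under this reading, storing the last seen character is a means to an end 
(deferring output until the next character is known), not the normalization 
step itself, which may be the distinction the NOTE needs to spell out.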

(8) In section 3.4, Responsibility for normalization, item C303 includes an 
Example that uses the notations "xf:concat" and "xf:substring".  In both 
cases (because this document does not define any namespace prefixes 
associated with the namespace name associated with XPath/XQuery functions), 
the "xf" should be replaced with "fn", which is the conventional prefix 
used for that namespace.

(9) In section 4, String identity matching, item C312, list item 1 includes 
the statement "In accordance with section 3 Normalization 
<http://www.w3.org/TR/2005/WD-charmod-norm-20051027/#sec-Normalization>, 
this step MUST be performed by the producers of the strings 
to be compared."  But section 3 does not make such a requirement (it did so 
in earlier drafts, but has been changed in this draft).  At the very least, 
that use of "MUST" must (pun intended) be replaced by 
"SHOULD".  Furthermore, the requirement to use "Early uniform 
normalization" might be correct because of the use of "as if" in the 
preceding paragraph, but (as section 3 makes clear) late normalization will 
produce identical results.
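
The claim that early and late normalization give the same matching result 
can be sketched in Python as follows.  The function names match_early and 
match_late are hypothetical, and NFC (via the standard unicodedata module) is 
assumed as the normalization form; this is an illustration of the argument, 
not text from Normalization.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def match_early(a: str, b: str) -> bool:
    """Early normalization: the producers are assumed to have normalized
    already, so the comparer only checks code point equality."""
    return a == b

def match_late(a: str, b: str) -> bool:
    """Late normalization: the comparer normalizes both strings itself."""
    return nfc(a) == nfc(b)

raw_a = "C\u0327"  # "C" followed by COMBINING CEDILLA
raw_b = "\u00C7"   # precomposed LATIN CAPITAL LETTER C WITH CEDILLA

early = match_early(nfc(raw_a), nfc(raw_b))  # producers normalize first
late = match_late(raw_a, raw_b)              # comparer normalizes
print(early, late)
```

Both paths report a match for these inputs; only a comparer that skips 
normalization entirely (match_early on the raw strings) would report a 
mismatch.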

(10) In appendix A, the reference to XQuery Operators includes an outdated 
list of editors.  Jonathan Robie is no longer cited as an editor of that 
specification.  Furthermore, the most recent edition is now dated 4 
November, 2005, and is a Candidate Recommendation.  (Of course, because 
Normalization was published earlier than that date, you could not have 
known this fact; the next publication of Normalization should make this 
change.)

(11) In Appendix B, the final NOTE: says that certain characters may be 
displayed as a blank or as a blank rectangle.  In some situations (e.g., 
Firefox 1.0.4 on my system without any font that covers Sinhala), a question 
mark ("?") is displayed.  It might be appropriate to include that 
possibility in this NOTE.


Hope this helps,
    Jim

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
   Co-Chair, W3C XML Query WG; F&O (etc.) editor    Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA          Personal email: jim at melton dot name
========================================================================
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================
Received on Tuesday, 20 December 2005 23:58:41 UTC