UAX #29 comments from Addison Phillips on 2008-01-22 (public-i18n-core@w3.org from January to March 2008)

From: Addison Phillips <addison@yahoo-inc.com>
Date: Tue, 22 Jan 2008 09:51:36 -0800
To: public-i18n-core@w3.org
Message-ID: <47962D28.505@yahoo-inc.com>
All,

My comments follow.

~Addison

1. The replacement of the term "grapheme cluster" with the term 
"character", starting in the introduction and proceeding through the 
document, seems to fly in the face of standard Unicode terminology and 
produces a significant problem. The term "character", as usually 
understood in Unicode contexts, refers to a logical character, i.e. a 
code point. By using the term interchangeably with "grapheme cluster", 
we introduce confusion.

I grant that, in the introduction, the unfamiliar term "grapheme 
cluster" needs to be defined and its relationship to "user-perceived 
characters" spelled out. But the wholesale use of "character" is a bad 
choice.

2. Section 3 (editorial). The sentence starting "Historically, the 
Unicode Standard originally provided for grapheme clusters" is 
redundant. Either say "historically" or say "originally".

3. Section 3 editorial note. XDGC vs. DGC. The question is whether the 
default grapheme cluster should be "redefined" to include the additional 
characters in an XDGC or whether the two should remain distinct.

On the one hand, Unicode continues to add characters, including 
combining marks, so the definition of a DGC will change over time. So I 
could envision that adding existing characters to the definition of DGC 
might not produce any more incompatible behavior than that produced by 
the encoding of additional characters.

On the other hand, it does require implementations to change their 
algorithms and data tables (beyond just importing a new 
UnicodeData.txt). I think my preference would be to make XDGC into DGC 
and then define the existing DGC as a "compatibility" or outdated variant.

4. Section 3 (editorial). Just following the Note: "A key feature... are" 
(subject-verb agreement: "feature" is singular).

5. Section 3 (editorial). The examples for locale-specific tailorings 
are in a single run-on-like sentence and probably should be separated 
around the text: "...such as collation; Thai never breaks between..."

6. Section 3 (editorial?). Under the heading "Grapheme Cluster Boundary 
Rules", the text refers to a rule "9b", but no such rule exists. This 
appears to mean rule 9a. Note that no change bars are present here!

7. Section 4 intro (editorial). The added text about search engines, 
coupled with the somewhat obscure example about database queries, 
suggests some more general rewriting is needed here.

8. Section 4 intro. All of the examples use languages that separate 
words with spaces. No mention is made of the fact that some languages 
don't use spaces between words, which I think is an extremely important 
point to make. It should be explicitly mentioned here and possibly an 
example given.

9. Section 4 (note at end). The problem with spaces in tailored word 
breaking should probably be added to the text. In particular, it should 
be pointed out (as with the Southeast Asian languages above) that the 
word break algorithm provides a "pretty good" default but that more 
complex mechanisms may be needed to do a perfect job (for cases like 
1_234,56, where _ represents a space-like character).
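
To make the distinction in comment 1 concrete, here is a minimal Python 
sketch of how a code point count can differ from a count of 
user-perceived characters. The helper name simple_clusters is 
hypothetical, and its rule (attach any code point with a nonzero 
canonical combining class to the preceding one) is a deliberate 
simplification, not the UAX #29 boundary rules.

```python
import unicodedata

def simple_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplifying assumption: only code points with a nonzero canonical
    combining class extend a cluster. The actual UAX #29 rules also
    cover cases such as CR LF and Hangul syllables.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

text = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT + "a"
print(len(text))                   # number of code points
print(len(simple_clusters(text)))  # number of simplified clusters
```

For this input the two counts differ, which is the ambiguity that 
arises if "character" is used for both notions.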



-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.
Received on Tuesday, 22 January 2008 17:51:50 UTC