- From: Addison Phillips <addison@yahoo-inc.com>
- Date: Tue, 22 Jan 2008 09:51:36 -0800
- To: public-i18n-core@w3.org
All,

My comments follow.

~Addison

1. The replacement of the term "grapheme cluster" with the term "character", starting in the introduction and proceeding through the document, seems to fly in the face of standard Unicode terminology and produces a significant problem. The term "character", as usually understood in Unicode contexts, refers to a logical character, i.e., a code point. By using the term interchangeably with "grapheme cluster", we introduce confusion. I grant that, in the introduction, the unfamiliar term "grapheme cluster" needs to be defined and its relationship to "user-perceived characters" spelled out. But the wholesale use of "character" is a bad choice.

2. Section 3 (editorial). The sentence starting "Historically, the Unicode Standard originally provided for grapheme clusters" is redundant. Either say "historically" or say "originally".

3. Section 3 editorial note. XDGC vs. DGC. The question is whether default grapheme cluster should be "redefined" to include the additional characters in an XDGC or whether the two should remain distinct.

On the one hand, Unicode continues to add characters, including combining marks, so the definition of a DGC will change over time. So I could envision that adding existing characters to the definition of DGC might not produce any more incompatible behavior than that produced by the encoding of additional characters.

On the other hand, it does require implementations to change their algorithms and data tables (beyond just importing a new UnicodeData.txt).

I think my preference would be to make XDGC into DGC and then define the existing DGC as a "compatibility" or outdated variant.

4. Section 3 (editorial). Just following the Note, the text reads "A key feature... are". The verb should agree with the singular subject.

5. Section 3 (editorial). The examples for locale-specific tailorings are in a single run-on-like sentence and probably should be separated around the text: "...such as collation; Thai never breaks between..."

6. Section 3 (editorial?). Under the heading "Grapheme Cluster Boundary Rules", the text refers to a rule "9b", but no such rule exists. This appears to mean rule 9a. Note that no change bars are present here!

7. Section 4 intro (editorial). The added text about search engines, coupled with the somewhat obscure example about database queries, suggests some more general rewriting is needed here.

8. Section 4 intro. All of the examples use space-separated languages. No mention is made of the fact that some languages don't use spaces between words, which I think is an extremely important point to make. It should be explicitly mentioned here and possibly an example given.

9. Section 4 (note at end). The problem with spaces in tailored word breaking should probably be added to the text. In particular, it should be pointed out (as with the Southeast Asian languages above) that the word break algorithm provides a "pretty good" default but that some more complex mechanisms may be needed to do a perfect job (with stuff like 1_234,56, where _ represents a space-type character).

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.
Received on Tuesday, 22 January 2008 17:51:50 UTC