
RE: [UAX29] i18n comment 1: Grapheme terminology

From: Richard Ishida <ishida@w3.org>
Date: Fri, 7 Mar 2008 14:12:53 -0000
To: <public-i18n-core@w3.org>
Message-ID: <005501c8805d$55980880$00c81980$@org>

The new text is MUCH, much better!  It eliminates "default" as part of a
name, highlights the terms, uses Grapheme Cluster for the general case and
Extended Grapheme Cluster and Legacy Grapheme Cluster for the subtypes, and
uses the general term appropriately, not as a short form.  User-perceived
character is used consistently and is defined clearly as a separate thing
from a grapheme cluster.

Last sentence in para 4 of section 3.0: clusters -> cluster

I think section 1 para 4 should say "…significant boundaries in text:
user-perceived characters, words, …"

Is it worth saying, in the initial setup, that there are *3* types of
grapheme cluster: legacy GC, extended GC, and tailored GC?  Since that's
really the division.  This may be a slightly different way of seeing the
world compared to that in the note near the end of 3.0, but I think it makes
sense.  In fact, it has already been done in table 1a.

I would suggest that the para that begins "Grapheme clusters can be tailored
to meet further requirements." could be changed to mirror earlier text with
"A *tailored grapheme cluster* uses customizations of the Unicode rules to
meet further requirements."


Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)


> -----Original Message-----
> From: public-i18n-core-request@w3.org [mailto:public-i18n-core-request@w3.org] On Behalf Of ishida@w3.org
> Sent: 07 March 2008 11:28
> To: public-i18n-core@w3.org
> Subject: [UAX29] i18n comment 1: Grapheme terminology
> Comment from the i18n review of:
> http://www.unicode.org/reports/tr29/tr29-12.html
> Comment 1
> At http://www.w3.org/International/reviews/0801-uax29/
> Editorial/substantive: E
> Tracked by: RI
> Location in reviewed document:
> 3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]
> Comment:
> "To avoid ambiguity with the computer use of the term character, this is
> called a user-perceived character or a grapheme cluster."
> Section 1 para 1 replaces 'grapheme clusters ("user-perceived
> characters")' with 'user-perceived characters', but should probably say
> 'grapheme clusters (also known as user-perceived characters)'.
> S1 para 4 replaces 'grapheme clusters (what end users usually think of as
> characters)' with just 'characters'. This is incorrect.
> S2 para1 deletes 'grapheme clusters' and leaves 'user-perceived
> characters'.
> Later we read:
> "Note: Default grapheme clusters have been referred to as"
> This could point to a problem with terminology. Is 'default grapheme
> clusters' meant to include default grapheme clusters of the extended and
> existing types? I would have thought so, but the meaning of the text is
> not clear. You'd need to say 'default grapheme clusters and extended
> default grapheme clusters' here to be clear (and elsewhere in the text,
> 4 paras later). We could rename the current 'default grapheme cluster' to
> 'minimal default grapheme cluster' and define 'default grapheme cluster'
> to refer to both the minimal and extended varieties, or you could simply
> use 'grapheme cluster' when you want to be non-specific.
> This is very inconsistent.
> We would like to see some rationalization of the terminology used
> throughout the section, and consistency in its application.
> Terms should be clearly defined, and only one term should be used for one
> concept. The definitions should be easy for the reader to locate visually,
> and compare. We suggest a mini-glossary internal to section 3 or links on
> terms to a glossary at the end of the document.
> In particular, the replacement of the term "grapheme cluster" with the term
> "character", starting in the introduction and proceeding through the
> document, seems to fly in the face of standard Unicode terminology and
> produces a significant problem. The term "character", as usually
> understood in Unicode contexts, refers to a logical character, i.e. a code
> point. By using the term interchangeably with "grapheme cluster", we
> introduce confusion.
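
The distinction drawn above between a code point and a grapheme cluster can be illustrated with a small sketch. This is only a rough approximation and is not the UAX #29 algorithm: the helper name `rough_clusters` is hypothetical, and it merely attaches nonspacing and enclosing marks (general categories Mn and Me) to the preceding code point. It ignores Hangul syllables, CR LF, joiner sequences, and every other rule in the report, so it should not be taken as a legacy or extended grapheme cluster implementation.

```python
import unicodedata


def rough_clusters(text):
    """Group each code point with the Mn/Me marks that follow it.

    Crude illustration only; NOT the UAX #29 boundary rules.
    """
    clusters = []
    for ch in text:
        # Nonspacing (Mn) and enclosing (Me) marks extend the previous cluster.
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a".
sample = "e\u0301a"

code_point_count = len(sample)        # counts code points
clusters = rough_clusters(sample)     # approximates user-perceived characters

print(code_point_count, len(clusters))
```

Here the string holds three code points but only two approximate clusters, which is the kind of mismatch that makes using "character" for both concepts confusing.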
Received on Friday, 7 March 2008 14:09:37 UTC
