
RE: [UAX29] i18n comment 1: Grapheme terminology

From: Richard Ishida <ishida@w3.org>
Date: Fri, 7 Mar 2008 16:40:17 -0000
To: "'Mark Davis'" <mark.davis@icu-project.org>
Cc: <public-i18n-core@w3.org>
Message-ID: <007e01c88071$ed692590$c83b70b0$@org>

Here are some concrete proposals for text changes (most just copied from
below):

a. Last sentence in para 4 of section 3.0: clusters -> cluster

b. section 1 para 4 should say "…significant boundaries in text:
user-perceived characters, words, …"

c. Section 3 para 6, first sentence: I suggest
"These algorithms can be adapted to produce *tailored grapheme clusters* for
specific locales or other customizations, such as the contractions used in
collation tailoring tables. Below are some examples of the differences
between these concepts."

d. I would suggest that the para that begins "Grapheme clusters can be
tailored to meet further requirements." could be changed to mirror earlier
text with "A *tailored grapheme cluster* uses customizations of the Unicode
rules to meet further requirements."

RI
============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)
 
http://www.w3.org/International/
http://rishida.net/blog/
http://rishida.net/

 

> -----Original Message-----
> From: public-i18n-core-request@w3.org [mailto:public-i18n-core-request@w3.org] On Behalf Of Richard Ishida
> Sent: 07 March 2008 14:13
> To: public-i18n-core@w3.org
> Subject: RE: [UAX29] i18n comment 1: Grapheme terminology
> 
> 
> New text is MUCH much better!  It eliminates "default" as part of a name,
> highlights the terms, uses Grapheme Cluster for the general case and
> Extended Grapheme Cluster and Legacy Grapheme Cluster for the subtypes,
> and uses the general term appropriately, not as a short form.
> User-perceived character is used consistently and defined clearly as a
> separate thing from a grapheme cluster.
> 
> Last sentence in para 4 of section 3.0: clusters -> cluster
> 
> I think section 1 para 4 should say "…significant boundaries in text:
> user-perceived characters, words, …"
> 
> Is it worth saying, in the initial setup, that there are *3* types of
> grapheme cluster: legacy GC, extended GC, and tailored GC?  Since that's
> really the division.  This may be a slightly different way of seeing the
> world compared to that in the note near the end of 3.0, but I think it
> makes sense.  In fact, it has already been done in table 1a.
> 
> I would suggest that the para that begins "Grapheme clusters can be
> tailored to meet further requirements." could be changed to mirror earlier
> text with "A *tailored grapheme cluster* uses customizations of the Unicode
> rules to meet further requirements."
> 
> RI
> 
> > -----Original Message-----
> > From: public-i18n-core-request@w3.org [mailto:public-i18n-core-request@w3.org] On Behalf Of ishida@w3.org
> > Sent: 07 March 2008 11:28
> > To: public-i18n-core@w3.org
> > Subject: [UAX29] i18n comment 1: Grapheme terminology
> >
> >
> > Comment from the i18n review of:
> > http://www.unicode.org/reports/tr29/tr29-12.html
> >
> > Comment 1
> > At http://www.w3.org/International/reviews/0801-uax29/
> > Editorial/substantive: E
> > Tracked by: RI
> >
> > Location in reviewed document:
> > 3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]
> >
> > Comment:
> > "To avoid ambiguity with the computer use of the term character, this is
> > called a user-perceived character or a grapheme cluster."
> >
> >
> > Section 1 para 1 replaces 'grapheme clusters ("user-perceived
> > characters")' with 'user-perceived characters', but should probably say
> > 'grapheme clusters (also known as user-perceived characters)'.
> >
> >
> > S1 para 4 replaces 'grapheme clusters (what end users usually think of
> > as characters)' with just 'characters'. This is incorrect.
> >
> >
> > S2 para1 deletes 'grapheme clusters' and leaves 'user-perceived
> > characters'.
> >
> >
> > Later we read:
> >
> >
> > "Note: Default grapheme clusters have been referred to as"
> >
> >
> > This could point to a problem with terminology. Is 'default grapheme
> > clusters' meant to include default grapheme clusters of the extended and
> > existing types? I would have thought so, but the meaning of the text is
> > not clear. You'd need to say 'default grapheme clusters and extended
> > default grapheme clusters' here to be clear (and elsewhere in the text,
> > e.g. 4 paras later). We could rename the current 'default grapheme
> > cluster' to 'minimal default grapheme cluster' and define 'default
> > grapheme cluster' to refer to both the minimal and extended varieties, or
> > you could simply use 'grapheme cluster' when you want to be non-specific.
> >
> >
> > This is very inconsistent.
> >
> >
> > We would like to see some rationalization of the terminology used
> > throughout the section, and consistency in its application.
> >
> >
> > Terms should be clearly defined, and only one term should be used for
> > one concept. The definitions should be easy for the reader to locate
> > visually, and compare. We suggest a mini-glossary internal to section 3
> > or links on terms to a glossary at the end of the document.
> >
> >
> > In particular, the replacement of the term "grapheme cluster" with the
> > term "character", starting in the introduction and proceeding through the
> > document, seems to fly in the face of standard Unicode terminology and
> > produces a significant problem. The term "character", as usually
> > understood in Unicode contexts, refers to a logical character, i.e. a
> > code point. By using the term interchangeably with "grapheme cluster", we
> > introduce confusion.
> >
> >
> 
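
The distinction the original comment draws between a "character" (a code
point) and a "grapheme cluster" (a user-perceived character) can be
illustrated with a small sketch. The helper name approx_clusters is
hypothetical, and its grouping rule (a base code point plus any combining
marks that follow it) is a deliberate simplification, not the UAX #29
boundary rules, which cover many more cases.

```python
import unicodedata


def approx_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified illustration only: this is NOT the UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        # A non-zero canonical combining class marks a combining character,
        # which attaches to the preceding cluster if there is one.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# 'e' + COMBINING ACUTE ACCENT + 'a': three code points,
# but a reader perceives two characters.
text = "e\u0301a"
code_point_count = len(text)
cluster_count = len(approx_clusters(text))
```

Here code_point_count counts code points while cluster_count counts the
approximate user-perceived characters, so the two numbers differ for the
sample string. This is the ambiguity that arises if "character" is used for
both concepts.
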
Received on Friday, 7 March 2008 16:37:06 GMT
