Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters from Richard Ishida on 2014-02-21 (www-international@w3.org from January to March 2014)

From: Richard Ishida <ishida@w3.org>
Date: Fri, 21 Feb 2014 13:53:33 +0000
CC: "CSS WWW Style (www-style@w3.org)" <www-style@w3.org>, www International <www-international@w3.org>
Message-ID: <53075A5D.3010204@w3.org>
On the subject of grapheme clusters, rather than characters, it may help
to note the Unicode Standard definitions here:

====
*Grapheme*. (1) A minimally distinctive unit of writing in the context 
of a particular writing system. For example, ‹b› and ‹d› are distinct 
graphemes in English writing systems because there exist distinct words 
like big and dig. Conversely, a lowercase italiform letter a and a 
lowercase Roman letter a are not distinct graphemes because no word is 
distinguished on the basis of these two different forms. (2) What a user 
thinks of as a character.

*Grapheme Cluster*. The text between grapheme cluster boundaries as 
specified by Unicode Standard Annex #29, "Unicode Text Segmentation." 
(See definition D60 in Section 3.6, Combination.) A grapheme cluster 
represents a horizontally segmentable unit of text, consisting of some 
grapheme base (which may consist of a Korean syllable) together with any 
number of nonspacing marks applied to it.
====
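
As a rough illustration of the glossary wording above (a base followed by
any number of nonspacing marks), here is a minimal Python sketch. It is
not an implementation of the UAX #29 boundary rules: it only attaches
code points of general category Mn to the preceding code point, and the
function name approximate_clusters is made up for this example. It does
not handle conjoining Hangul jamo, spacing marks, or any tailoring.

```python
import unicodedata


def approximate_clusters(text):
    """Group each code point with the nonspacing marks (Mn) that follow it.

    Simplified illustration only; real segmentation follows UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            # Nonspacing mark: attach it to the current cluster.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters


# "G" followed by U+0301 COMBINING ACUTE ACCENT, then "o"
sample = "G\u0301o"
print(len(sample))                        # number of code points
print(len(approximate_clusters(sample)))  # number of approximate clusters
```

With this input the string holds three code points but the sketch
reports two clusters, which is the gap between "character" in the
code point sense and what a reader would count.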

The text in the spec, "A grapheme cluster is what a language user
considers to be a character or a basic unit of the script.", is
incorrect. What a user considers to be a basic unit of the script is a
grapheme. A grapheme cluster is a construct with a specific description
that tries to approximate user-perceived graphemes (and signally
fails in some contexts).

If you want a vague term in the spec to refer to something that includes
both grapheme clusters and characters, why not use 'grapheme' rather
than 'character'?

RI


On 24/01/2014 22:26, Phillips, Addison wrote:
>> The definition of "grapheme cluster" in the Unicode Glossary defers to UAX 29,
>> but the current revision (23) of that UAX doesn't actually have a formal
>> definition of "grapheme cluster", except as a cover term for default grapheme
>> clusters, extended grapheme clusters, and tailored grapheme clusters, which
>> *are* defined.
>>
>> It does, however, introduce the informal term "user-perceived character", and
>> says that grapheme clusters (by implication, of one of the above
>> varieties) are an approximation to user-perceived characters.
>
> The specific quote I think you refer to is:
>
> --
> It is important to recognize that what the user thinks of as a "character"—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
> --
>
>>
>> This seems to me like good terminology to follow.
>>
>
> The challenge here is that Unicode (and CSS) both define the term "character" to have a specific meaning equivalent to a Unicode codepoint, i.e. the "computer use" of the term. CSS3 Text, however, attempts to redefine and then use the term "character" to also mean a "user-perceived character". The use of the word "character" after that point is somewhat haphazard, leading to a number of problems in understanding the spec. Our primary comment is that we'd prefer to see a term other than (unadorned) "character" used where "user-perceived character" is intended.
>
> I agree that we could use "user-perceived character" instead of "grapheme cluster". My reservation about that is that a "grapheme cluster" (of various flavors and stripes) can be "determined programmatically", which is a consideration for implementation. If the "user-perceived character" cannot be determined programmatically, it is not possible to do much with it in terms of CSS. Hence, I think using the [whatever] "grapheme cluster" terminology is useful here because that is the unit that CSS will actually operate on in the cases where "user-perceived character" is intended.
>
> The ending part of my comment (which grew out of WG discussion):
>
>>      ... Rather,  we should say that applications sometimes require additional
>>      rules beyond the use of 'grapheme clusters' in order to handle
>>      the typographic traditions of particular scripts.
>
> ... suggests that some scripts require "tailored grapheme clusters" (we're aware of claims of Indic script or language requirements in this regard) but for which there is no fully-defined tailoring to point to.
>
> HTH,
>
> Addison
>
>
Received on Friday, 21 February 2014 13:54:01 UTC