RE: [css-text] I18N-ISSUE-313: Definition of grapheme clusters

> The definition of "grapheme cluster" in the Unicode Glossary defers to UAX 29,
> but the current revision (23) of that UAX doesn't actually have a formal
> definition of "grapheme cluster", except as a cover term for default grapheme
> clusters, extended grapheme clusters, and tailored grapheme clusters, which
> *are* defined.
> 
> It does, however, introduce the informal term "user-perceived character", and
> says that grapheme clusters (by implication, of one of the above
> varieties) are an approximation to user-perceived characters.

I think the specific quote you are referring to is:

--
It is important to recognize that what the user thinks of as a "character"—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
--

> 
> This seems to me like good terminology to follow.
> 

The challenge here is that Unicode and CSS both define the term "character" to have a specific meaning equivalent to a Unicode code point, i.e. the "computer use" of the term. CSS3 Text, however, attempts to redefine the term "character" so that it also means a "user-perceived character". After that point the spec uses the word "character" somewhat haphazardly, which makes it hard to tell which meaning is intended in a given passage. Our primary comment is that we'd prefer to see a term other than (unadorned) "character" used where "user-perceived character" is intended.

I agree that we could use "user-perceived character" instead of "grapheme cluster". My reservation is that a "grapheme cluster" (of various flavors and stripes) can be "determined programmatically", which matters for implementation. If the "user-perceived character" cannot be determined programmatically, CSS cannot do much with it. Hence, I think the [whatever] "grapheme cluster" terminology is useful here, because that is the unit CSS will actually operate on in the cases where "user-perceived character" is intended.
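To make "determined programmatically" concrete, here is a rough sketch in Python. It is not the UAX 29 algorithm: the function name approx_clusters is made up for illustration, and the only rule it applies is "attach a combining mark (general category M*) to the code point before it". Real default or extended grapheme clusters also need rules for Hangul syllables, CR LF, ZWJ sequences and so on, so treat this as an approximation of an approximation.

```python
import unicodedata


def approx_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified stand-in for grapheme cluster segmentation: only
    combining marks (general category Mn, Mc, Me) are handled.
    This is an assumption made for illustration, not UAX 29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # Combining mark: extend the cluster already in progress.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters


# "G" + U+0301 COMBINING ACUTE ACCENT, followed by "o"
sample = "G\u0301o"
code_points = len(sample)            # counts code points
clusters = approx_clusters(sample)   # counts approximate clusters
print(code_points, len(clusters))
```

For the sample string, the code point count and the cluster count differ, which is exactly the gap between the "computer use" of "character" and the user-perceived one.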

The ending part of my comment (which grew out of WG discussion):

>     ... Rather,  we should say that applications sometimes require additional
>     rules beyond the use of 'grapheme clusters' in order to handle
>     the typographic traditions of particular scripts.

... suggests that some scripts require "tailored grapheme clusters" (we're aware of claims of Indic script or language requirements in this regard), but that there is no fully defined tailoring to point to for them.

HTH,

Addison

Received on Friday, 24 January 2014 22:27:21 UTC