Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters from Richard Ishida on 2014-05-22 (www-international@w3.org from April to June 2014)

From: Richard Ishida <ishida@w3.org>
Date: Thu, 22 May 2014 19:09:40 +0100
To: www-international@w3.org
CC: "CSS WWW Style (www-style@w3.org)" <www-style@w3.org>
Message-ID: <537E3D64.6020801@w3.org>
Thank you for reworking section 1.3.1 in the latest editor's version 
(dated 20 March 2014).  It's great to move away from the vague 
definition of character that we had before, but I'm still concerned 
about the text for a couple of reasons.

One is that, as I mentioned already, it is not correct to say 'the 
"user-perceived character", also known as the grapheme cluster.'  The 
equivalent term for a user-perceived character is 'grapheme'.  The 
'grapheme cluster' is a unit derived from rules in Unicode to yield an 
*approximation* to a user-perceived character.  Not all user-perceived 
characters are grapheme clusters.

Another is a worry whether we can really effectively split the world 
into semantically-perceived and visually-perceived characters - 
especially given the 'etc' that appears in the definition where we list 
appropriate operations for each. For example, are we sure that 
first-letter operations require semantically- rather than 
visually-perceived characters in all cases?  Where does cursor movement 
fit here? etc.

What about Arabic justification, which may involve increasing 
word-internal 'gaps' that occur due to one glyph not joining with the 
following glyph. These are relevant units for justification of Arabic 
text, but they aren't user-perceived characters.

And what about the case where Indic script text units vary according to 
the font in use?  As I understand it, a text unit for wrapping or 
stretching in Devanagari can encompass a CvCVD (consonant, virama, 
consonant, vowel sign, diacritic) only if the font has glyphs to show 
this is a single visual unit (eg. ligatures, half-forms, special glyphs) 
and hides the virama. If the font is changed, such that the virama 
becomes visible, we are now dealing with two text units.  This 
font-specific behaviour for the same sequence of code points is a 
contextual difference that, I think, cuts across both the semantic- and 
visual- categories currently defined.
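
As an illustration only, the font dependence described above could be sketched in Python roughly as follows. The function name `devanagari_units` and the flag `font_forms_conjunct` are hypothetical, and the mark-attachment rule (general category M*) is a deliberate simplification rather than the UAX #29 algorithm or any real layout engine's behaviour.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def devanagari_units(text, font_forms_conjunct):
    """Split text into units for wrapping or stretching.

    font_forms_conjunct is a hypothetical flag: True if the font renders
    consonant + virama + consonant as a single visual unit and hides the
    virama, False if the virama stays visible.
    """
    units = []
    for ch in text:
        # Combining marks (vowel signs, diacritics, virama) attach to
        # the unit that precedes them.
        is_mark = unicodedata.category(ch).startswith("M")
        # A consonant following a virama joins the previous unit only
        # when the font can show the conjunct as one visual unit.
        joins_conjunct = (
            font_forms_conjunct
            and bool(units)
            and units[-1].endswith(VIRAMA)
        )
        if units and (is_mark or joins_conjunct):
            units[-1] += ch
        else:
            units.append(ch)
    return units

# CvCVD: KA, VIRAMA, SSA, VOWEL SIGN I, CANDRABINDU
sample = "\u0915\u094D\u0937\u093F\u0901"
print(len(devanagari_units(sample, font_forms_conjunct=True)))   # 1
print(len(devanagari_units(sample, font_forms_conjunct=False)))  # 2
```

The same five code points yield one unit or two depending solely on the font flag, which is the contextual difference in question.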

I think that actually all we may be trying to say is that the atomic 
unit of text for a particular operation may not be the same as for 
another, but that we start from a base of grapheme clusters and require 
the application to take into account variances and extensions of that as 
needed.  What if we simply talk in terms of vague 'typographic units', 
or 'text units', or some such, but describe up front how these can be 
different sequences of code points depending on the operation to be 
performed (ie. not try to define just two specific scenarios)?

To help with that, I propose the following text for section 1.3.1 to 
replace the 2nd paragraph and the DL list.


=====================================================
For text layout the appropriate atomic units of text may include more 
than one Unicode code point. Often these text units correspond to 
*graphemes*, ie. what a language user (as opposed to a computer 
programmer) considers to be a character or basic unit of the script. 
Unfortunately, the appropriate units may be different for the same 
sequence of Unicode code points according to the operation which is being 
performed, or according to the visual context. (For example, 
line-breaking and letter-spacing may interpret a sequence of Thai 
characters that include U+0E33 THAI CHARACTER SARA AM differently; or 
the behaviour of a conjunct consonant in a script such as Devanagari may 
depend on the font in use).

The Unicode specification defines various combinations of code points as 
forming *extended grapheme clusters*. This is an attempt to indicate 
what users perceive as characters, and the term is described in detail 
in Unicode Standard Annex #29, Unicode Text Segmentation [UAX29]. Much 
of the time this produces the necessary text units for layout; however, 
it is only an approximation and in some cases, such as those mentioned 
above, additional rules need to be applied by the application to tailor 
the definition of the text unit appropriately for the context.

Applications need to be aware of the typographic rules that must be used 
to determine units of text for a given operation on a particular script, 
and apply them to achieve the appropriate segmentation of the text for 
that operation.
=====================================================
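
A minimal Python sketch of the distinction between code points and text units drawn in the proposed text. The helper `approx_clusters` is hypothetical and only attaches combining marks (general category M*) to the preceding character; it is a simplification for illustration and does not implement the extended grapheme cluster rules of UAX #29.

```python
import unicodedata

def approx_clusters(text):
    """Group each combining mark with the character before it.

    A deliberate simplification, NOT the UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

g_acute = "G\u0301"  # LATIN CAPITAL LETTER G + COMBINING ACUTE ACCENT
print(len(g_acute))                   # 2 code points
print(len(approx_clusters(g_acute)))  # 1 text unit

# THAI CHARACTER NO NU + THAI CHARACTER SARA AM. SARA AM has general
# category Lo, so this naive rule keeps it as a separate unit.
thai = "\u0E19\u0E33"
print(len(approx_clusters(thai)))     # 2 text units under this rule
```

The Thai case shows why a single fixed rule is not enough: whether SARA AM belongs with the preceding consonant is exactly the kind of decision that may differ by operation and therefore needs tailoring.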



RI





PS: Btw, the definition of semantically-perceived character has a 
sentence that says that tailoring may be necessary.  Doesn't this also 
apply to visually-perceived characters (eg. in the Thai case)?








On 21/02/2014 13:53, Richard Ishida wrote:
> On the subject of grapheme clusters, rather than characters, it may help
> to note the Unicode Standard definitions here:
>
> ====
> *Grapheme*. (1) A minimally distinctive unit of writing in the context
> of a particular writing system. For example, ‹b› and ‹d› are distinct
> graphemes in English writing systems because there exist distinct words
> like big and dig. Conversely, a lowercase italiform letter a and a
> lowercase Roman letter a are not distinct graphemes because no word is
> distinguished on the basis of these two different forms. (2) What a user
> thinks of as a character.
>
> *Grapheme Cluster*. The text between grapheme cluster boundaries as
> specified by Unicode Standard Annex #29, "Unicode Text Segmentation."
> (See definition D60 in Section 3.6, Combination.) A grapheme cluster
> represents a horizontally segmentable unit of text, consisting of some
> grapheme base (which may consist of a Korean syllable) together with any
> number of nonspacing marks applied to it.
> ======
>
> The text in the spec "A grapheme cluster is what a language user
> considers to be a character or a basic unit of the script." is
> incorrect. What a user considers to be a basic unit of the script is a
> grapheme.  A grapheme cluster is a construct with a specific description
> that tries to approximate to the user-perceived graphemes (and signally
> fails in some contexts).
>
> If you want a vague term to refer to something that includes grapheme
> clusters and characters in the spec, why not use 'grapheme' rather than
> 'character'?
>
> RI
>
>
> On 24/01/2014 22:26, Phillips, Addison wrote:
>>> The definition of "grapheme cluster" in the Unicode Glossary defers to
>>> UAX 29, but the current revision (23) of that UAX doesn't actually have
>>> a formal definition of "grapheme cluster", except as a cover term for
>>> default grapheme clusters, extended grapheme clusters, and tailored
>>> grapheme clusters, which *are* defined.
>>>
>>> It does, however, introduce the informal term "user-perceived
>>> character", and says that grapheme clusters (by implication, of one of
>>> the above varieties) are an approximation to user-perceived characters.
>>
>> The specific quote I think you refer to is:
>>
>> --
>> It is important to recognize that what the user thinks of as a
>> "character"—a basic unit of a writing system for a language—may not be
>> just a single Unicode code point. Instead, that basic unit may be made
>> up of multiple Unicode code points. To avoid ambiguity with the
>> computer use of the term character, this is called a user-perceived
>> character. For example, “G” + acute-accent is a user-perceived
>> character: users think of it as a single character, yet is actually
>> represented by two Unicode code points. These user-perceived
>> characters are approximated by what is called a grapheme cluster,
>> which can be determined programmatically.
>> --
>>
>>>
>>> This seems to me like good terminology to follow.
>>>
>>
>> The challenge here is that Unicode (and CSS) both define the term
>> "character" to have a specific meaning equivalent to a Unicode
>> codepoint, i.e. the "computer use" of the term. CSS3 Text, however,
>> attempts to redefine and then use the term "character" to also mean a
>> "user-perceived character". The use of the word "character" after that
>> point is somewhat haphazard, leading to a number of problems in
>> understanding the spec. Our primary comment is that we'd prefer to see
>> a term other than (unadorned) "character" used where "user-perceived
>> character" is intended.
>>
>> I agree that we could use "user-perceived character" instead of
>> "grapheme cluster". My reservation about that is that a "grapheme
>> cluster" (of various flavors and stripes) can be "determined
>> programmatically", which is a consideration for implementation. If the
>> "user-perceived character" cannot be determined programmatically, it
>> is not possible to do much with it in terms of CSS. Hence, I think
>> using the [whatever] "grapheme cluster" terminology is useful here
>> because that is the unit that CSS will actually operate on in the
>> cases where "user-perceived character" is intended.
>>
>> The ending part of my comment (which grew out of WG discussion):
>>
>>>      ... Rather, we should say that applications sometimes require
>>>      additional rules beyond the use of 'grapheme clusters' in order to
>>>      handle the typographic traditions of particular scripts.
>>
>> ... suggests that some scripts require "tailored grapheme clusters"
>> (we're aware of claims of Indic script or language requirements in
>> this regard) but for which there is no fully-defined tailoring to
>> point to.
>>
>> HTH,
>>
>> Addison
>>
>>
>
>
Received on Thursday, 22 May 2014 18:10:13 UTC