Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters

Thank you for your work on this. The i18n WG is now happy to close this 
issue.

RI


>> On 24/01/2014 22:26, Phillips, Addison wrote:
>>>> The definition of "grapheme cluster" in the Unicode Glossary defers
>>>> to UAX 29, but the current revision (23) of that UAX doesn't actually
>>>> have a formal definition of "grapheme cluster", except as a cover term
>>>> for default grapheme clusters, extended grapheme clusters, and tailored
>>>> grapheme clusters, which *are* defined.
>>>>
>>>> It does, however, introduce the informal term "user-perceived
>>>> character", and says that grapheme clusters (by implication, of one
>>>> of the above varieties) are an approximation to user-perceived
>>>> characters.
>>>
>>> The specific quote I think you refer to is:
>>>
>>> --
>>> It is important to recognize that what the user thinks of as a
>>> "character"—a basic unit of a writing system for a language—may not be
>>> just a single Unicode code point. Instead, that basic unit may be made
>>> up of multiple Unicode code points. To avoid ambiguity with the
>>> computer use of the term character, this is called a user-perceived
>>> character. For example, “G” + acute-accent is a user-perceived
>>> character: users think of it as a single character, yet is actually
>>> represented by two Unicode code points. These user-perceived
>>> characters are approximated by what is called a grapheme cluster,
>>> which can be determined programmatically.
>>> --
>>>
>>>>
>>>> This seems to me like good terminology to follow.
>>>>
>>>
>>> The challenge here is that Unicode and CSS both define the term
>>> "character" to have a specific meaning equivalent to a Unicode
>>> code point, i.e. the "computer use" of the term. CSS3 Text, however,
>>> attempts to redefine and then use the term "character" to also mean a
>>> "user-perceived character". The use of the word "character" after that
>>> point is somewhat haphazard, leading to a number of problems in
>>> understanding the spec. Our primary comment is that we'd prefer to see
>>> a term other than (unadorned) "character" used where "user-perceived
>>> character" is intended.
>>>
>>> I agree that we could use "user-perceived character" instead of
>>> "grapheme cluster". My reservation about that is that a "grapheme
>>> cluster" (of various flavors and stripes) can be "determined
>>> programmatically", which is a consideration for implementation. If the
>>> "user-perceived character" cannot be determined programmatically, it
>>> is not possible to do much with it in terms of CSS. Hence, I think
>>> using the [whatever] "grapheme cluster" terminology is useful here
>>> because that is the unit that CSS will actually operate on in the
>>> cases where "user-perceived character" is intended.
>>>
>>> The ending part of my comment (which grew out of WG discussion):
>>>
>>>>      ... Rather, we should say that applications sometimes require
>>>>      additional rules beyond the use of 'grapheme clusters' in order
>>>>      to handle the typographic traditions of particular scripts.
>>>
>>> ... suggests that some scripts require "tailored grapheme clusters"
>>> (we're aware of claims of Indic script or language requirements in
>>> this regard) but that there is no fully defined tailoring to
>>> point to.
>>>
>>> HTH,
>>>
>>> Addison
>>>
>>>
>>
>>
>
>
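
For readers following the quoted thread, the gap between a code point and a
user-perceived character can be sketched in a few lines of Python. This is
only an illustration, not the UAX 29 algorithm: the helper name
approximate_clusters is hypothetical, and it assumes the crude rule that a
code point with a non-zero canonical combining class attaches to the
preceding cluster. Real grapheme cluster segmentation also covers Hangul
syllables, emoji sequences, and other cases that this sketch ignores.

```python
import unicodedata


def approximate_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplifying assumption (not UAX 29): a code point whose canonical
    combining class is non-zero is appended to the previous cluster;
    every other code point starts a new cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "G" followed by U+0301 COMBINING ACUTE ACCENT, then "o".
sample = "G\u0301o"

print(len(sample))                        # number of code points
print(len(approximate_clusters(sample)))  # number of approximate clusters
```

The first count is the "computer use" of the term character; the second is
closer to what a user would count, which is the unit the thread argues CSS
should name explicitly.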

Received on Thursday, 7 August 2014 13:36:17 UTC