
[css-text] I18N-ISSUE-313: Definition of grapheme clusters

From: Phillips, Addison <addison@lab126.com>
Date: Fri, 24 Jan 2014 18:19:21 +0000
To: "CSS WWW Style (www-style@w3.org)" <www-style@w3.org>
CC: www International <www-international@w3.org>
Message-ID: <7C0AF84C6D560544A17DDDEB68A9DFB517C8E88C@ex10-mbx-36009.ant.amazon.com>
State:
    OPEN WG Comment
Product:
    CSS3-text
Raised by:
    Richard Ishida
Opened on:
    2013-12-11
Description:
    1.3. Terminology
    http://www.w3.org/TR/css3-text/#terms


    "A grapheme cluster is what a language user considers to be a character or a basic unit of the script."
    "The UA may further tailor the definition as required by typographical tradition."
    Example 1

    I think a grapheme cluster should be defined in the CSS spec as follows: A grapheme cluster is a sequence of characters as defined by the Unicode specification that should be treated as a unit for typographic processing. This generally approximates to what a language user considers to be a letter or basic unit of the script.

    I don't think applications should redefine what a grapheme cluster is; that definition is established by the Unicode standard. Rather, we should say that applications sometimes require additional rules beyond the use of 'grapheme clusters' in order to handle the typographic traditions of particular scripts.

    An appropriate example for this section of where further rules are needed is that of Devanagari syllables, where the grapheme cluster only includes part of the syllable. For an example, see the last picture on the page at http://rishida.net/docs/unicode-tutorial/part3#graphemes and the text below it. For most operations that rely on grapheme clusters, Devanagari needs additional rules to keep together the whole typographic syllable. This issue is relevant for a large proportion of complex scripts.
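
    To illustrate the gap described above, here is a minimal Python sketch. It assumes a deliberately simplified cluster rule (a base character plus any following combining marks), not the full Unicode segmentation algorithm, and the helper names simple_clusters and typographic_syllables are hypothetical. The sample word is an arbitrary conjunct chosen for illustration, not the one shown on the linked page.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def simple_clusters(text):
    """Approximate clusters: a base character plus following combining marks.

    Assumption: this uses only the General_Category (Mn, Mc, Me) and is
    not a full implementation of the Unicode grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def typographic_syllables(text):
    """Hypothetical extra rule: join a cluster ending in virama to a following letter."""
    units = []
    for cluster in simple_clusters(text):
        joins_previous = (
            units
            and units[-1].endswith(VIRAMA)
            and unicodedata.category(cluster[0]) == "Lo"
        )
        if joins_previous:
            units[-1] += cluster
        else:
            units.append(cluster)
    return units

# KA + VIRAMA + SSA + VOWEL SIGN I: a single conjunct syllable
word = "\u0915\u094D\u0937\u093F"
print(simple_clusters(word))        # two units: the syllable is split after the virama
print(typographic_syllables(word))  # one unit: the whole syllable is kept together
```

    Under these assumptions the cluster rule alone splits the syllable in two, and only the additional script-specific rule keeps it whole, which is the kind of layering the comment argues for.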

    I think that the example of the Thai behaviour may be better as a note in the letter-spacing and justification sections, especially since I believe that the behaviour described is not relevant for line breaking and other operations.

    It may be worth mentioning, also, that although the Thai examples show that U+0E33 THAI CHARACTER SARA AM needs to be decomposed first, the desired behaviour still relies on correct application of the standard grapheme cluster rules thereafter to ensure that the small circle resulting from the decomposition stays with the base character and other associated diacritics.
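
    The interaction described above can be sketched in Python as follows. This is an illustration under stated assumptions: decompose_sara_am and spacing_units are hypothetical helpers, the cluster rule is simplified to "a base character plus following nonspacing marks" rather than the full Unicode rules, and the sample word is chosen only for illustration.

```python
import unicodedata

SARA_AM = "\u0E33"   # THAI CHARACTER SARA AM
NIKHAHIT = "\u0E4D"  # THAI CHARACTER NIKHAHIT (the small circle)
SARA_AA = "\u0E32"   # THAI CHARACTER SARA AA

def decompose_sara_am(text):
    """Replace SARA AM with NIKHAHIT + SARA AA (its compatibility decomposition)."""
    return text.replace(SARA_AM, NIKHAHIT + SARA_AA)

def spacing_units(text):
    """Simplified cluster rule (an assumption): base plus following nonspacing marks."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch) == "Mn":
            units[-1] += ch
        else:
            units.append(ch)
    return units

# NO NU + MAI THO (tone mark) + SARA AM
word = "\u0E19\u0E49\u0E33"

decomposed = decompose_sara_am(word)
units = spacing_units(decomposed)

# After decomposition the nikhahit attaches to the base consonant and its
# tone mark, while SARA AA becomes a separate unit that spacing can follow.
print([[f"U+{ord(c):04X}" for c in unit] for unit in units])
```

    With this sketch the decomposed word yields two units: the consonant with its tone mark and nikhahit, then SARA AA on its own. The decomposition step alone does not keep the circle attached; the cluster rule applied afterwards does.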
Received on Friday, 24 January 2014 18:20:59 UTC
