
I18N-ISSUE-313: Definition of grapheme clusters [.prep-CSS3-text]

From: Internationalization Working Group Issue Tracker <sysbot+tracker@w3.org>
Date: Wed, 11 Dec 2013 11:40:43 +0000
Message-Id: <E1Vqi9j-00017U-Ve@stuart.w3.org>
To: www-international@w3.org

http://www.w3.org/International/track/issues/313

Raised by: Richard Ishida
On product: .prep-CSS3-text

1.3. Terminology
http://www.w3.org/TR/css3-text/#terms

"A grapheme cluster is what a language user considers to be a character or a basic unit of the script."
"The UA may further tailor the definition as required by typographical tradition."
Example 1

I think the CSS spec should define a grapheme cluster as follows: "A grapheme cluster is a sequence of characters, as defined by the Unicode specification, that should be treated as a unit for typographic processing. This generally approximates to what a language user considers to be a letter or basic unit of the script."

I don't think applications should redefine what a grapheme cluster is; that definition is established by the Unicode standard. Rather, we should say that applications sometimes require additional rules beyond the use of 'grapheme clusters' in order to handle the typographic traditions of particular scripts.

An appropriate example for this section of where further rules are needed is that of Devanagari syllables, where the grapheme cluster only includes part of the syllable. For an example, see the last picture on the page at http://rishida.net/docs/unicode-tutorial/part3#graphemes and the text below it. For most operations that rely on grapheme clusters, Devanagari needs additional rules to keep together the whole typographic syllable. This issue is relevant for a large proportion of complex scripts.
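
To make the Devanagari point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the standard library has no Unicode grapheme cluster segmenter, so the hypothetical helper `approx_clusters` only approximates one by attaching combining marks (general category M*) to the preceding base, and the hypothetical `merge_conjuncts` stands in for the kind of additional rule an application might layer on top. It does not reproduce the full Unicode rules, and it ignores the less common consonants outside U+0915..U+0939.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def approx_clusters(text):
    """Rough stand-in for grapheme clusters: a base plus any following marks.

    Assumption: this is NOT the full Unicode segmentation algorithm, only a
    simplification that is good enough for the sample word below.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # combining mark: attach to the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


def is_devanagari_consonant(ch):
    """True for the basic consonants KA..HA (simplified range)."""
    return "\u0915" <= ch <= "\u0939"


def merge_conjuncts(clusters):
    """Hypothetical extra rule: join a cluster ending in virama to the
    consonant-initial cluster that follows, so the syllable stays whole."""
    merged = []
    for cluster in clusters:
        if (merged and merged[-1].endswith(VIRAMA)
                and is_devanagari_consonant(cluster[0])):
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged


# KA + VIRAMA + SSA + VOWEL SIGN I: one typographic syllable
word = "\u0915\u094D\u0937\u093F"
print(approx_clusters(word))                   # two units: KA+VIRAMA, SSA+I
print(merge_conjuncts(approx_clusters(word)))  # one unit: the whole syllable
```

With cluster-style segmentation alone the syllable splits after the virama; only the extra rule keeps it together, which is the gap described above.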

I think that the example of the Thai behaviour may be better as a note in the letter-spacing and justification sections, especially since I believe that the behaviour described is not relevant for line breaking and other operations.

It may be worth mentioning, also, that although the Thai examples show that U+0E33 THAI CHARACTER SARA AM needs to be decomposed first, the desired behaviour still relies on correct application of the standard grapheme cluster rules thereafter to ensure that the small circle resulting from the decomposition stays with the base character and other associated diacritics.  
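
A small Python sketch of that sequence, under stated assumptions: the decomposition of U+0E33 is obtained here from its Unicode compatibility mapping via `unicodedata.normalize("NFKD", ...)`, which is one way to get NIKHAHIT + SARA AA and not necessarily what the CSS spec or any UA does. The helper `approx_clusters` is hypothetical and only approximates grapheme clusters by attaching combining marks to the preceding base. Any visual reordering of the circle relative to the tone mark is out of scope.

```python
import unicodedata

SARA_AM = "\u0E33"  # THAI CHARACTER SARA AM


def decompose_sara_am(text):
    """Replace SARA AM with its compatibility decomposition
    (U+0E4D NIKHAHIT + U+0E32 SARA AA), leaving other characters alone."""
    return text.replace(SARA_AM, unicodedata.normalize("NFKD", SARA_AM))


def approx_clusters(text):
    """Rough stand-in for grapheme clusters: a base plus any following marks.

    Assumption: a simplification, not the full Unicode segmentation rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # combining mark stays with its base
        else:
            clusters.append(ch)
    return clusters


# NO NU + MAI THO (tone mark) + SARA AM
word = "\u0E19\u0E49\u0E33"
decomposed = decompose_sara_am(word)
units = approx_clusters(decomposed)

# Show each unit as code points to avoid depending on font rendering.
for unit in units:
    print(" ".join(f"U+{ord(c):04X}" for c in unit))
```

After decomposition, the circle (NIKHAHIT) lands in the same unit as the base consonant and its tone mark, while SARA AA forms a separate unit, so spacing inserted between units falls between the two parts of the former SARA AM.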
Received on Wednesday, 11 December 2013 11:40:48 UTC
