
[UAX29] i18n comment 8: Conjunct clusters

From: <ishida@w3.org>
Date: Fri, 07 Mar 2008 11:33:37 +0000
To: public-i18n-core@w3.org
Message-Id: <20080307113010.E27294F118@homer.w3.org>

Comment from the i18n review of:
http://www.unicode.org/reports/tr29/tr29-12.html

Comment 8
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: S
Tracked by: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment: 
We don't think that extending default grapheme clusters to incorporate only spacing marks goes far enough to actually provide better results for a very large proportion of the world's population. We feel that the Unicode TC should conduct further research on how to extend default grapheme clusters so that they incorporate the majority of Indic and South-East Asian syllables.

 
Example: It is very common to have a sequence such as consonant+virama+consonant+vowel_sign, e.g.

 
0938: स DEVANAGARI LETTER SA
094D: ् DEVANAGARI SIGN VIRAMA
0925: थ DEVANAGARI LETTER THA
093F: ि DEVANAGARI VOWEL SIGN I

 
See this as it would be rendered [http://www.w3.org/International/reviews/0601-css3-selectors/sthiti.gif]. 

 
Without tailoring, the current rules would allow text to wrap with the THA moved to the next line, or allow a selection to highlight only part of the conjunct. The basic unit for grapheme clusters in Indic and South-East Asian scripts is the syllable, and addressing spacing marks alone will still leave you short of a useful solution.
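The split described above can be illustrated with a rough sketch in Python. This is not the UAX #29 algorithm: it approximates grapheme extenders by general category only (Mn and Me, plus Mc when spacing marks are included), and the function name simple_clusters is hypothetical.

```python
import unicodedata


def simple_clusters(text, include_spacing_marks=True):
    """Roughly approximate default grapheme clusters.

    Assumption: a cluster is a base character followed by any marks,
    where "mark" is decided by general category alone. The real rules
    in UAX #29 use the Grapheme_Cluster_Break property instead.
    """
    extend = {"Mn", "Me"}
    if include_spacing_marks:
        extend.add("Mc")  # the extension discussed in this comment
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in extend:
            clusters[-1] += ch  # attach the mark to the current cluster
        else:
            clusters.append(ch)  # any other character starts a new cluster
    return clusters


# SA + VIRAMA + THA + VOWEL SIGN I, the example sequence above
word = "\u0938\u094D\u0925\u093F"
print(len(simple_clusters(word, include_spacing_marks=False)))  # 3 clusters
print(len(simple_clusters(word, include_spacing_marks=True)))   # 2 clusters
```

Even with spacing marks included, this approximation leaves a boundary between the VIRAMA and the THA, which is where a line break or a partial highlight could fall.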

 
We would like the Unicode TC to investigate the possibility of adding a rule to say that a vowel-killer character (such as the virama in the example above) extends the grapheme cluster to include any immediately following base character and all of its combining characters.
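One possible reading of such a rule, as a minimal sketch only: the code below assumes that a vowel killer can be recognised by canonical combining class 9 (Virama), that "base character" means any letter (general category L*), and that marks are the categories Mn, Mc and Me. The names syllable_clusters and VIRAMA_CCC are hypothetical, and a real rule would need to be expressed in terms of UAX #29 properties.

```python
import unicodedata

VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"


def is_mark(ch):
    """True for nonspacing, spacing and enclosing marks."""
    return unicodedata.category(ch) in {"Mn", "Mc", "Me"}


def syllable_clusters(text):
    """Sketch of the suggested rule: do not break after a virama
    when a letter follows, and keep marks with their base."""
    clusters = []
    for ch in text:
        if not clusters:
            clusters.append(ch)
            continue
        prev = clusters[-1][-1]
        after_virama = (
            unicodedata.combining(prev) == VIRAMA_CCC
            and unicodedata.category(ch).startswith("L")
        )
        if is_mark(ch) or after_virama:
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# SA + VIRAMA + THA + VOWEL SIGN I stays together as one cluster
word = "\u0938\u094D\u0925\u093F"
print(len(syllable_clusters(word)))  # 1
```

Restricting the join to letters keeps a space or punctuation mark that happens to follow a virama out of the cluster. Whether that restriction is right for every script is something the suggested research would need to settle.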

 
We feel that introducing a definition of default grapheme clusters that addresses this issue will go a long way towards ensuring that implementers provide applications that handle South Asian and South-East Asian scripts much better than they do now.

 
We feel that extending default grapheme clusters to include only spacing marks may only complicate things further. We do not, however, feel that the extension of grapheme clusters should be abandoned. 

 
Received on Friday, 7 March 2008 11:30:24 GMT
