RE: [UAX29] i18n comment 8: Conjunct clusters from Richard Ishida on 2008-03-07 (public-i18n-core@w3.org from January to March 2008)

From: Richard Ishida <ishida@w3.org>
Date: Fri, 7 Mar 2008 14:19:22 -0000
To: <public-i18n-core@w3.org>
Message-ID: <005f01c8805e$3dc83030$b9589090$@org>

The added explanation about why conjunct clusters are not included is very
useful.  I gather from the text that aksaras can be split after a virama if
the conjunct glyphs do not interact visually (although that's not actually
explicitly described).

I still feel that the current definition may stop short of being generally
useful for some scripts.  For example, Khmer subjoined consonants are always
treated as subscripts, as far as I am aware.  The grapheme cluster concept
doesn't seem to be very useful for Khmer as it stands, but I think it could be
extended for this script, as it was for Thai and Lao, and become more useful.
I suspect this may also be the case for Myanmar.

RI

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)
 
http://www.w3.org/International/
http://rishida.net/blog/
http://rishida.net/

 

> -----Original Message-----
> From: public-i18n-core-request@w3.org [mailto:public-i18n-core-
> request@w3.org] On Behalf Of ishida@w3.org
> Sent: 07 March 2008 11:34
> To: public-i18n-core@w3.org
> Subject: [UAX29] i18n comment 8: Conjunct clusters
> 
> 
> Comment from the i18n review of:
> http://www.unicode.org/reports/tr29/tr29-12.html
> 
> Comment 8
> At http://www.w3.org/International/reviews/0801-uax29/
> Editorial/substantive: S
> Tracked by: RI
> 
> Location in reviewed document:
> 3 [http://www.unicode.org/reports/tr29/tr29-
> 12.html#Grapheme_Cluster_Boundaries]
> 
> Comment:
> We don't think extending default grapheme clusters to just incorporate
> spacing marks goes far enough toward actually providing better results for a
> very large proportion of the world's population. We feel that the Unicode
> TC should conduct further research on how to extend default grapheme
> clusters so that they incorporate the majority of Indic and South-East
> Asian syllables.
> 
> 
> Example: It is very common to have a sequence such as
> consonant+virama+consonant+vowel_sign, e.g.
> 
> 
> 0938: स DEVANAGARI LETTER SA
> 094D: ् DEVANAGARI SIGN VIRAMA
> 0925: थ DEVANAGARI LETTER THA
> 093F: ि DEVANAGARI VOWEL SIGN I
> 
> 
> See this as it would be rendered
> [http://www.w3.org/International/reviews/0601-css3-selectors/sthiti.gif].
> 
> 
> Without tailoring, the current rules would result in text wrapping the THA
> to the next line, or attempting to highlight only part of the conjunct.
> The basic unit for grapheme clusters for Indic and South-East Asian
> scripts is the syllable, and just addressing spacing marks will still
> leave you short of a useful solution.
> 
> 
> We would like the Unicode TC to investigate the possibility of adding a
> rule to say that a vowel killer character extends the grapheme cluster to
> any immediately adjacent base character and all its combining characters.
> 
> 
> We feel that introducing a definition of default grapheme clusters that
> addresses this issue will go a long way to helping ensure that
> implementers provide applications that can handle South Asian and South-
> East Asian scripts much better than now.
> 
> 
> We feel that extending default grapheme clusters to include only spacing
> marks may only complicate things further. We do not, however, feel that
> the extension of grapheme clusters should be abandoned.
> 
>
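
The rule proposed in the quoted comment (a vowel killer extends the grapheme cluster to the immediately adjacent base character and its combining characters) can be roughly sketched in Python. This is a simplified illustration only, not the UAX #29 algorithm. The function names and the `join_after_virama` flag are hypothetical. The sketch assumes that marks in general categories Mn, Me and Mc attach to the preceding base, and that a vowel killer can be recognised by canonical combining class 9.

```python
import unicodedata

VIRAMA_CCC = 9  # assumption: canonical combining class 9 identifies a virama


def is_extender(ch):
    # Nonspacing, enclosing and spacing marks attach to the preceding base.
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")


def clusters(text, join_after_virama=False):
    """Split text into simplified clusters (illustrative, not UAX #29)."""
    result = []
    for ch in text:
        prev = result[-1][-1] if result else None
        # Proposed rule: a character right after a virama joins the cluster.
        after_virama = (
            join_after_virama
            and prev is not None
            and unicodedata.combining(prev) == VIRAMA_CCC
        )
        if result and (is_extender(ch) or after_virama):
            result[-1] += ch
        else:
            result.append(ch)
    return result


# The example from the comment: SA + VIRAMA + THA + VOWEL SIGN I
sthiti = "\u0938\u094D\u0925\u093F"

print(len(clusters(sthiti)))                          # 2: SA+VIRAMA | THA+VOWEL SIGN I
print(len(clusters(sthiti, join_after_virama=True)))  # 1: the whole syllable
```

Without the flag, the sequence breaks after the virama, which is the split between SA and THA that the comment objects to. With the flag, the syllable stays together as a single unit for line wrapping or highlighting.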
Received on Friday, 7 March 2008 14:16:06 UTC