Re: [UAX29] i18n comment 8: Conjunct clusters

From: Mark Davis <mark.davis@icu-project.org>
Date: Fri, 7 Mar 2008 08:52:06 -0800
Message-ID: <30b660a20803070852x7bfe054bl5cd7da8c8a76a740@mail.gmail.com>
To: "Richard Ishida" <ishida@w3.org>
Cc: public-i18n-core@w3.org
Yes, we can refine those in the future.

On Fri, Mar 7, 2008 at 8:46 AM, Richard Ishida <ishida@w3.org> wrote:

> I don't think we can fix this with wording in the UAX.
>
> It seems we would need to investigate whether it makes sense to treat Khmer
> and Myanmar as scripts (like Thai and Lao) that merit exceptional rules
> for certain character combinations.  We'd also need to check whether other
> scripts can be addressed in a similar way.
>
> Would it make sense to expand the remit of extended grapheme clusters in a
> future version of this document (since I guess it may be a little late to
> get such work done for this iteration)?
>
> RI
>
> ============
> Richard Ishida
> Internationalization Lead
> W3C (World Wide Web Consortium)
>
> http://www.w3.org/International/
> http://rishida.net/blog/
> http://rishida.net/
>
>
>
> > -----Original Message-----
> > From: public-i18n-core-request@w3.org
> > [mailto:public-i18n-core-request@w3.org] On Behalf Of Richard Ishida
> > Sent: 07 March 2008 14:19
> > To: public-i18n-core@w3.org
> > Subject: RE: [UAX29] i18n comment 8: Conjunct clusters
> >
> >
> > The added explanation about why conjunct clusters are not included is very
> > useful.  I gather from the text that aksaras can be split after a virama if
> > the conjunct glyphs do not interact visually (although that's not actually
> > explicitly described).
> >
> > I still feel that the current definition may stop short of being generally
> > useful for some scripts.  For example, Khmer subjoined consonants are always
> > treated as subscripts, as far as I am aware.  The grapheme cluster concept
> > doesn't seem to be very useful for Khmer as it stands, but I think it could
> > be extended for this script, as it was for Thai and Lao, and become more
> > useful.  I suspect this may also be the case for Myanmar.
> >
> > RI
> >
> > ============
> > Richard Ishida
> > Internationalization Lead
> > W3C (World Wide Web Consortium)
> >
> > http://www.w3.org/International/
> > http://rishida.net/blog/
> > http://rishida.net/
> >
> >
> >
> > > -----Original Message-----
> > > From: public-i18n-core-request@w3.org
> > > [mailto:public-i18n-core-request@w3.org] On Behalf Of ishida@w3.org
> > > Sent: 07 March 2008 11:34
> > > To: public-i18n-core@w3.org
> > > Subject: [UAX29] i18n comment 8: Conjunct clusters
> > >
> > >
> > > Comment from the i18n review of:
> > > http://www.unicode.org/reports/tr29/tr29-12.html
> > >
> > > Comment 8
> > > At http://www.w3.org/International/reviews/0801-uax29/
> > > Editorial/substantive: S
> > > Tracked by: RI
> > >
> > > Location in reviewed document:
> > > 3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]
> > >
> > > Comment:
> > > We don't think extending default grapheme clusters to just incorporate
> > > spacing marks goes far enough towards actually providing better results
> > > for a very large proportion of the world's population. We feel that the
> > > Unicode TC should conduct further research on how to extend default
> > > grapheme clusters so that they incorporate the majority of Indic and
> > > South-East Asian syllables.
> > >
> > >
> > > Example: It is very common to have a sequence such as
> > > consonant+virama+consonant+vowel_sign, e.g.:
> > >
> > >
> > > 0938: स DEVANAGARI LETTER SA
> > > 094D: ् DEVANAGARI SIGN VIRAMA
> > > 0925: थ DEVANAGARI LETTER THA
> > > 093F: ि DEVANAGARI VOWEL SIGN I
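
The four code points listed above can be inspected with a few lines of Python. This is only an illustrative sketch using the standard `unicodedata` module; the names `STHITI` and `describe` are made up for this example and do not come from UAX #29 or from the review. It shows that the virama is a nonspacing mark and the vowel sign a spacing mark, so each attaches to the consonant before it, and nothing in their mark status ties the two consonants together.

```python
import unicodedata

# The sequence from the example above: SA + VIRAMA + THA + VOWEL SIGN I.
# STHITI is a hypothetical name chosen for this illustration.
STHITI = "\u0938\u094D\u0925\u093F"

def describe(text):
    """Return (code point, name, general category, combining class) per character."""
    return [
        (f"{ord(ch):04X}", unicodedata.name(ch),
         unicodedata.category(ch), unicodedata.combining(ch))
        for ch in text
    ]

for row in describe(STHITI):
    print(*row)
```

The general categories come out as Lo, Mn, Lo, Mc, which is why rules that only attach marks to the preceding base leave a boundary between the virama and THA.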
> > >
> > >
> > > See this as it would be rendered
> > > [http://www.w3.org/International/reviews/0601-css3-selectors/sthiti.gif].
> > >
> > >
> > > Without tailoring, the current rules would result in text wrapping the
> > > THA to the next line, or attempting to highlight only part of the
> > > conjunct. The basic unit for grapheme clusters for Indic and South-East
> > > Asian scripts is the syllable, and just addressing spacing marks will
> > > still leave you short of a useful solution.
> > >
> > >
> > > We would like the Unicode TC to investigate the possibility of adding a
> > > rule to say that a vowel killer character extends the grapheme cluster
> > > to any immediately adjacent base character and all its combining
> > > characters.
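
One way to picture the rule proposed above is the rough Python sketch below. It is not the UAX #29 algorithm and not a proposal for its actual wording. It makes two loud assumptions: `simple_clusters` is a simplified stand-in for "base character plus following marks", and a "vowel killer" is detected by canonical combining class 9, which is the class assigned to virama-type characters. All function names are hypothetical.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class of virama-type characters

def simple_clusters(text):
    """Simplified stand-in: group each base with the marks (M*) that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def conjunct_clusters(text):
    """Sketch of the proposed rule: a cluster ending in a virama absorbs the next one."""
    merged = []
    for cluster in simple_clusters(text):
        if merged and unicodedata.combining(merged[-1][-1]) == VIRAMA_CCC:
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged

STHITI = "\u0938\u094D\u0925\u093F"  # SA + VIRAMA + THA + VOWEL SIGN I
print(len(simple_clusters(STHITI)))    # 2: a break falls before THA
print(len(conjunct_clusters(STHITI)))  # 1: the whole syllable stays together
```

Because this sketch joins after every virama, it would also join sequences where the virama is rendered visibly and the consonants do not form a conjunct, so a real rule would likely need script-specific refinement.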
> > >
> > >
> > > We feel that introducing a definition of default grapheme clusters that
> > > addresses this issue will go a long way to helping ensure that
> > > implementers provide applications that can handle South Asian and
> > > South-East Asian scripts much better than now.
> > >
> > >
> > > We feel that extending default grapheme clusters to include only spacing
> > > marks may only complicate things further. We do not, however, feel that
> > > the extension of grapheme clusters should be abandoned.
> > >
> > >
> >
>
>
>


-- 
Mark
Received on Friday, 7 March 2008 16:52:17 GMT