- From: സിബു <cibu@google.com>
- Date: Thu, 13 Mar 2014 14:35:36 -0700
- To: Richard Ishida <ishida@w3.org>
- Cc: John Cowan <cowan@mercury.ccil.org>, indic <public-i18n-indic@w3.org>
- Message-ID: <CAFPeFPLWVK-R2Rx-Ht9L9vnS6AMjgdZ+gZwG+oBX=SVNYFdpEQ@mail.gmail.com>
Thanks, Richard, for the detailed explanation of the difference between grapheme clusters and graphemes. I have also gone over your material at http://rishida.net/docs/unicode-tutorial/part3#graphemes. It is a wonderful tutorial, with many examples that make things clearer.

Now it feels like we need to come up with something in between graphemes and grapheme clusters. Let me call it an explicit grapheme. It is a sub-unit of a grapheme cluster: it can be the same as the grapheme cluster or smaller. Its boundary inside a grapheme cluster is decided by an explicit virama in its shaping.

Going through your examples:

1. two explicit graphemes: H-I-ANUSVARA, D-II
2. two explicit graphemes: H-I, N-D-II
3. three explicit graphemes: P, L-Y-AA, TTA
4. (I don't have enough experience with Myanmar)
5. two explicit graphemes: B-AA-ANUSVARA, L-AA
6. five explicit graphemes: VISARGA, J-I, R-AA, K, S

Now each of the properties we consider (letter-spacing, first-letter highlighting, etc.) has to fall into one of these 3 buckets. Here are my straw-man assignments for some of those properties:

Grapheme clusters: first-letter highlighting
Explicit graphemes: letter spacing, vertical units
Graphemes: unfortunately I don't see any properties going here

BTW, so far I don't see any need for special treatment for Malayalam.

On March 13, 2014, 10:05 AM, Richard Ishida <ishida@w3.org> wrote:

> On 13/03/2014 15:08, John Cowan wrote:
>
>> Richard Ishida scripsit:
>>
>>> I think the important question is whether the whole conjunct should
>>> continue to be treated as a unit for first-letter styling, line
>>> breaking, vertical arrangements, etc, whether or not the conjunct is
>>> expressed using a visible virama (actually, in fact, whether the
>>> orthographic syllable continues to be the unit, since it may also
>>> include vowel signs and such).
>>>
>>
>> Well, that's precisely the point: given a default grapheme cluster,
>> you can't tell how many orthographic syllables it will require
>> unless you know what the font will do.
>>
>
> Let me try to be clearer.
>
> My current understanding is that the grapheme [note: grapheme, not
> grapheme cluster!] used for splitting text for first-letter highlighting,
> line-breaking, letter-spacing, and such operations in Indic scripts, is the
> 'orthographic syllable'. This is different from the phonetic syllable and
> also very often different from a grapheme cluster. It includes ([all the
> consonant characters from the beginning to the end of a conjunct] or [a
> vowel that starts a phonetic syllable]) plus any viramas, plus any
> following diacritics and vowel-signs.
>
> A grapheme cluster is a sequence of characters that fit the Unicode
> definition of grapheme cluster, and while those sequences vacuum up most
> combining characters, they don't cover full graphemes/orthographic
> syllables in many Indic scripts.
>
> I'm not aware of any grapheme-clusters that are longer than user-perceived
> graphemes.
>
> Examples of such syllables in a script of the Indian subcontinent:
>
>
> [Example 1]
> In the word हिंदी, two syllables:
>
> [1]
> U+0939 DEVANAGARI LETTER HA
> U+093F DEVANAGARI VOWEL SIGN I
> U+0902 DEVANAGARI SIGN ANUSVARA
> (this is a grapheme cluster)
>
> [2]
> U+0926 DEVANAGARI LETTER DA
> U+0940 DEVANAGARI VOWEL SIGN II
> (this is a grapheme cluster)
>
>
> [Example 2]
> In the same word, spelled differently, हिन्दी, two (different) syllables:
>
> [1]
> U+0939 DEVANAGARI LETTER HA
> U+093F DEVANAGARI VOWEL SIGN I
> (this is a grapheme cluster)
>
> [2]
> U+0928 DEVANAGARI LETTER NA
> U+094D DEVANAGARI SIGN VIRAMA
> U+0926 DEVANAGARI LETTER DA
> U+0940 DEVANAGARI VOWEL SIGN II
> (this is NOT a grapheme cluster - it's two)
>
>
> [Example 3]
> The word ফ্ল্যাট
>
> [1]
> 09AB BENGALI LETTER PHA
> 09CD BENGALI SIGN VIRAMA
> 200C ZERO WIDTH NON-JOINER
> 09B2 BENGALI LETTER LA
> 09CD BENGALI SIGN VIRAMA
> 09AF BENGALI LETTER YA
> 09BE BENGALI VOWEL SIGN AA
> (this is NOT a grapheme cluster - I think it's three)
>
> [2]
> 099F BENGALI LETTER TTA
> (this is a grapheme cluster)
>
>
> My assumption is that these grapheme boundaries remain the same whether or
> not the font represents the sequences of conjunct characters as ligatures,
> special joining forms, or sequences with explicit viramas. I'm seeking
> confirmation of that.
>
> I'd be interested to know if Malayalam's recent script reforms make it a
> special case. Certainly, my belief is that even if the sequence S-KHA (സ്ഖ)
> would be displayed with an explicit virama in a reformed script font and
> as a single unit in a traditional script font, there is no difference in
> terms of *grapheme-clusters*, which work on a character-only basis. Is
> there a difference in terms of user-perceived *graphemes*, such that
> breaks are treated differently for a ligature vs a non-ligated conjunct in
> contexts such as first-letter, line-break, letter-spacing, etc?
>
> ================================================
>
>
> [Example 4]
> Here's a Myanmar word: အင်္ဂလန် (This is a SE Asian script rather than S
> Asian, so this is a little out of scope for the Indic Layout TF, but I
> thought it works the same. This example includes the kinzi that Andrew
> mentioned.)
>
> [1]
> 1021 MYANMAR LETTER A
> (this is a grapheme cluster)
>
> [2]
> 1004 MYANMAR LETTER NGA
> 103A MYANMAR SIGN ASAT
> 1039 MYANMAR SIGN VIRAMA
> 1002 MYANMAR LETTER GA
> 101C MYANMAR LETTER LA
> (this is NOT a grapheme cluster)
>
> [3]
> 1014 MYANMAR LETTER NA
> 103A MYANMAR SIGN ASAT
> (this is a grapheme cluster)
>
> Are these the expected break points in Myanmar for first-letter,
> line-break, letter-spacing, etc?
>
> =================================================
>
> There are certainly some situations in Indic scripts, btw, that I'm not
> sure about. For example:
>
> [Example 5]
> The word বাংলা
> 09AC BENGALI LETTER BA
> 09BE BENGALI VOWEL SIGN AA
> 0982 BENGALI SIGN ANUSVARA
> 09B2 BENGALI LETTER LA
> 09BE BENGALI VOWEL SIGN AA
>
> Are there two or three graphemes? (ie. is the anusvara part of the first
> orthographic syllable?)
>
>
> [Example 6]
> The word ஃஜிராக்ஸ்
>
> 0B83 TAMIL SIGN VISARGA
> 0B9C TAMIL LETTER JA
> 0BBF TAMIL VOWEL SIGN I
> 0BB0 TAMIL LETTER RA
> 0BBE TAMIL VOWEL SIGN AA
> 0B95 TAMIL LETTER KA
> 0BCD TAMIL SIGN VIRAMA
> 0BB8 TAMIL LETTER SA
> 0BCD TAMIL SIGN VIRAMA
>
> Is the visarga a separate grapheme for things like letter-spacing and
> first-letter highlighting? If it is, there are presumably two graphemes in
> this word (neither of which is a grapheme cluster).
>
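To make the "explicit grapheme" idea from the top of this message a bit more concrete, here is a rough Python sketch of how the splitting might work. It is only an illustration under several assumptions, not a worked-out algorithm: the names explicit_graphemes, virama_joins and font_shows_virama are made up for this example; a virama is recognised by its canonical combining class (9); virama + ZWNJ is always treated as explicit; and because whether a plain virama is visible depends on the font, that decision is left to a caller-supplied callback. Myanmar (Example 4) is not handled.

```python
import unicodedata

ZWNJ = "\u200c"
VIRAMA_CCC = 9  # canonical combining class shared by Indic virama characters


def is_virama(ch):
    return unicodedata.combining(ch) == VIRAMA_CCC


def virama_joins(text, i, font_shows_virama):
    """True if the letter at index i is joined to the previous letter by a
    virama that is NOT rendered explicitly (i.e. it forms a conjunct)."""
    prev = text[i - 1]
    if prev == ZWNJ:
        return False  # virama + ZWNJ asks for a visible virama
    return is_virama(prev) and not font_shows_virama(text, i - 1)


def explicit_graphemes(text, font_shows_virama=lambda text, i: False):
    """Split text into hypothetical 'explicit graphemes'.

    A new unit starts at every letter (general category L*) unless a hidden
    virama joins it to the preceding letter. font_shows_virama(text, i) is an
    assumed hook reporting whether the virama at index i is drawn visibly.
    """
    units = []
    for i, ch in enumerate(text):
        is_letter = unicodedata.category(ch).startswith("L")
        if not units or (is_letter and not virama_joins(text, i, font_shows_virama)):
            units.append(ch)
        else:
            units[-1] += ch  # marks, vowel signs, ZWNJ and joined letters
    return units


# Example 2 (Devanagari, conjunct assumed to ligate) and
# Example 3 (Bengali, ZWNJ forces an explicit virama after PHA).
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
flat = "\u09ab\u09cd\u200c\u09b2\u09cd\u09af\u09be\u099f"
print(len(explicit_graphemes(hindi)), len(explicit_graphemes(flat)))

# Example 6 (Tamil): assume the font always shows the virama (pulli).
xerox = "\u0b83\u0b9c\u0bbf\u0bb0\u0bbe\u0b95\u0bcd\u0bb8\u0bcd"
print(len(explicit_graphemes(xerox, font_shows_virama=lambda t, i: True)))
```

With those assumptions the sketch should give the same unit counts as my list above for Examples 1, 2, 3, 5 and 6. The callback is the important design point: the same character sequence can split differently depending on whether the font ligates the conjunct.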
Received on Thursday, 13 March 2014 21:36:26 UTC