Re: Role of the font in deciding the cluster boundaries from സിബു on 2014-03-13 (public-i18n-indic@w3.org from January to March 2014)

From: സിബു <cibu@google.com>
Date: Thu, 13 Mar 2014 14:35:36 -0700
To: Richard Ishida <ishida@w3.org>
Cc: John Cowan <cowan@mercury.ccil.org>, indic <public-i18n-indic@w3.org>
Message-ID: <CAFPeFPLWVK-R2Rx-Ht9L9vnS6AMjgdZ+gZwG+oBX=SVNYFdpEQ@mail.gmail.com>
Thanks, Richard, for the detailed explanation of the difference between
grapheme clusters and graphemes. I have gone over your material at
http://rishida.net/docs/unicode-tutorial/part3#graphemes as well. That is
a wonderful tutorial, with many examples to make things clearer.

Now it feels like we need to come up with something in between graphemes
and grapheme clusters. Let me call it an explicit grapheme. It is a sub-unit
of a grapheme cluster: it can be either the same as the grapheme cluster or
smaller than that. Its boundary inside a grapheme cluster is decided by an
explicit virama in its shaping.

Going through your examples:

   1. two explicit graphemes: H-I-ANUSVARA, D-II
   2. two explicit graphemes: H-I, N-D-II
   3. three explicit graphemes: P, L-Y-AA, TTA
   4. (I don't have enough experience with Myanmar)
   5. two explicit graphemes: B-AA-ANUSVARA, L-AA
   6. five explicit graphemes: VISARGA, J-I, R-AA, K, S
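
The rule behind this list can be sketched roughly in Python: a unit
continues across a virama only when that virama is not rendered
explicitly. This is an illustration only, not a tested algorithm. The
names explicit_graphemes and forms_conjunct are hypothetical, and the
default font model is an assumption: Tamil always shows its virama,
every other script forms a conjunct unless ZWNJ follows the virama.

```python
import unicodedata

ZWNJ = "\u200c"
ZWJ = "\u200d"


def is_virama(ch):
    # Canonical combining class 9 is the Unicode "Virama" class.
    return unicodedata.combining(ch) == 9


def is_tamil(ch):
    return "\u0b80" <= ch <= "\u0bff"


def default_forms_conjunct(virama, next_letter):
    # ASSUMPTION standing in for a real font: Tamil renders the virama
    # visibly (ignoring its few conjuncts); other scripts are assumed
    # to merge the virama into a conjunct.
    return not is_tamil(virama)


def explicit_graphemes(text, forms_conjunct=default_forms_conjunct):
    """Split text into units that break at explicitly rendered viramas."""
    units = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M") or ch in (ZWNJ, ZWJ):
            # Combining marks and joiners stay with the preceding unit.
            joins = True
        elif category.startswith("L") and units and is_virama(units[-1][-1]):
            # A letter right after a virama joins only if the font hides
            # the virama. After virama + ZWNJ the last character is ZWNJ,
            # so this branch is skipped and a new unit starts.
            joins = forms_conjunct(units[-1][-1], ch)
        else:
            joins = False
        if joins and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


hindi_anusvara = "\u0939\u093f\u0902\u0926\u0940"              # example 1
hindi_conjunct = "\u0939\u093f\u0928\u094d\u0926\u0940"        # example 2
bengali_flat = "\u09ab\u09cd\u200c\u09b2\u09cd\u09af\u09be\u099f"  # example 3
tamil_xerox = "\u0b83\u0b9c\u0bbf\u0bb0\u0bbe\u0b95\u0bcd\u0bb8\u0bcd"  # example 6

for word in (hindi_anusvara, hindi_conjunct, bengali_flat, tamil_xerox):
    print(len(explicit_graphemes(word)), explicit_graphemes(word))
```

With the default model this gives two units for each Hindi spelling,
three for the Bengali word and five for the Tamil word, matching the
list above. Passing forms_conjunct=lambda v, c: False (a font with no
conjuncts at all) splits N-D-II into N and D-II, which is the font
dependence under discussion.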


Now each of the properties we consider (letter-spacing, first-letter
highlighting, etc.) has to fall into one of these 3 buckets.

Here are my straw-man assignments for some of those properties:

Grapheme clusters: first letter highlighting
Explicit graphemes: letter spacing, vertical units
Graphemes: unfortunately I don't see any properties going here

BTW, so far I don't see any need for special treatment for Malayalam.


On 13 March 2014, 10:05 AM, Richard Ishida <ishida@w3.org> wrote:

> On 13/03/2014 15:08, John Cowan wrote:
>
>> Richard Ishida scripsit:
>>
>>> I think the important question is whether the whole conjunct should
>>> continue to be treated as a unit for first-letter styling, line
>>> breaking, vertical arrangements, etc, whether or not the conjunct is
>>> expressed using a visible virama (actually, in fact, whether the
>>> orthographic syllable continues to be the unit, since it may also
>>> include vowel signs and such).
>>>
>>
>> Well, that's precisely the point: given a default grapheme cluster,
>> you can't tell how many orthographic syllables it will require
>> unless you know what the font will do.
>>
>
> Let me try to be clearer.
>
> My current understanding is that the grapheme [note: grapheme, not
> grapheme cluster!] used for splitting text for first-letter highlighting,
> line-breaking, letter-spacing, and such operations in Indic scripts, is the
> 'orthographic syllable'.  This is different from the phonetic syllable and
> also very often different from a grapheme cluster.  It includes ([all the
> consonant characters from the beginning to the end of a conjunct] or [a
> vowel that starts a phonetic syllable]) plus any viramas, plus any
> following diacritics and vowel-signs.
>
> A grapheme cluster is a sequence of characters that fit the Unicode
> definition of grapheme cluster, and while those sequences vacuum up most
> combining characters, they don't cover full graphemes/orthographic
> syllables in many Indic scripts.
>
> I'm not aware of any grapheme-clusters that are longer than user-perceived
> graphemes.
>
>
> Examples of such syllables in a script of the Indian subcontinent:
>
>
> [Example 1]
> In the word हिंदी, two syllables:
>
> [1]
> U+0939 DEVANAGARI LETTER HA
> U+093F DEVANAGARI VOWEL SIGN I
> U+0902 DEVANAGARI SIGN ANUSVARA
> (this is a grapheme cluster)
>
> [2]
> U+0926 DEVANAGARI LETTER DA
> U+0940 DEVANAGARI VOWEL SIGN II
> (this is a grapheme cluster)
>
>
> [Example 2]
> In the same word, spelled differently, हिन्दी, two (different) syllables:
>
> [1]
> U+0939 DEVANAGARI LETTER HA
> U+093F DEVANAGARI VOWEL SIGN I
> (this is a grapheme cluster)
>
> [2]
> U+0928 DEVANAGARI LETTER NA
> U+094D DEVANAGARI SIGN VIRAMA
> U+0926 DEVANAGARI LETTER DA
> U+0940 DEVANAGARI VOWEL SIGN II
> (this is NOT a grapheme cluster - it's two)
>
>
>
> [Example 3]
> The word  ফ্‌ল্যাট
>
> [1]
> U+09AB BENGALI LETTER PHA
> U+09CD BENGALI SIGN VIRAMA
> U+200C ZERO WIDTH NON-JOINER
> U+09B2 BENGALI LETTER LA
> U+09CD BENGALI SIGN VIRAMA
> U+09AF BENGALI LETTER YA
> U+09BE BENGALI VOWEL SIGN AA
> (this is NOT a grapheme cluster - I think it's three)
>
> [2]
> U+099F BENGALI LETTER TTA
> (this is a grapheme cluster)
>
>
>
> My assumption is that these grapheme boundaries remain the same whether or
> not the font represents the sequences of conjunct characters as ligatures,
> special joining forms, or sequences with explicit viramas. I'm seeking
> confirmation of that.
>
> I'd be interested to know if Malayalam's recent script reforms make it a
> special case. Certainly, my belief is that even if the sequence S-KHA (സ്ഖ)
> would be displayed with an explicit virama in a reformed script font and
> as a single unit in a traditional script font, there is no difference in
> terms of *grapheme-clusters*, which work on a character-only basis.  Is
> there a difference in terms of user-perceived *graphemes*, such that
> breaks are treated differently for a ligature vs a non-ligated conjunct in
> contexts such as first-letter, line-break, letter-spacing, etc?
>
>
> ================================================
>
>
> [Example 4]
> Here's a Myanmar word: အင်္ဂလန်. This is a SE Asian script rather than S
> Asian, so this is a little out of scope for the Indic Layout TF, but I
> thought it works the same. (This example includes the kinzi that Andrew
> mentioned.)
>
> [1]
> U+1021 MYANMAR LETTER A
> (this is a grapheme cluster)
>
> [2]
> U+1004 MYANMAR LETTER NGA
> U+103A MYANMAR SIGN ASAT
> U+1039 MYANMAR SIGN VIRAMA
> U+1002 MYANMAR LETTER GA
> U+101C MYANMAR LETTER LA
> (this is NOT a grapheme cluster)
>
> [3]
> U+1014 MYANMAR LETTER NA
> U+103A MYANMAR SIGN ASAT
> (this is a grapheme cluster)
>
> Are these the expected break points in Myanmar for first-letter,
> line-break, letter-spacing, etc?
>
> =================================================
>
> There are certainly some situations in Indic scripts, btw, that I'm not
> sure about. For example:
>
> [Example 5]
> The word  বাংলা
> U+09AC BENGALI LETTER BA
> U+09BE BENGALI VOWEL SIGN AA
> U+0982 BENGALI SIGN ANUSVARA
> U+09B2 BENGALI LETTER LA
> U+09BE BENGALI VOWEL SIGN AA
>
> Are there two or three graphemes? (i.e., is the anusvara part of the first
> orthographic syllable?)
>
>
>
> [Example 6]
> The word ஃஜிராக்ஸ்
>
> U+0B83 TAMIL SIGN VISARGA
> U+0B9C TAMIL LETTER JA
> U+0BBF TAMIL VOWEL SIGN I
> U+0BB0 TAMIL LETTER RA
> U+0BBE TAMIL VOWEL SIGN AA
> U+0B95 TAMIL LETTER KA
> U+0BCD TAMIL SIGN VIRAMA
> U+0BB8 TAMIL LETTER SA
> U+0BCD TAMIL SIGN VIRAMA
>
> Is the visarga a separate grapheme for things like letter-spacing and
> first-letter highlighting?  If it is, there are presumably two graphemes in
> this word (neither of which is a grapheme cluster).
>
>
Received on Thursday, 13 March 2014 21:36:26 UTC