Re: Role of the font in deciding the cluster boundaries from Richard Ishida on 2014-03-13 (public-i18n-indic@w3.org from January to March 2014)

From: Richard Ishida <ishida@w3.org>
Date: Thu, 13 Mar 2014 17:05:59 +0000
To: John Cowan <cowan@mercury.ccil.org>, public-i18n-indic@w3.org
Message-ID: <5321E577.5040202@w3.org>

On 13/03/2014 15:08, John Cowan wrote:
> Richard Ishida scripsit:
>
>> I think the important question is whether the whole conjunct should
>> continue to be treated as a unit for first-letter styling, line
>> breaking, vertical arrangements, etc, whether or not the conjunct is
>> expressed using a visible virama (actually, in fact, whether the
>> orthographic syllable continues to be the unit, since it may also
>> include vowel signs and such).
>
> Well, that's precisely the point: given a default grapheme cluster,
> you can't tell how many orthographic syllables it will require
> unless you know what the font will do.

Let me try to be clearer.

My current understanding is that the grapheme [note: grapheme, not 
grapheme cluster!] used for splitting text for first-letter 
highlighting, line-breaking, letter-spacing, and such operations in 
Indic scripts, is the 'orthographic syllable'.  This is different from 
the phonetic syllable and also very often different from a grapheme 
cluster.  It includes ([all the consonant characters from the beginning 
to the end of a conjunct] or [a vowel that starts a phonetic syllable]) 
plus any viramas, plus any following diacritics and vowel-signs.
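
To make that definition concrete, here is a minimal Python sketch of how such a segmentation might be written. It is an illustration under stated assumptions, not an established algorithm: the function name orthographic_syllables is hypothetical, a virama is detected by canonical combining class 9, and a consonant that follows a virama (optionally with ZWJ/ZWNJ in between) is always kept in the same syllable. That rule is only intended for simple Devanagari and Bengali cases; scripts such as Tamil or Myanmar, where the virama or asat often just kills the vowel, would need extra rules.

```python
import unicodedata

ZWJ, ZWNJ = "\u200d", "\u200c"

def is_virama(ch):
    # Canonical combining class 9 is the Unicode "Virama" class.
    return unicodedata.combining(ch) == 9

def orthographic_syllables(text):
    """Split text into approximate orthographic syllables (hypothetical helper)."""
    syllables = []
    after_virama = False  # True while the current syllable ends in virama (+ joiners)
    for ch in text:
        is_joiner = ch in (ZWJ, ZWNJ)
        is_mark = unicodedata.category(ch) in ("Mn", "Mc") or is_joiner
        if syllables and (is_mark or after_virama):
            # Marks, joiners, and a consonant right after a virama stay attached.
            syllables[-1] += ch
        else:
            syllables.append(ch)
        if is_virama(ch):
            after_virama = True
        elif not is_joiner:
            after_virama = False
    return syllables

# The two Hindi spellings and the Bengali word used in the examples in this message.
words = [
    "\u0939\u093f\u0902\u0926\u0940",
    "\u0939\u093f\u0928\u094d\u0926\u0940",
    "\u09ab\u09cd\u200c\u09b2\u09cd\u09af\u09be\u099f",
]
for word in words:
    print([[f"U+{ord(c):04X}" for c in s] for s in orthographic_syllables(word)])
```

Under these assumptions each of the three words splits into two units, at the same points as the syllables listed in Examples 1 to 3.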

A grapheme cluster is a sequence of characters that fits the Unicode 
definition of grapheme cluster, and while those sequences vacuum up most 
combining characters, they don't cover full graphemes/orthographic 
syllables in many Indic scripts.

I'm not aware of any grapheme-clusters that are longer than 
user-perceived graphemes.


Examples of such syllables in a script of the Indian subcontinent:


[Example 1]
In the word हिंदी, two syllables:

[1]
U+0939 DEVANAGARI LETTER HA
U+093F DEVANAGARI VOWEL SIGN I
U+0902 DEVANAGARI SIGN ANUSVARA
(this is a grapheme cluster)

[2]
U+0926 DEVANAGARI LETTER DA
U+0940 DEVANAGARI VOWEL SIGN II
(this is a grapheme cluster)


[Example 2]
In the same word, spelled differently, हिन्दी, two (different) syllables:

[1]
U+0939 DEVANAGARI LETTER HA
U+093F DEVANAGARI VOWEL SIGN I
(this is a grapheme cluster)

[2]
U+0928 DEVANAGARI LETTER NA
U+094D DEVANAGARI SIGN VIRAMA
U+0926 DEVANAGARI LETTER DA
U+0940 DEVANAGARI VOWEL SIGN II
(this is NOT a grapheme cluster - it's two)



[Example 3]
The word  ফ্‌ল্যাট

[1]
   ‎09AB  BENGALI LETTER PHA
   ‎09CD  BENGALI SIGN VIRAMA
   ‎200C  ZERO WIDTH NON-JOINER
   ‎09B2  BENGALI LETTER LA
   ‎09CD  BENGALI SIGN VIRAMA
   ‎09AF  BENGALI LETTER YA
   ‎09BE  BENGALI VOWEL SIGN AA
(this is NOT a grapheme cluster - I think it's three)

[2]
   ‎099F  BENGALI LETTER TTA
(this is a grapheme cluster)
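
The cluster counts noted in Examples 2 and 3 can be checked roughly with the Python sketch below. It is only an approximation of the grapheme cluster rules as described in this message, not a full implementation of the Unicode segmentation rules: the function name approx_grapheme_clusters is hypothetical, and the sketch simply attaches every combining mark (general categories Mn, Mc, Me) and ZWJ/ZWNJ to the preceding character, with no conjunct-specific handling.

```python
import unicodedata

def approx_grapheme_clusters(text):
    """Approximate grapheme clusters: base character plus following marks/joiners."""
    clusters = []
    for ch in text:
        extends = (
            unicodedata.category(ch) in ("Mn", "Mc", "Me")
            or ch in ("\u200c", "\u200d")  # ZWNJ, ZWJ
        )
        if clusters and extends:
            clusters[-1] += ch
        else:
            # Any other character starts a new cluster, even after a virama.
            clusters.append(ch)
    return clusters

# Syllable [2] of Example 2 and syllable [1] of Example 3.
na_da_ii = "\u0928\u094d\u0926\u0940"
pha_la_ya_aa = "\u09ab\u09cd\u200c\u09b2\u09cd\u09af\u09be"
print(len(approx_grapheme_clusters(na_da_ii)))
print(len(approx_grapheme_clusters(pha_la_ya_aa)))
```

With this approximation the Devanagari sequence comes out as two clusters and the Bengali sequence as three, because each virama ends a cluster rather than binding the following consonant.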



My assumption is that these grapheme boundaries remain the same whether 
or not the font represents the sequences of conjunct characters as 
ligatures, special joining forms, or sequences with explicit viramas. 
I'm seeking confirmation of that.

I'd be interested to know if Malayalam's recent script reforms make it 
a special case. Certainly, my belief is that even if the sequence S-KHA 
(സ്ഖ) would be displayed with an explicit virama in a reformed script 
font and as a single unit in a traditional script font, there is no 
difference in terms of *grapheme-clusters*, which work on a 
character-only basis.  Is there a difference in terms of user-perceived 
*graphemes*, such that breaks are treated differently for a ligature vs 
a non-ligated conjunct in contexts such as first-letter, line-break, 
letter-spacing, etc?


================================================


[Example 4]
Here's a Myanmar word: အင်္ဂလန်  (This is a SE Asian script rather than S 
Asian, so this is a little out of scope for the Indic Layout TF, but I 
think it works the same way. This example includes the kinzi that Andrew 
mentioned.)

[1]
   ‎1021  MYANMAR LETTER A
(this is a grapheme cluster)

[2]
   ‎1004  MYANMAR LETTER NGA
   ‎103A  MYANMAR SIGN ASAT
   ‎1039  MYANMAR SIGN VIRAMA
   ‎1002  MYANMAR LETTER GA
   ‎101C  MYANMAR LETTER LA
(this is NOT a grapheme cluster)

[3]
   ‎1014  MYANMAR LETTER NA
   ‎103A  MYANMAR SIGN ASAT
(this is a grapheme cluster)

Are these the expected break points in Myanmar for first-letter, 
line-break, letter-spacing, etc?

=================================================

There are certainly some situations in Indic scripts, btw, that I'm not 
sure about. For example:

[Example 5]
The word  বাংলা
   ‎09AC  BENGALI LETTER BA
   ‎09BE  BENGALI VOWEL SIGN AA
   ‎0982  BENGALI SIGN ANUSVARA
   ‎09B2  BENGALI LETTER LA
   ‎09BE  BENGALI VOWEL SIGN AA

Are there two or three graphemes? (i.e. is the anusvara part of the first 
orthographic syllable?)



[Example 6]
The word ஃஜிராக்ஸ்

   ‎0B83  TAMIL SIGN VISARGA
   ‎0B9C  TAMIL LETTER JA
   ‎0BBF  TAMIL VOWEL SIGN I
   ‎0BB0  TAMIL LETTER RA
   ‎0BBE  TAMIL VOWEL SIGN AA
   ‎0B95  TAMIL LETTER KA
   ‎0BCD  TAMIL SIGN VIRAMA
   ‎0BB8  TAMIL LETTER SA
   ‎0BCD  TAMIL SIGN VIRAMA

Is the visarga a separate grapheme for things like letter-spacing and 
first-letter highlighting?  If it is, there are presumably two graphemes 
in this word (neither of which is a grapheme cluster).
Received on Thursday, 13 March 2014 17:06:28 UTC