[ilreq] Discussion of grapheme clusters in ilreq using pre-Unicode 15.1 definitions (#43) from Andj via GitHub on 2025-02-03 (public-i18n-archive@w3.org from January to March 2025)

From: Andj via GitHub <sysbot+gh@w3.org>
Date: Mon, 03 Feb 2025 21:33:48 +0000
To: public-i18n-archive@w3.org
Message-ID: <issues.opened-2828573244-1738618425-sysbot+gh@w3.org>

andjc has just created a new issue for https://github.com/w3c/ilreq:

== Discussion of grapheme clusters in ilreq using pre-Unicode 15.1 definitions ==
In the section on [typographic units](https://w3c.github.io/ilreq/#typographic-units) there is a discussion on extended grapheme clusters, using __स्कूल__ as an example. The text says:

>There are two syllables in this word: SA+VIRAMA+KA+UU and LA. Note, however, that there are three Unicode grapheme clusters here: SA+VIRAMA, KA+UU and LA.
>
>Styling is done on the basis of the whole orthographic syllable, not the first character, nor even the first grapheme. 

In Unicode 15.1, UAX #29 added a new rule specifically for some Indic scripts:

>[GB9c](https://www.unicode.org/reports/tr29/#GB9c) rule only applies to extended grapheme clusters:
>Do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker.

So the following characters:

```
                                Character properties                                
┌──────┬──────┬────────────────────────┬────────────┬────────────┬─────┬──────┬────┐
│ char │ cp   │ name                   │ script     │ block      │ cat │ bidi │ cc │
├──────┼──────┼────────────────────────┼────────────┼────────────┼─────┼──────┼────┤
│ ्     │ 094D │ DEVANAGARI SIGN VIRAMA │ Devanagari │ Devanagari │ Mn  │ NSM  │ 9  │
│ ্     │ 09CD │ BENGALI SIGN VIRAMA    │ Bengali    │ Bengali    │ Mn  │ NSM  │ 9  │
│ ્     │ 0ACD │ GUJARATI SIGN VIRAMA   │ Gujarati   │ Gujarati   │ Mn  │ NSM  │ 9  │
│ ୍     │ 0B4D │ ORIYA SIGN VIRAMA      │ Oriya      │ Oriya      │ Mn  │ NSM  │ 9  │
│ ్     │ 0C4D │ TELUGU SIGN VIRAMA     │ Telugu     │ Telugu     │ Mn  │ NSM  │ 9  │
│ ്     │ 0D4D │ MALAYALAM SIGN VIRAMA  │ Malayalam  │ Malayalam  │ Mn  │ NSM  │ 9  │
└──────┴──────┴────────────────────────┴────────────┴────────────┴─────┴──────┴────┘
                             String: [\p{InCB=Linker}]      
```

can now link the consonant that follows them into the same extended grapheme cluster as the consonant that precedes them.

So __स्कूल__ will be three extended grapheme clusters (['स्', 'कू', 'ल'] &ndash;  SA+VIRAMA, KA+UU and LA) in Unicode 15.0 and prior versions, and two extended grapheme clusters (['स्कू', 'ल'] &ndash; SA+VIRAMA+KA+UU and LA) in Unicode 15.1 onwards.
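The two segmentations can be reproduced with a minimal Python sketch. This is not a UAX #29 implementation: `segment`, `is_mark` and the `use_gb9c` flag are hypothetical names, general categories `Mn`/`Mc`/`Me` stand in for Grapheme_Cluster_Break=Extend/SpacingMark, category `Lo` stands in for InCB=Consonant, and the linker is assumed to sit immediately between the two consonants. `LINKERS` holds the six code points from the table above.

```python
import unicodedata

# InCB=Linker code points from the table above (Unicode 15.1).
LINKERS = {"\u094D", "\u09CD", "\u0ACD", "\u0B4D", "\u0C4D", "\u0D4D"}


def is_mark(ch):
    # Rough stand-in for Grapheme_Cluster_Break=Extend/SpacingMark.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def is_letter(ch):
    # Rough stand-in for InCB=Consonant (over-inclusive: also matches vowels).
    return unicodedata.category(ch) == "Lo"


def segment(text, use_gb9c):
    """Simplified cluster segmentation; use_gb9c toggles a GB9c-like rule."""
    clusters = []
    for ch in text:
        if not clusters:
            clusters.append(ch)
            continue
        prev = clusters[-1]
        linked = (
            use_gb9c
            and len(prev) >= 2
            and prev[-1] in LINKERS      # cluster ends in a linker...
            and is_letter(prev[-2])      # ...that directly follows a letter
            and is_letter(ch)            # ...and a letter comes next
        )
        if is_mark(ch) or linked:
            clusters[-1] += ch           # no break before ch
        else:
            clusters.append(ch)          # break before ch
    return clusters


# SA + VIRAMA + KA + UU + LA
word = "\u0938\u094D\u0915\u0942\u0932"
print(segment(word, use_gb9c=False))  # pre-15.1 behaviour
print(segment(word, use_gb9c=True))   # 15.1+ behaviour
```

With `use_gb9c=False` the sketch yields three clusters for the word, and with `use_gb9c=True` it yields two, matching the counts above.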

So the effect of extended grapheme cluster level segmentation will depend on the version of Unicode the toolchain is using at the point of segmentation.
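As a rough illustration, a toolchain could check the Unicode version it reports before relying on either behaviour. In the sketch below, `expects_gb9c` is a hypothetical helper, and `unicodedata.unidata_version` only describes the data bundled with Python's own `unicodedata` module. A separate segmentation library may ship a different Unicode version, so its own version would need to be checked instead.

```python
import unicodedata


def expects_gb9c(unicode_version):
    # Hypothetical helper: GB9c was introduced in Unicode 15.1, so any
    # version string of the form "major.minor[.update]" at or above 15.1
    # is assumed to include it.
    major, minor = (int(part) for part in unicode_version.split(".")[:2])
    return (major, minor) >= (15, 1)


# Version of the Unicode data used by this Python build's unicodedata module.
version = unicodedata.unidata_version
print(version, expects_gb9c(version))
```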

Please view or discuss this issue at https://github.com/w3c/ilreq/issues/43 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Monday, 3 February 2025 21:33:49 UTC