[bp-i18n-specdev] Example of Bengali grapheme clusters out of date (#150) from Andj via GitHub on 2025-01-29 (public-i18n-archive@w3.org from January to March 2025)

From: Andj via GitHub <sysbot+gh@w3.org>
Date: Wed, 29 Jan 2025 10:14:46 +0000
To: public-i18n-archive@w3.org
Message-ID: <issues.opened-2817771604-1738145683-sysbot+gh@w3.org>

andjc has just created a new issue for https://github.com/w3c/bp-i18n-specdev:

== Example of Bengali grapheme clusters out of date ==
The current editor's draft has the following text:

>For example, the Bangla user-perceived character kshī ক্ষী is composed of four characters: U+0995 BENGALI LETTER KA + U+09CD BENGALI SIGN VIRAMA + U+09B7 BENGALI LETTER SSA + U+09C0 BENGALI VOWEL SIGN II.
>Unicode splits these into two grapheme clusters, unless language-specific tailoring is applied. For more information, see our article [Character encodings: Essential concepts](https://www.w3.org/International/articles/definitions-characters/index.en.html#characters).

This describes the behaviour prior to Unicode 15.1. UAX29 was updated in the Unicode 15.1 release, which added a new rule, [GB9c](https://www.unicode.org/reports/tr29/tr29-43.html#GB9c):

>Do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker

For the example 'ক্ষী', UAX29 revision 41 and earlier produce two extended grapheme clusters ('ক্', 'ষী'), while UAX29 revision 43 onwards produces a single extended grapheme cluster ('ক্ষী'). The behaviour therefore depends on the version of UAX29 (i.e. the version of Unicode) that an implementation supports.
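
To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not a conformant UAX29 implementation: it only models rules GB9, GB9a, GB9c and the default break rule GB999, and the property tables are hand-copied for just the four code points in this example (a real implementation would read them from the Unicode Character Database). The names `segment`, `GCB`, `INCB` and `use_gb9c` are illustrative, not taken from any library.

```python
# Hand-copied property values for the four code points in the example only.
# Assumption: a real implementation would load these from the UCD data files.
GCB = {  # Grapheme_Cluster_Break
    0x0995: "Other",        # BENGALI LETTER KA
    0x09CD: "Extend",       # BENGALI SIGN VIRAMA
    0x09B7: "Other",        # BENGALI LETTER SSA
    0x09C0: "SpacingMark",  # BENGALI VOWEL SIGN II
}
INCB = {  # Indic_Conjunct_Break (absent means None)
    0x0995: "Consonant",
    0x09CD: "Linker",
    0x09B7: "Consonant",
}


def _gb9c_joins(cps, i):
    """GB9c: Consonant [Extend Linker]* Linker [Extend Linker]* x Consonant."""
    if INCB.get(cps[i]) != "Consonant":
        return False
    seen_linker = False
    j = i - 1
    # Walk back over the run of Extend/Linker characters before position i.
    while j >= 0 and INCB.get(cps[j]) in ("Extend", "Linker"):
        if INCB[cps[j]] == "Linker":
            seen_linker = True
        j -= 1
    return seen_linker and j >= 0 and INCB.get(cps[j]) == "Consonant"


def segment(text, use_gb9c):
    """Split text into clusters using GB9, GB9a, optionally GB9c, else break."""
    if not text:
        return []
    cps = [ord(ch) for ch in text]
    clusters = [text[0]]
    for i in range(1, len(cps)):
        prop = GCB.get(cps[i], "Other")
        no_break = prop in ("Extend", "SpacingMark")  # GB9, GB9a
        if use_gb9c and _gb9c_joins(cps, i):          # GB9c (Unicode 15.1+)
            no_break = True
        if no_break:
            clusters[-1] += text[i]
        else:
            clusters.append(text[i])                  # GB999: break elsewhere
    return clusters


KSHII = "\u0995\u09CD\u09B7\u09C0"  # ক্ষী

print(segment(KSHII, use_gb9c=False))  # two clusters: KA+VIRAMA, SSA+II
print(segment(KSHII, use_gb9c=True))   # one cluster: the whole conjunct
```

Without GB9c the only break opportunity is between the virama and SSA, which gives the two clusters described above; with GB9c that break is suppressed because the virama has InCB=Linker and sits between two InCB=Consonant characters.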

Please view or discuss this issue at https://github.com/w3c/bp-i18n-specdev/issues/150 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Wednesday, 29 January 2025 10:14:47 UTC