[charmod-norm] Matching Unicode characters that don't normalise together (#216) from r12a via GitHub on 2021-07-21 (public-i18n-archive@w3.org from July to September 2021)

From: r12a via GitHub <sysbot+gh@w3.org>
Date: Wed, 21 Jul 2021 16:54:39 +0000
To: public-i18n-archive@w3.org
Message-ID: <issues.opened-949916515-1626886478-sysbot+gh@w3.org>

r12a has just created a new issue for https://github.com/w3c/charmod-norm:

== Matching Unicode characters that don't normalise together ==
Orthographies that use Brahmi-derived or Arabic-based scripts have graphemes that look the same but have different underlying code points. Some of these are precomposed and decomposed pairs for which Unicode provides canonical mappings; they are not a problem and are already covered by this document.

Unfortunately, there is another, very prevalent case, where different underlying code points produce the same visual output but are not canonically equivalent.

In some cases there is advice from the Unicode Standard about which approach is preferred, but there is no real way of enforcing that advice when users start writing their content. A simple example of this would be the Sinhala equivalence:

ආ U+0D86: SINHALA LETTER AAYANNA
and
අ U+0D85: SINHALA LETTER AYANNA + ා U+0DCF: SINHALA VOWEL SIGN AELA-PILLA

Unicode says that the 2-character approach should not be used, but users may still type it, and apparently often do. In such a case, it may be useful for an application that is trying to match items to perform some additional normalisation, so that these sequences match. One could expect such normalisation based on visual similarity to have different rules per writing system, but there may even be different rules per orthography (i.e. per language).
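To make the gap concrete, here is a minimal sketch using Python's standard `unicodedata` module. It compares the two Sinhala spellings under each of the four Unicode normalisation forms, with a precomposed/decomposed Latin pair as a control. The helper name `equal_under` is made up for this illustration.

```python
import unicodedata

AAYANNA = "\u0D86"                   # ආ as a single code point
AYANNA_AELA_PILLA = "\u0D85\u0DCF"   # අ + ා, the discouraged two-code-point spelling


def equal_under(form: str, a: str, b: str) -> bool:
    """Return True if a and b are identical after normalising both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Compare the two Sinhala spellings under every standard normalisation form.
results = {
    form: equal_under(form, AAYANNA, AYANNA_AELA_PILLA)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(results)

# Control: a canonically equivalent pair (é vs e + combining acute) does match.
print(equal_under("NFC", "\u00E9", "e\u0301"))
```

None of the standard forms brings the two Sinhala spellings together, whereas the control pair matches, so any matching has to come from rules outside standard normalisation.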

But there are many similar scenarios that are not warned against by the Unicode Standard, and often it can be difficult to know which character(s) to use for a given visual result. I have recently been documenting the orthography of Kashmiri and there are several examples of this, leading to different encodings in content such as Wikipedia or even script tutorials. One example is:

ۆ U+06C6: ARABIC LETTER OE
vs
وٚ U+0648 U+065A: ARABIC LETTER WAW + ARABIC VOWEL SIGN SMALL V ABOVE

It so happens that Wikipedia and other sources tend to use the precomposed character rather than the sequence in this case. But there are several other letters where the sequence tends to be used, rather than the precomposed character. In some texts, both are used in the same content.
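As a rough way to check whether a given text mixes the two encodings, something like the following sketch could be used. The function names and the sample string are invented for illustration (the sample is not real Kashmiri text), and the substring search assumes that no other combining mark sits between the WAW and the SMALL V ABOVE.

```python
OE = "\u06C6"                 # ۆ as a single code point
WAW_SMALL_V = "\u0648\u065A"  # و followed by combining small v above


def count_encodings(text: str) -> dict:
    """Count how often each encoding of the same visual letter occurs in text."""
    return {
        "precomposed": text.count(OE),
        "sequence": text.count(WAW_SMALL_V),
    }


def is_mixed(text: str) -> bool:
    """Return True if both encodings appear in the same text."""
    counts = count_encodings(text)
    return counts["precomposed"] > 0 and counts["sequence"] > 0


# Artificial sample: one precomposed letter and two sequences, space separated.
sample = "\u06C6 \u0648\u065A \u0648\u065A"
print(count_encodings(sample), is_mixed(sample))
```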

We could say that people should use the right code points, but in Kashmiri **it's not even clear which _are_ the 'right' character(s).**

This seems to be a case where, on an orthography-specific basis, either:
a. some standard needs to be developed that clarifies which characters should and should not be used, and fonts or input systems should police this, or
b. additional tailored normalisations should be performed by an application.
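
Option b could look roughly like the sketch below: an application applies standard NFC first and then a small table of orthography-specific replacements, selected by language tag. Everything here is an assumption for illustration: the `TAILORINGS` table, the function names, and in particular the direction of the Kashmiri mapping (folding the sequence to the precomposed letter), which is a choice made for matching only and not a statement about which encoding is correct.

```python
import unicodedata

# Hypothetical per-orthography tailorings, keyed by language tag.
# Each entry maps a code point sequence to the form it is folded to for matching.
TAILORINGS = {
    "si": {"\u0D85\u0DCF": "\u0D86"},   # අ + ා  ->  ආ
    "ks": {"\u0648\u065A": "\u06C6"},   # و + small v above  ->  ۆ (assumed direction)
}


def fold_for_matching(text: str, lang: str) -> str:
    """Apply NFC, then the replacements tailored to the given language."""
    folded = unicodedata.normalize("NFC", text)
    for source, target in TAILORINGS.get(lang, {}).items():
        folded = folded.replace(source, target)
    return folded


def matches(a: str, b: str, lang: str) -> bool:
    """Compare two strings after language-tailored folding."""
    return fold_for_matching(a, lang) == fold_for_matching(b, lang)


print(matches("\u0D86", "\u0D85\u0DCF", "si"))   # Sinhala tailoring applies
print(matches("\u0D86", "\u0D85\u0DCF", "ks"))   # wrong orthography: no folding
print(matches("\u06C6", "\u0648\u065A", "ks"))   # Kashmiri tailoring applies
```

Keying the table by language keeps the rules per orthography, so a folding that is appropriate for one language is not silently applied to another that uses the same script.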

It's my expectation that, either due to de facto usage patterns, or due to simple encoding ambiguities, in some cases there will always be two different ways of writing the same thing that are not made equivalent by standard Unicode normalisation.

Please view or discuss this issue at https://github.com/w3c/charmod-norm/issues/216 using your GitHub account

--
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Wednesday, 21 July 2021 16:54:41 UTC