Re: [charmod-norm] Matching Unicode characters that don't normalise together (#216)

Here's another example, which occurs in multiple orthographies.  Let's consider Persian.

Persian (and Urdu, Kashmiri, etc.) uses <span class="codepoint" translate="no"><span lang="fa" dir="rtl">&#x06CC;</span> [<span class="uname">U+06CC ARABIC LETTER FARSI YEH</span>]</span> for 'yeh'.  It doesn't use <span class="codepoint" translate="no"><span lang="fa" dir="rtl">&#x064A;</span> [<span class="uname">U+064A ARABIC LETTER YEH</span>]</span>, because there are differences in the glyphs for certain joining forms.

However, Persian sometimes uses a hamza diacritic above yeh.  The Unicode Standard explains that a combining hamza should be used rather than a precomposed character.  In practice, though, many documents use <span class="codepoint" translate="no"><span lang="fa" dir="rtl">&#x0626;</span> [<span class="uname">U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE</span>]</span>.

The problem with this is that
<span class="codepoint" translate="no"><span lang="fa" dir="rtl">&#x0626;</span> [<span class="uname">U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE</span>]</span>
decomposes to
<span class="codepoint" translate="no"><span lang="fa" dir="rtl">&#x064A;&#x0654;</span> [<span class="uname">U+064A ARABIC LETTER YEH</span> + <span class="uname">U+0654 ARABIC HAMZA ABOVE</span>]</span>
which produces the Arabic yeh rather than the Persian one.
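
As a quick check of the decomposition described above, here is a minimal sketch using Python's standard `unicodedata` module. The variable names are only illustrative.

```python
import unicodedata

# U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE, as found in many Persian documents
precomposed = "\u0626"

# Canonical decomposition (NFD) of the precomposed character
decomposed = unicodedata.normalize("NFD", precomposed)

# List the resulting code points in U+XXXX notation
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)

# The base letter is the Arabic yeh (U+064A), not the Farsi yeh (U+06CC)
print("\u06CC" in decomposed)
```

The base letter that comes out is U+064A, so normalising alone never yields a sequence built on U+06CC.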

So in Persian text, an application should treat the following two sequences as equivalent, even though no Unicode normalisation form maps one to the other:

<span class="codepoint" translate="no"><span lang="fa" dir="rtl">&#x06CC;&#x0654;</span> [<span class="uname">U+06CC ARABIC LETTER FARSI YEH</span> + <span class="uname">U+0654 ARABIC HAMZA ABOVE</span>]</span>
<span class="codepoint" translate="no"><span lang="fa" dir="rtl">&#x0626;</span> [<span class="uname">U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE</span>]</span>

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/charmod-norm/issues/216#issuecomment-896682942 using your GitHub account



Received on Wednesday, 11 August 2021 09:53:34 UTC