Re: [charmod-norm] Add worked examples of case folding [I18N-ACTION-974] (#214) from r12a via GitHub on 2020-12-04 (public-i18n-archive@w3.org from October to December 2020)

From: r12a via GitHub <sysbot+gh@w3.org>
Date: Fri, 04 Dec 2020 16:52:30 +0000
To: public-i18n-archive@w3.org
Message-ID: <issue_comment.created-738891992-1607100749-sysbot+gh@w3.org>
Here's how i'm framing the topic for myself. Does it help? (Either for planning your document explanation, or for showing me where i'm missing something.)

If you take the precomposed (NFC) character <span class="codepoint" translate="no"><span lang="el">&#x1F8C;</span> [<a href="/scripts/greek/block#char1F8C"><span class="uname">U+1F8C GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI</span></a>]</span> (which is the most common way to represent this combination of base and diacritic characters) and just apply the case fold transformations, you end up with:
<span class="codepoint" translate="no"><span lang="el">&#x1F04;&#x03B9;</span> [<a href="/scripts/greek/block#char1F04"><span class="uname">U+1F04 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA</span></a> + <a href="/scripts/greek/block#char03B9"><span class="uname">U+03B9 GREEK SMALL LETTER IOTA</span></a>]</span>

If you start from a fully decomposed (NFD) sequence representing the same letter, <span class="codepoint" translate="no"><span lang="el">&#x0391;&#x0313;&#x0301;&#x0345;</span> [<a href="/scripts/greek/block#char0391"><span class="uname">U+0391 GREEK CAPITAL LETTER ALPHA</span></a> + <a href="/scripts/greek/block#char0313"><span class="uname">U+0313 COMBINING COMMA ABOVE</span></a> + <a href="/scripts/greek/block#char0301"><span class="uname">U+0301 COMBINING ACUTE ACCENT</span></a> + <a href="/scripts/greek/block#char0345"><span class="uname">U+0345 COMBINING GREEK YPOGEGRAMMENI</span></a>]</span>
and case fold it, you end up with
<span class="codepoint" translate="no"><span lang="el">&#x03B1;&#x0313;&#x0301;&#x03B9;</span> [<a href="/scripts/greek/block#char03B1"><span class="uname">U+03B1 GREEK SMALL LETTER ALPHA</span></a> + <a href="/scripts/greek/block#char0313"><span class="uname">U+0313 COMBINING COMMA ABOVE</span></a> + <a href="/scripts/greek/block#char0301"><span class="uname">U+0301 COMBINING ACUTE ACCENT</span></a> + <a href="/scripts/greek/block#char03B9"><span class="uname">U+03B9 GREEK SMALL LETTER IOTA</span></a>]</span>

Clearly, these two results don't match as they stand, and some normalisation will be necessary. However, in both cases the acute accent is associated with the alpha base character.
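The two results so far can be checked with a short Python sketch. It assumes that Python's built-in `str.casefold()` (full Unicode case folding) is an acceptable stand-in for "just run case fold transformations"; the helper `cps` is a hypothetical name used only for printing.

```python
import unicodedata

def cps(s):
    """Format a string as space-separated U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

precomposed = "\u1F8C"                  # NFC form
decomposed = "\u0391\u0313\u0301\u0345"  # NFD form of the same letter

# Case fold each starting form without normalising first.
folded_pre = precomposed.casefold()
folded_dec = decomposed.casefold()

print(cps(folded_pre))  # U+1F04 U+03B9
print(cps(folded_dec))  # U+03B1 U+0313 U+0301 U+03B9

# The folded strings differ code point for code point...
print(folded_pre == folded_dec)  # False

# ...but normalising both folded results to NFD makes them equal.
same_after_nfd = (unicodedata.normalize("NFD", folded_pre)
                  == unicodedata.normalize("NFD", folded_dec))
print(same_after_nfd)  # True
```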

If, however, you begin with the half-precomposed sequence <span class="codepoint" translate="no"><span lang="el">&#x1F88;&#x0301;</span> [<a href="/scripts/greek/block#char1F88"><span class="uname">U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI</span></a> + <a href="/scripts/greek/block#char0301"><span class="uname">U+0301 COMBINING ACUTE ACCENT</span></a>]</span>
and case fold it, you end up with
<span class="codepoint" translate="no"><span lang="el">&#x1F00;&#x03B9;&#x0301;</span> [<a href="/scripts/greek/block#char1F00"><span class="uname">U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI</span></a> + <a href="/scripts/greek/block#char03B9"><span class="uname">U+03B9 GREEK SMALL LETTER IOTA</span></a> + <a href="/scripts/greek/block#char0301"><span class="uname">U+0301 COMBINING ACUTE ACCENT</span></a>]</span>
where the acute accent is associated with the iota. 

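This produces a sequence that can't be normalised to match the others! The iota is a base character (a starter), so no normalisation form will move the acute accent back across it onto the alpha.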
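A minimal sketch of that failure, again assuming Python's `str.casefold()` as the case fold operation. It folds the half-precomposed sequence and the precomposed character, then compares them under each of the four normalisation forms; the variable names are illustrative only.

```python
import unicodedata

half_precomposed = "\u1F88\u0301"  # alpha with psili and prosgegrammeni + acute
precomposed = "\u1F8C"             # the fully precomposed character

folded_half = half_precomposed.casefold()  # U+1F00 U+03B9 U+0301
folded_pre = precomposed.casefold()        # U+1F04 U+03B9

# Try every normalisation form on the already-folded strings.
matches = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    a = unicodedata.normalize(form, folded_half)
    b = unicodedata.normalize(form, folded_pre)
    matches[form] = (a == b)

# The acute stays after the iota, so no form produces a match.
print(matches)
```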

The way to resolve this problem is to normalise all the text beforehand to NFD. Then 
<span class="codepoint" translate="no"><span lang="el">&#x1F8C;</span> [<a href="/scripts/greek/block#char1F8C"><span class="uname">U+1F8C GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI</span></a>]</span>
and 
<span class="codepoint" translate="no"><span lang="el">&#x1F88;&#x0301;</span> [<a href="/scripts/greek/block#char1F88"><span class="uname">U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI</span></a> + <a href="/scripts/greek/block#char0301"><span class="uname">U+0301 COMBINING ACUTE ACCENT</span></a>]</span>
both end up the same as the decomposed version, i.e. 
<span class="codepoint" translate="no"><span lang="el">&#x0391;&#x0313;&#x0301;&#x0345;</span> [<a href="/scripts/greek/block#char0391"><span class="uname">U+0391 GREEK CAPITAL LETTER ALPHA</span></a> + <a href="/scripts/greek/block#char0313"><span class="uname">U+0313 COMBINING COMMA ABOVE</span></a> + <a href="/scripts/greek/block#char0301"><span class="uname">U+0301 COMBINING ACUTE ACCENT</span></a> + <a href="/scripts/greek/block#char0345"><span class="uname">U+0345 COMBINING GREEK YPOGEGRAMMENI</span></a>]</span>

If you now case fold that sequence, all three starting forms produce the same result, and so they match.
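The normalise-first approach can be sketched as follows, assuming Python's `unicodedata.normalize` and `str.casefold()`. The function name `nfd_then_fold` is hypothetical, and the sketch deliberately stops after the fold, without any second normalisation pass.

```python
import unicodedata

def nfd_then_fold(s):
    """Normalise to NFD first, then apply full case folding (illustrative helper)."""
    return unicodedata.normalize("NFD", s).casefold()

forms = [
    "\u1F8C",                    # precomposed (NFC)
    "\u0391\u0313\u0301\u0345",  # fully decomposed (NFD)
    "\u1F88\u0301",              # half-precomposed
]

# Collect the distinct results; a single entry means all three forms match.
results = {nfd_then_fold(s) for s in forms}
print(len(results))  # 1
print([f"U+{ord(c):04X}" for c in next(iter(results))])
# ['U+03B1', 'U+0313', 'U+0301', 'U+03B9']
```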

What i'm still not clear about is why you need to double-normalise.

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/charmod-norm/pull/214#issuecomment-738891992 using your GitHub account


Received on Friday, 4 December 2020 16:52:36 UTC