Re: [charmod-norm] which characters, exactly, should be removed in the matching algorithm? from Addison Phillips via GitHub on 2017-11-25 (public-i18n-archive@w3.org from October to December 2017)

From: Addison Phillips via GitHub <sysbot+gh@w3.org>
Date: Sat, 25 Nov 2017 21:06:03 +0000
To: public-i18n-archive@w3.org
Message-ID: <issue_comment.created-346966163-1511643963-sysbot+gh@w3.org>

We've removed all of this guidance, leaving behind a section on "Additional Match Tailoring". Everyone satisfied?

> Some implementations might require additional tailoring to assist with matching. This might include removing certain Unicode controls or invisible markers, mapping together or removing characters that are part of the syntax, or performing a whitespace trim.

> Specifications need to clearly define any additional tailoring done as part of the matching process. Care should be taken not to interfere with the encoding of different languages. For example, a process that removes all combining characters based on Unicode character classes will not support languages that rely on combining marks and will lead to user frustration. An example of this would be the various Indic scripts, which use combining marks to encode or suppress vowels.
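
To make the Indic example concrete, here is a rough Python sketch (not text from the spec). It contrasts a class-based removal of all combining marks with a narrowly defined tailoring. The function names and the `REMOVED` set are hypothetical, chosen only for illustration. A real specification would have to define its own list.

```python
import unicodedata


def strip_all_marks(text: str) -> str:
    """Anti-pattern: drop every character whose general category is a Mark (Mn, Mc, Me)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))


# Hypothetical, explicitly enumerated tailoring set (an assumption for this sketch):
# ZERO WIDTH SPACE, WORD JOINER, ZERO WIDTH NO-BREAK SPACE.
REMOVED = {"\u200b", "\u2060", "\ufeff"}


def narrow_tailoring(text: str) -> str:
    """Remove only the listed invisible characters, then trim surrounding whitespace."""
    return "".join(ch for ch in text if ch not in REMOVED).strip()


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari

# Class-based removal deletes the vowel signs and the virama, leaving bare consonants.
damaged = strip_all_marks(hindi)
print([f"U+{ord(ch):04X}" for ch in damaged])

# The enumerated tailoring leaves the word itself untouched.
print(narrow_tailoring(" \ufeff" + hindi + "\u200b ") == hindi)
```

With the class-based approach, the six code points of the word collapse to the three consonants U+0939, U+0928, U+0926, which no longer spell the word. The enumerated approach only touches characters the specification has named.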

-- 
GitHub Notification of comment by aphillips
Please view or discuss this issue at https://github.com/w3c/charmod-norm/issues/117#issuecomment-346966163 using your GitHub account

Received on Saturday, 25 November 2017 21:06:07 UTC