[charmod-norm] which characters, exactly, should be removed in the matching algorithm?

From: Addison Phillips via GitHub <sysbot+gh@w3.org>
Date: Sat, 28 Jan 2017 00:05:58 +0000
To: public-i18n-archive@w3.org
Message-ID: <issues.opened-203770449-1485561957-sysbot+gh@w3.org>
aphillips has just created a new issue for 
https://github.com/w3c/charmod-norm:

== which characters, exactly, should be removed in the matching algorithm? ==
In the latest version of the matching algorithm, I noticed that the 
advice is to "remove Unicode controls", but this is non-specific and 
just links to the section on invisibles. I drafted the following list 
and have a question about whether it is complete and correct:

> Issue 1
>
> What to do about non-breaking space and other space characters? Is
> this the full list? What about the Mongolian characters?
>
> Remove all of the following invisible Unicode characters:
>
>    ZWJ, ZWNJ
>    Variation Selectors (FE00..FE0F)
>    COMBINING GRAPHEME JOINER 034F
>    SOFT HYPHEN 00AD
>    ZERO WIDTH SPACE 200B
>    Bidi controls 
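
As a rough illustration of the removal step listed above, here is a minimal Python sketch. It is not the algorithm from the spec. The names `REMOVABLE` and `strip_invisibles` are hypothetical, and the expansion of "Bidi controls" to the characters with the Unicode Bidi_Control property is an assumption, since the list does not enumerate them. Space characters (including NO-BREAK SPACE) and the Mongolian characters are deliberately left alone because the issue leaves them open.

```python
# Hypothetical sketch: code points mirror the draft list in this issue,
# which is itself flagged as possibly incomplete.
REMOVABLE = (
    {0x200C, 0x200D}               # ZWNJ, ZWJ
    | set(range(0xFE00, 0xFE10))   # Variation Selectors FE00..FE0F
    | {0x034F}                     # COMBINING GRAPHEME JOINER
    | {0x00AD}                     # SOFT HYPHEN
    | {0x200B}                     # ZERO WIDTH SPACE
    # Assumption: "Bidi controls" means the Bidi_Control property set.
    | {0x061C, 0x200E, 0x200F}     # ALM, LRM, RLM
    | set(range(0x202A, 0x202F))   # LRE, RLE, PDF, LRO, RLO
    | set(range(0x2066, 0x206A))   # LRI, RLI, FSI, PDI
)

# str.translate deletes any character whose code point maps to None.
_DELETE_TABLE = {cp: None for cp in REMOVABLE}


def strip_invisibles(text: str) -> str:
    """Return text with the listed invisible characters removed."""
    return text.translate(_DELETE_TABLE)
```

With this sketch, `strip_invisibles("co\u00ADop")` yields `"coop"`, while a string containing U+00A0 or U+180E comes back unchanged.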



Please view or discuss this issue at 
https://github.com/w3c/charmod-norm/issues/117 using your GitHub 
account
Received on Saturday, 28 January 2017 00:06:04 UTC
