Re: [charmod-norm] which characters, exactly, should be removed in the matching algorithm?

The section states "In almost all of these cases, users may not be 
aware of or cannot be sure if a given document or text string has 
included or omitted one of these characters."

This is actually only partially true. While users may not be aware of 
the presence of these characters, when they do affect the layout of a 
word, that difference is certainly visible to a user.

That's why in IDNA2008 there are context rules for ZWJ / ZWNJ that 
attempt to distinguish cases where the joiners have no effect (or 
only optional effect) and where they do. For example, adding a 
non-joiner where two characters would have joined makes a clear 
visual difference; doing the same thing between two characters that 
wouldn't have joined anyway remains undetectable.

The rules also treat conjunct formation as making these characters' 
effect visible; that is, their presence is allowed in those locations 
in an IDNA2008 identifier (which makes them meaningful for matching).
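
To make the conjunct case concrete, here is a rough sketch of the 
"joiner directly after a virama" test, which is the part of the 
IDNA2008 context rules that applies to both ZWJ and ZWNJ. The function 
name is made up for illustration, and the additional joining-type 
condition that the rules define for ZWNJ is deliberately left out, 
because Python's unicodedata module does not expose Joining_Type. 
Treat it as an approximation, not a conformant implementation.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # canonical combining class assigned to virama characters


def joiner_follows_virama(text: str, index: int) -> bool:
    """Return True if text[index] is ZWJ or ZWNJ and the character
    immediately before it has canonical combining class Virama.

    Hypothetical helper: it covers only the virama condition and
    ignores the joining-type condition that also exists for ZWNJ.
    """
    if index == 0 or text[index] not in (ZWJ, ZWNJ):
        return False
    return unicodedata.combining(text[index - 1]) == VIRAMA_CCC
```

A matcher built along these lines would keep a joiner for which this 
returns True and could consider dropping one for which it returns 
False, subject to the ZWNJ condition that the sketch omits.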

SHY is an example of a character which is always invisible _*except*_ 
when it causes the word to break across a line. There's a much 
stronger case for filtering it unconditionally in matching.

Variation selectors are more akin to font selection; if no suitable 
glyphs exist, they have no visible effect. Removal seems fine.

ZWSP and WJ (the replacement for ZWNBSP, aka BOM) are like SHY: they 
have no visible effects except in line breaking or segmentation. 
Removal seems fine.
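
As a sketch of what unconditional removal could look like for the 
characters discussed so far (SHY, ZWSP, WJ and the variation 
selectors), assuming matching operates on a Python str: the function 
names are invented for illustration, and only the two main variation 
selector blocks (U+FE00..U+FE0F and U+E0100..U+E01EF) are covered. 
ZWJ and ZWNJ are left alone on purpose, since they need the 
context-dependent treatment described earlier.

```python
SHY = "\u00AD"   # SOFT HYPHEN
ZWSP = "\u200B"  # ZERO WIDTH SPACE
WJ = "\u2060"    # WORD JOINER

ALWAYS_REMOVED = {SHY, ZWSP, WJ}


def is_variation_selector(ch: str) -> bool:
    """True for VS1..VS16 and VS17..VS256 (assumed scope of this sketch)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_for_matching(text: str) -> str:
    """Drop characters that are removed unconditionally before matching.

    ZWJ (U+200D) and ZWNJ (U+200C) are intentionally not removed here.
    """
    return "".join(
        ch
        for ch in text
        if ch not in ALWAYS_REMOVED and not is_variation_selector(ch)
    )
```

For instance, strip_for_matching("co\u00ADop") gives "coop", while a 
string containing ZWNJ comes back unchanged.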

And so on.

-- 
GitHub Notification of comment by asmusf
Please view or discuss this issue at 
https://github.com/w3c/charmod-norm/issues/117#issuecomment-275885538 
using your GitHub account

Received on Sunday, 29 January 2017 00:40:58 UTC