Re: [charmod-norm] Does ZWJ/ZWNJ affect meaning? from r12a via GitHub on 2016-04-05 (public-i18n-archive@w3.org from April to June 2016)

From: r12a via GitHub <sysbot+gh@w3.org>
Date: Tue, 05 Apr 2016 17:45:20 +0000
To: public-i18n-archive@w3.org
Message-ID: <issue_comment.created-205916052-1459878319-sysbot+gh@w3.org>

I'm not sure we're finished with this section yet. We note that
matching can fail if these characters are present, but we don't say
what that implies for implementers. (I don't think we mean to say
that content authors shouldn't use them.)

I think the question that needs answering is whether or not
implementers should ignore the invisible characters while matching
strings. The answer is apparently not clear-cut.

[1] In most cases it seems that they should be ignored – including 
Persian cases such as بهرهوری vs بهره‌وری, where it makes a difference
to the reader but not to the machine that is doing the comparison.
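
As a rough illustration of case [1], here is a minimal Python sketch
that ignores ZWJ and ZWNJ when comparing. The function names are made
up for this example, and the code points used to spell the Persian word
are my assumption about how it is typed (in particular, U+06CC for the
final yeh).

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# Translation table that deletes both join controls.
_JOIN_CONTROLS = str.maketrans("", "", ZWNJ + ZWJ)


def strip_join_controls(text: str) -> str:
    """Return text with every ZWJ and ZWNJ removed."""
    return text.translate(_JOIN_CONTROLS)


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring ZWJ and ZWNJ."""
    return strip_join_controls(a) == strip_join_controls(b)


# Assumed spelling of the two halves of the Persian example word.
stem = "\u0628\u0647\u0631\u0647"
suffix = "\u0648\u0631\u06cc"

with_zwnj = stem + ZWNJ + suffix
without_zwnj = stem + suffix

print(with_zwnj == without_zwnj)               # plain comparison
print(loosely_equal(with_zwnj, without_zwnj))  # join controls ignored
```

A real implementation would presumably apply Unicode normalization as
well; that is left out here to keep the sketch short.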

[2] However, there appear to be specific instances in certain locales
where these characters do make a semantic difference, such as the
distinction in meaning between تنها ("alone") and تن‌ها ("bodies" or
"corpuses") in Persian. If the matching algorithm needs to be
sensitive to semantic differences of this kind, then the
implementation needs to take the invisible characters used in those
particular circumstances into account. By implication, it therefore
needs to know how to identify those circumstances.
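
One conceivable way to handle case [2], sketched in Python purely as an
assumption on my part: ignore the join controls by default, but fall
back to an exact comparison for words on a locale-specific exception
list. The list, its single entry and the function name are all
hypothetical, and the sketch only handles single words.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
_JOIN_CONTROLS = str.maketrans("", "", ZWNJ + ZWJ)

# Hypothetical exception list for Persian: words (with join controls
# removed) for which the presence of ZWNJ changes the meaning.
SENSITIVE_WORDS = {"\u062a\u0646\u0647\u0627"}  # teh, noon, heh, alef


def words_match(a: str, b: str, sensitive=SENSITIVE_WORDS) -> bool:
    """Match two words, honouring join controls only where listed."""
    bare_a = a.translate(_JOIN_CONTROLS)
    bare_b = b.translate(_JOIN_CONTROLS)
    if bare_a != bare_b:
        return False   # different even when join controls are ignored
    if bare_a in sensitive:
        return a == b  # ZWNJ is semantically significant for this word
    return True        # otherwise the join controls are ignorable


alone = "\u062a\u0646\u0647\u0627"
bodies = "\u062a\u0646" + ZWNJ + "\u0647\u0627"
print(words_match(alone, bodies))
```

How such a list would be built and maintained per locale is exactly the
open question about identifying those circumstances.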

[3] We should also note that good matching algorithms may have to 
contend with workarounds that people apply, such as the example 
mentioned by @Ladsgroup with بهره وری. In this case, the matching 
algorithm needs to be aware that some people use spaces instead of
ZWNJ, so in such cases the space should be ignored when matching.
(Similar issues may arise in Nastaliq script, where spaces may or may
not be used as word separators, as I understand it.) (And by the way,
we haven't touched on morphological reduction for matching, which is
particularly difficult in Arabic and is also complicated in
agglutinative languages, but that's another story.)
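
For case [3], a deliberately crude Python sketch: build a comparison
key that drops ordinary spaces along with ZWJ and ZWNJ. The function
name is made up, the code points for the Persian word are my
assumption, and the approach over-matches because it removes every
U+0020, including real word boundaries. A practical matcher would
need to restrict this to the positions where a space is known to
stand in for ZWNJ.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
SPACE = "\u0020"

# Delete join controls and plain spaces alike (crude on purpose).
_IGNORABLE = str.maketrans("", "", ZWNJ + ZWJ + SPACE)


def space_insensitive_key(text: str) -> str:
    """Key that treats 'space', 'ZWNJ' and 'nothing' as the same."""
    return text.translate(_IGNORABLE)


# Assumed spelling of the two halves of the Persian example word.
stem = "\u0628\u0647\u0631\u0647"
suffix = "\u0648\u0631\u06cc"

variants = [stem + suffix, stem + ZWNJ + suffix, stem + SPACE + suffix]
keys = {space_insensitive_key(v) for v in variants}
print(len(keys))  # number of distinct keys among the three spellings
```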

By the way, I suspect that similar things could be said about
invisible characters in Mongolian. I think that the FVS (free
variation selector) characters change the visual appearance rather
than the meaning, whereas the MVS (Mongolian vowel separator)
character, which is also invisible and also affects glyph rendering,
additionally indicates a suffix boundary that may be semantically
significant. We could ask on the Mongolian list.
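
If that guess about Mongolian turns out to be right, the matching key
might look something like the Python sketch below: drop the FVS
characters but keep MVS. This is only an illustration of the unconfirmed
idea above. The function name is made up, and the sample strings are
arbitrary letter sequences, not real words.

```python
# MONGOLIAN FREE VARIATION SELECTOR ONE, TWO and THREE
FVS = "\u180b\u180c\u180d"
# MONGOLIAN VOWEL SEPARATOR
MVS = "\u180e"

_FVS_TABLE = str.maketrans("", "", FVS)


def mongolian_match_key(text: str) -> str:
    """Drop FVS (assumed to be purely visual) but keep MVS."""
    return text.translate(_FVS_TABLE)


# Arbitrary Mongolian letters used only to exercise the function.
letters = "\u1828\u1820"

print(mongolian_match_key(letters + "\u180b") == mongolian_match_key(letters))
print(mongolian_match_key("\u1828" + MVS + "\u1820") == mongolian_match_key(letters))
```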

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at 
https://github.com/w3c/charmod-norm/issues/44#issuecomment-205916052 
using your GitHub account

Received on Tuesday, 5 April 2016 17:45:22 UTC