- From: r12a via GitHub <sysbot+gh@w3.org>
- Date: Tue, 05 Apr 2016 17:45:20 +0000
- To: public-i18n-archive@w3.org
I'm not sure we're finished with this section yet. We note that matching can fail if these characters are hanging around, but we don't say what the implications of that are for implementers. (I don't think we mean to say that content authors shouldn't use them.)

I think the question that needs answering is whether or not implementers should ignore the invisible characters while matching strings. The answer is apparently not clear cut.

[1] In most cases it seems that they should be ignored, including Persian cases such as بهره‌وری (with ZWNJ) vs بهرهوری (without), where the ZWNJ makes a difference to the reader but not to the machine that is doing the comparison.

[2] However, there appear to be some specific instances in certain locales where certain characters do make a semantic difference, such as the ZWNJ that distinguishes تنها ("alone") from تن‌ها ("bodies" or "corpuses") in Persian. If the matching algorithm needs to be sensitive to semantic differences of this kind, then the implementation needs to take the invisible characters used in those particular circumstances into account. By implication, it therefore needs to know how to identify those circumstances.

[3] We should also note that good matching algorithms may have to contend with workarounds that people apply, such as the example mentioned by @Ladsgroup with بهره وری. In this case, the matching algorithm needs to be aware that some people use a space instead of ZWNJ, so in such cases the space should be ignored when matching. (Similar issues may arise in the Nastaliq style, where spaces may or may not be used as word separators, as I understand it.)

(And by the way, we haven't touched on morphological reduction for matching, which is particularly difficult in Arabic and is also complicated by agglutinative languages, but that's another story.)

By the way, I suspect that similar things could be said about invisible characters in Mongolian. I think that the FVS characters change the visual appearance rather than the meaning, whereas the MVS character (also invisible, and also affecting glyph rendering) additionally indicates a suffix boundary that may be semantically significant. We could ask on the Mongolian list.

-- GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/charmod-norm/issues/44#issuecomment-205916052 using your GitHub account
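To make points [1] to [3] above more concrete, here is a rough Python sketch of what such a matcher might look like. Everything in it is an assumption made for illustration: the names `match_key` and `loose_match`, the choice to ignore only ZWNJ and ZWJ, the crude "drop every space" option, and the caller-supplied `sensitive` set standing in for whatever locale knowledge would identify the cases in [2]. It is not a proposal for spec text.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
IGNORABLE = {ZWNJ, ZWJ}  # assumption: only these two are ignored here


def match_key(text, ignore_spaces=False):
    """Build a comparison key with the invisible joiners removed.

    With ignore_spaces=True, U+0020 is dropped as well, to cover the
    workaround in [3] where a space is typed instead of ZWNJ.
    """
    text = unicodedata.normalize("NFC", text)
    kept = []
    for ch in text:
        if ch in IGNORABLE:
            continue
        if ignore_spaces and ch == " ":
            continue
        kept.append(ch)
    return "".join(kept)


def loose_match(a, b, ignore_spaces=False, sensitive=frozenset()):
    """Compare two strings, ignoring joiners unless a string is 'sensitive'.

    `sensitive` is a hypothetical, caller-supplied set of strings in which
    the invisible character carries meaning (case [2]); those are compared
    exactly after normalization.
    """
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if a in sensitive or b in sensitive:
        return a == b
    return match_key(a, ignore_spaces) == match_key(b, ignore_spaces)


# The Persian examples, spelled with escapes so the ZWNJ stays visible.
BAHREVARI_ZWNJ = "\u0628\u0647\u0631\u0647\u200c\u0648\u0631\u06cc"
BAHREVARI_PLAIN = "\u0628\u0647\u0631\u0647\u0648\u0631\u06cc"
BAHREVARI_SPACE = "\u0628\u0647\u0631\u0647 \u0648\u0631\u06cc"
TANHA = "\u062a\u0646\u0647\u0627"         # "alone"
TAN_HA = "\u062a\u0646\u200c\u0647\u0627"  # "bodies"
```

With these definitions, `loose_match(BAHREVARI_ZWNJ, BAHREVARI_PLAIN)` treats the two spellings as equal, `loose_match(BAHREVARI_SPACE, BAHREVARI_ZWNJ, ignore_spaces=True)` accepts the space workaround, and `loose_match(TANHA, TAN_HA, sensitive={TAN_HA})` keeps the two words distinct. The hard part, of course, is knowing what belongs in `sensitive`, which is exactly the open question in [2].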
Received on Tuesday, 5 April 2016 17:45:22 UTC