- From: r12a via GitHub <sysbot+gh@w3.org>
- Date: Tue, 05 Apr 2016 17:45:20 +0000
- To: public-i18n-archive@w3.org
I'm not sure we're finished with this section yet. We note that
matching can fail if these characters are hanging around, but we don't
say what the implications of that are for implementers. (I don't
think we mean to say that content authors shouldn't use them.)
I think the question that needs answering is whether implementers
should ignore the invisible characters while matching strings.
The answer is apparently not clear-cut.
[1] In most cases it seems that they should be ignored, including
Persian cases such as بهره‌وری vs بهرهوری (the same word written with
and without ZWNJ), where it makes a difference to the reader but not
to the machine that is doing the comparison.
[2] However, there appear to be some specific instances in certain
locales where certain characters do make a semantic difference, such
as where they create the distinction in meaning between تنها ("alone")
and the word تن‌ها ("bodies" or "corpuses", written with ZWNJ) in
Persian. If the
matching algorithm needs to be sensitive to semantic differences of
this kind, then the implementation needs to take the invisible
characters used in those particular circumstances into account. By
implication, it therefore needs to know how to identify those
circumstances.
[3] We should also note that good matching algorithms may have to
contend with workarounds that people apply, such as the example
mentioned by @Ladsgroup with بهره وری. In this case, the matching
algorithm needs to be aware that some people use spaces instead of
ZWNJ, so in such cases the space should be ignored when matching.
(Similar issues may arise in Nastaliq text, where spaces may or may
not be used as word separators, as I understand it.) (And by the way,
we haven't touched on morphological reduction for matching, which is
particularly difficult in Arabic and is also complicated in
agglutinative languages, but that's another story.)
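
To make the three cases above a bit more concrete, here is a minimal
Python sketch, not a recommendation. The names `fold`, `loosely_equal`
and the `SENSITIVE` exception list are hypothetical, and so is the
assumption that the semantically significant forms can simply be listed
up front; a real implementation would need locale-specific data to
identify them. The strings are written as escapes so that the invisible
characters stay visible.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
IGNORABLE = {ZWNJ, ZWJ}

# Hypothetical exception list for case [2]: forms in which ZWNJ is
# assumed to carry meaning. Here: "bodies" (with ZWNJ) vs "alone".
SENSITIVE = {"\u062a\u0646" + ZWNJ + "\u0647\u0627"}


def fold(text, ignore_space=False):
    """Drop ZWNJ/ZWJ, and optionally U+0020 spaces, from text."""
    drop = IGNORABLE | ({" "} if ignore_space else set())
    return "".join(ch for ch in text if ch not in drop)


def loosely_equal(a, b, ignore_space=False):
    """Compare two strings, ignoring invisible joiners where safe."""
    # Case [2]: if either string is a listed sensitive form,
    # fall back to an exact comparison.
    if a in SENSITIVE or b in SENSITIVE:
        return a == b
    # Cases [1] and [3]: compare the folded forms.
    return fold(a, ignore_space) == fold(b, ignore_space)


with_zwnj = "\u0628\u0647\u0631\u0647\u200c\u0648\u0631\u06cc"
without = "\u0628\u0647\u0631\u0647\u0648\u0631\u06cc"
with_space = "\u0628\u0647\u0631\u0647 \u0648\u0631\u06cc"
alone = "\u062a\u0646\u0647\u0627"
bodies = "\u062a\u0646\u200c\u0647\u0627"

print(loosely_equal(with_zwnj, without))                        # True
print(loosely_equal(with_zwnj, with_space))                     # False
print(loosely_equal(with_zwnj, with_space, ignore_space=True))  # True
print(loosely_equal(alone, bodies))                             # False
```

Dropping every space when `ignore_space` is set is deliberately crude;
it only illustrates the workaround in [3] and would over-match on
multi-word strings.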
By the way, I suspect that similar things could be said about
invisible characters in Mongolian. I think that the FVS characters
change the visual appearance rather than the meaning, whereas the MVS
character (also invisible, and also producing an effect on the glyph
rendering) also indicates a suffix boundary that may be semantically
significant. We could ask on the Mongolian list.
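
If that suspicion about Mongolian holds, a comparison key might drop
the free variation selectors but keep the vowel separator. The sketch
below assumes exactly that and nothing more; `mongolian_key` is a
hypothetical name, and whether MVS should really be preserved is the
open question for the Mongolian list.

```python
# MONGOLIAN FREE VARIATION SELECTOR ONE..THREE, plus FOUR (U+180F)
FVS = {"\u180b", "\u180c", "\u180d", "\u180f"}
MVS = "\u180e"  # MONGOLIAN VOWEL SEPARATOR, deliberately not stripped


def mongolian_key(text):
    """Drop FVS characters (assumed visual only); keep MVS."""
    return "".join(ch for ch in text if ch not in FVS)


plain = "\u1828\u1820"            # MONGOLIAN LETTER NA + LETTER A
with_fvs = "\u1828\u180b\u1820"   # same letters with FVS1 between
with_mvs = "\u1828\u180e\u1820"   # same letters with MVS between

print(mongolian_key(with_fvs) == mongolian_key(plain))  # True
print(mongolian_key(with_mvs) == mongolian_key(plain))  # False
```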
--
GitHub Notification of comment by r12a
Please view or discuss this issue at
https://github.com/w3c/charmod-norm/issues/44#issuecomment-205916052
using your GitHub account
Received on Tuesday, 5 April 2016 17:45:22 UTC