[charmod-norm] Arabic & Hebrew issues with 2.4

r12a has just created a new issue for 
https://github.com/w3c/charmod-norm:

== Arabic & Hebrew issues with 2.4 ==
[ moved here from issue #80 ]
raised by tomerm

**On Unicode Controls and Invisible Markers**
Text in languages written in bidirectional scripts may contain 
sections (called directional runs) with different directions: for 
example, Arabic words run from right to left, while Latin words and 
numbers run from left to right. Sentences in Arabic or Hebrew quite 
often include Latin words or numbers. The readability of a sentence 
is greatly affected by the base direction in which the text is 
displayed, because this direction determines the relative order of 
the directional runs as they are laid out on the screen. If it 
differs from the natural direction of the language in which the 
sentence is written, the sentence can become incomprehensible. 
Unfortunately, plain text strings generally carry no information 
about text direction (e.g. String in Java is a final class and does 
not include any information about text directionality). Thus, unless 
a higher-level protocol can be used for that purpose (e.g. HTML 
markup with the dir attribute), there is no way to persist text 
direction information. 
Consequently, many solutions rely on Unicode control characters 
(UCC). These are explicitly described in the Unicode Bidirectional 
Algorithm specification: http://unicode.org/reports/tr9/. They are 
valid Unicode characters that have no glyph associated with them 
(that is, they are invisible). However, they do affect how text is 
displayed. To enforce LTR text direction, text is usually enclosed 
between the LRE (U+202A) and PDF (U+202C) control characters; to 
enforce RTL text direction, it is usually enclosed between the RLE 
(U+202B) and PDF control characters.
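
To make the technique concrete, here is a minimal Python sketch of 
enclosing a string between an embedding control and PDF. The helper 
name `wrap_with_direction` and the sample string are made up for 
illustration and do not come from any particular product.

```python
# Directional formatting characters defined in UAX #9.
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def wrap_with_direction(text: str, rtl: bool) -> str:
    """Enclose text between RLE/LRE and PDF (hypothetical helper)."""
    return (RLE if rtl else LRE) + text + PDF


# Hebrew word followed by Latin text and a number.
plain = "\u05E9\u05DC\u05D5\u05DD Java 8"
stored = wrap_with_direction(plain, rtl=True)

# The stored string looks the same when rendered as RTL, but it is
# two code points longer and no longer compares equal to the original.
print(len(plain), len(stored))  # 11 13
print(stored == plain)          # False
```

The two extra code points are invisible on screen, which is why 
such strings are easy to persist without noticing the difference.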
As a result of such techniques, the text can include UCC, which 
will affect both searching and sorting of the text.
The suggested approach is to ignore, during text sorting and 
searching, any UCC that can be used to store text directionality. 
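
One possible way to "ignore" such characters is sketched below in 
Python: remove them from a copy of the string that is used only as 
the comparison key. The set of characters to drop (the directional 
formatting characters listed in UAX #9) and the function name 
`strip_bidi_controls` are assumptions made for this example; the 
issue itself does not prescribe an exact list or mechanism.

```python
import re

# Assumed set of characters to ignore: LRM, RLM, ALM, the embedding
# and override controls (LRE, RLE, PDF, LRO, RLO) and the isolate
# controls (LRI, RLI, FSI, PDI).
_BIDI_CONTROLS = re.compile(
    "[\u200E\u200F\u061C\u202A-\u202E\u2066-\u2069]"
)


def strip_bidi_controls(text: str) -> str:
    """Return text with directional formatting characters removed."""
    return _BIDI_CONTROLS.sub("", text)


# Sorting: the invisible prefix decides the order unless it is ignored.
names = ["\u202Bbeta\u202C", "alpha", "\u202Agamma\u202C"]
naive = sorted(names)
ignoring = sorted(names, key=strip_bidi_controls)
print([strip_bidi_controls(n) for n in naive])     # ['alpha', 'gamma', 'beta']
print([strip_bidi_controls(n) for n in ignoring])  # ['alpha', 'beta', 'gamma']

# Searching: a control character inside the text breaks a plain match.
haystack = "abc \u202Bdef\u202C ghi"
print("abc def" in haystack)                       # False
print("abc def" in strip_bidi_controls(haystack))  # True
```

The stored strings are left untouched, so the direction information 
is still available for display; only the keys used for comparison 
have the controls removed.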

Please view or discuss this issue at 
https://github.com/w3c/charmod-norm/issues/82 using your GitHub 
account

Received on Monday, 4 April 2016 17:34:07 UTC