[charmod-norm] Arabic & Hebrew unique issues associated with sections 2.4 and 2.5

tomerm has just created a new issue for 
https://github.com/w3c/charmod-norm:

== Arabic & Hebrew unique issues associated with sections 2.4 and 2.5 ==
I would like to comment on the following sections of the specification:
     - 2.4 Unicode Controls and Invisible Markers
     - 2.5 Legacy Character Encodings

**On Unicode Controls and Invisible Markers**
Text in languages with bidirectional scripts may contain sections 
(called directional runs) with different directions: for example, 
Arabic words run from right to left, while Latin words and numbers 
run from left to right. Sentences in Arabic / Hebrew quite often 
include Latin words / numbers. The readability of a sentence is 
greatly affected by the direction with which the text is displayed 
(this direction affects the relative order of directional runs as 
they are laid out on the screen). If this direction differs from the 
natural direction of the language in which the sentence is written, 
the sentence can become incomprehensible. Unfortunately, most current 
technologies provide no way to specify the direction of text (e.g. 
String in Java is a final class and does not include any information 
about text directionality). Thus, unless there is a higher-level 
protocol (e.g. HTML markup with the dir attribute) that can be used 
for that purpose, there is no way to persist text direction 
information. Consequently, many solutions rely on Unicode control 
characters. These are explicitly described in the Unicode 
Bidirectional Algorithm specification: 
http://unicode.org/reports/tr9/. They are valid Unicode characters 
that have no glyph associated with them (that is, they are 
invisible). However, they do affect how text is displayed. To enforce 
LTR text direction, text is usually enclosed between LRE and PDF 
control characters; to enforce RTL text direction, it is usually 
enclosed between RLE and PDF control characters. 
As a result of such techniques, the text can include Unicode control 
characters (UCC), which will affect both searching and sorting of the 
text. 
The suggested approach is to ignore, during text sorting / searching, 
any UCC that can be used for storing text directionality. 
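
To illustrate the suggestion, here is a minimal Python sketch. The 
helper names (enforce_ltr, enforce_rtl, strip_bidi_controls) are 
hypothetical and exist only for this example. The set of characters 
removed is my assumption of what "UCC used for storing text 
directionality" should cover (the bidi formatting characters from 
UAX #9); whether overrides such as LRO / RLO should also be ignored 
is a separate decision, since they change the displayed order.

```python
# Bidi formatting characters defined by UAX #9.
LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"

# Assumption: these are the controls to ignore when comparing text.
BIDI_CONTROLS = (
    "\u061C"                          # ALM
    "\u200E\u200F"                    # LRM, RLM
    "\u202A\u202B\u202C\u202D\u202E"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)

# Translation table mapping each control to None, i.e. "delete it".
_DELETE = dict.fromkeys(map(ord, BIDI_CONTROLS))


def enforce_ltr(text: str) -> str:
    """Wrap text in LRE ... PDF, as described above."""
    return LRE + text + PDF


def enforce_rtl(text: str) -> str:
    """Wrap text in RLE ... PDF, as described above."""
    return RLE + text + PDF


def strip_bidi_controls(text: str) -> str:
    """Return text without bidi formatting characters."""
    return text.translate(_DELETE)


if __name__ == "__main__":
    wrapped = enforce_ltr("abc")
    # The wrapped string is no longer equal to the bare string ...
    print(wrapped == "abc")
    # ... unless the controls are ignored for the comparison.
    print(strip_bidi_controls(wrapped) == "abc")
    # The same function can serve as a sort key.
    data = [enforce_ltr("b"), "a", enforce_rtl("c")]
    print([strip_bidi_controls(s) for s in sorted(data, key=strip_bidi_controls)])
```

Note that the stored strings are left untouched; the controls are 
only ignored when building the comparison key, so the direction 
information is not lost.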

**On Legacy Character Encodings**
One of the legacy code pages worth considering in this context is 
https://en.wikipedia.org/wiki/EBCDIC. The majority of bidi (Arabic / 
Hebrew) data stored on mainframe systems is in this code page (with a 
minority stored in Unicode). The problem is that bidi data stored on 
modern operating systems (e.g. Windows, Android, iOS) is radically 
different from bidi data stored on mainframes: the two kinds of 
systems store it in different bidi layouts. The bidi layout basically 
determines the relative order of characters / text segments in the 
text buffer. Code page conversion does not necessarily take this into 
account. Thus, when data from a mainframe is converted to Unicode, it 
can still be in the visual bidi layout used on mainframes (instead of 
the logical bidi layout used on modern operating systems). Data in 
different bidi layouts cannot be meaningfully compared; attempting to 
do so leads to incorrect sort / search results. 
The suggestion is to ensure that the texts being sorted / searched 
are in the same bidi layout.
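
A small Python sketch of the problem, under a strong simplifying 
assumption: the text consists only of right-to-left letters, so that 
the visual (left-to-right storage) layout is simply the reverse of 
the logical layout. The function name is hypothetical. Real data 
mixing Hebrew / Arabic with Latin words and numbers needs a full 
implementation of the Unicode Bidirectional Algorithm to convert 
between layouts; this only shows why the comparison fails.

```python
def visual_to_logical_rtl_only(visual: str) -> str:
    """Convert a visual-layout string to logical layout.

    Assumption: the string contains only RTL letters (no Latin text,
    digits, or combining marks), so reversing the buffer is enough.
    """
    return visual[::-1]


# The Hebrew word "shalom" in logical layout (reading order) ...
logical = "\u05E9\u05DC\u05D5\u05DD"
# ... and as it might arrive from a system that stores the visual layout.
visual = "\u05DD\u05D5\u05DC\u05E9"

if __name__ == "__main__":
    # Same word, same code points, different buffer order: not equal.
    print(visual == logical)
    # After bringing both to the same layout the comparison succeeds.
    print(visual_to_logical_rtl_only(visual) == logical)
```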

Please view or discuss this issue at 
https://github.com/w3c/charmod-norm/issues/80 using your GitHub 
account

Received on Saturday, 20 February 2016 18:19:20 UTC