- From: tomerm via GitHub <sysbot+gh@w3.org>
- Date: Sat, 20 Feb 2016 18:19:18 +0000
- To: public-i18n-archive@w3.org
tomerm has just created a new issue for https://github.com/w3c/charmod-norm:

== Arabic & Hebrew unique issues associated with sections 2.4 and 2.5 ==

I would like to comment on the following sections of the specification:

- 2.4 Unicode Controls and Invisible Markers
- 2.5 Legacy Character Encodings

**On Unicode Controls and Invisible Markers**

Text in bidirectional scripts may contain segments (called directional runs) with different directions: for example, Arabic words run from right to left, while Latin words and numbers run from left to right. Sentences in Arabic or Hebrew quite often include Latin words or numbers. The readability of a sentence depends heavily on the direction with which the text is displayed, because that direction determines the relative order of the directional runs as they are laid out on the screen. If it differs from the natural direction of the language in which the sentence is written, the sentence becomes incomprehensible.

Unfortunately, common string data types provide no way to specify the direction of text (e.g. String in Java is a final class and carries no information about text directionality). Thus, unless a higher-level protocol is available for that purpose (e.g. HTML markup with the dir attribute), there is no way to persist text direction information.

Consequently, many solutions rely on Unicode control characters. These are explicitly described in the Unicode Bidirectional Algorithm specification: http://unicode.org/reports/tr9/. They are valid Unicode characters that have no glyph associated with them (that is, they are invisible), yet they do affect how text is displayed. To enforce LTR text direction, text is usually enclosed between LRE and PDF control characters; to enforce RTL text direction, it is usually enclosed between RLE and PDF control characters.
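The enclosing technique described above can be sketched in a few lines. This is only an illustration, not code from the specification: the helper names `force_ltr` and `force_rtl` and the sample string are hypothetical, while the three code points are the LRE, RLE and PDF characters defined in the Unicode Bidirectional Algorithm.

```python
# Bidi embedding controls defined in UAX #9 (Unicode Bidirectional Algorithm).
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def force_ltr(text: str) -> str:
    """Enclose text between LRE and PDF (hypothetical helper)."""
    return f"{LRE}{text}{PDF}"


def force_rtl(text: str) -> str:
    """Enclose text between RLE and PDF (hypothetical helper)."""
    return f"{RLE}{text}{PDF}"


# Hypothetical sample: a Hebrew word followed by a Latin word and a number.
sample = "\u05E9\u05DC\u05D5\u05DD Java 8"
wrapped = force_rtl(sample)

# The controls have no glyph, but they are real code points in the string:
# the wrapped text is two characters longer and no longer equals the original.
print(len(wrapped) - len(sample))
print(wrapped == sample)
```

The two strings would look identical to a reader once rendered in an RTL context, yet a plain equality check already treats them as different.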
As a result of such techniques, text can contain Unicode control characters (UCCs), which will certainly affect both searching and sorting of the text. The suggested approach is to ignore, during sorting and searching, the UCCs that can be used to store text directionality.

**On legacy character encodings**

One legacy encoding family worth considering in this context is EBCDIC (https://en.wikipedia.org/wiki/EBCDIC). The majority of bidi (Arabic / Hebrew) data stored on mainframe systems is encoded in EBCDIC, with a minority stored in Unicode. The problem is that bidi data stored on modern operating systems (e.g. Windows, Android, iOS) is radically different from bidi data stored on mainframes: the two kinds of system store it in different bidi layouts. A bidi layout basically determines the relative order of characters and text segments in the text buffer. Code page conversion does not necessarily take this into account. Thus, when data from a mainframe is converted to Unicode, it can still be in the visual bidi layout used on mainframes instead of the logical bidi layout used on modern operating systems. Data in different bidi layouts cannot be compared meaningfully; doing so leads to incorrect sort and search results. The suggestion is to ensure that the texts being sorted or searched are in the same bidi layout.

Please view or discuss this issue at https://github.com/w3c/charmod-norm/issues/80 using your GitHub account
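As a rough sketch of the first suggestion above (ignoring direction-storing UCCs when sorting or searching), the following assumes that the characters to ignore are the directional formatting characters listed in the Unicode Bidirectional Algorithm. The names `BIDI_CONTROLS`, `strip_bidi_controls` and `bidi_insensitive_key` are hypothetical, and plain code point ordering stands in for whatever collation a real application would use.

```python
# Directional formatting characters from UAX #9. Treating exactly this set
# as ignorable is an assumption; an application may need a different set.
BIDI_CONTROLS = frozenset(
    "\u200E\u200F"                    # LRM, RLM
    "\u061C"                          # ALM
    "\u202A\u202B\u202C\u202D\u202E"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)


def strip_bidi_controls(text: str) -> str:
    """Return text with the directional formatting characters removed."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


def bidi_insensitive_key(text: str) -> str:
    """Sort key that ignores bidi controls (code point order otherwise)."""
    return strip_bidi_controls(text)


# Hypothetical data: two of the three entries carry embedding controls.
words = ["\u202Bbeta\u202C", "alpha", "\u202Agamma\u202C"]

# Sorting on the raw strings orders entries by their invisible first character.
naive = sorted(words)
# Sorting on the stripped key orders them by their visible content.
cleaned = sorted(words, key=bidi_insensitive_key)
print([strip_bidi_controls(w) for w in naive])
print([strip_bidi_controls(w) for w in cleaned])

# Searching: compare stripped forms so that the controls do not block a match.
print("beta" in strip_bidi_controls(words[0]))
```

This does not address the second suggestion: converting between visual and logical bidi layouts requires applying the bidirectional algorithm itself and is not attempted here.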
Received on Saturday, 20 February 2016 18:19:20 UTC