- From: tomerm via GitHub <sysbot+gh@w3.org>
- Date: Sat, 20 Feb 2016 18:19:18 +0000
- To: www-international@w3.org
tomerm has just created a new issue for
https://github.com/w3c/charmod-norm:
== Arabic & Hebrew unique issues associated with sections 2.4 and 2.5 ==
I would like to comment on the following sections of the specification:
- 2.4 Unicode Controls and Invisible Markers
- 2.5 Legacy Character Encodings
**On Unicode Controls and Invisible Markers**
Text in languages with bidirectional scripts may contain segments
(called directional runs) with different directions (e.g. Arabic
words run from right to left, while Latin words and numbers run from
left to right). Sentences in Arabic / Hebrew quite often include
Latin words / numbers. The readability of a sentence is greatly
affected by the direction with which the text is displayed, because
this direction determines the relative order of the directional runs
as they are laid out on the screen. If this direction differs from
the natural direction of the language in which the sentence is
written, the sentence can become incomprehensible. Unfortunately,
plain-text string types generally offer no way to specify the
direction of the text (e.g. String in Java is a final class and does
not include any information about text directionality). Thus, unless
a higher-level protocol (e.g. HTML markup with the dir attribute) can
be used for that purpose, there is no way to persist text direction
information.
Consequently, many solutions rely on Unicode control characters. These
are explicitly described in the Unicode Bidirectional Algorithm
specification: http://unicode.org/reports/tr9/. They are valid Unicode
characters that have no glyph associated with them (that is, they are
invisible characters). However, they do affect how text is displayed.
To enforce LTR text direction, text is usually enclosed between the
LRE (U+202A) and PDF (U+202C) control characters, while to enforce RTL
text direction, text is usually enclosed between the RLE (U+202B) and
PDF control characters.
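
As a minimal sketch of the enclosing technique described above, the
following Python snippet wraps a string in an embedding control and
PDF. The helper name `embed` is hypothetical and used only for
illustration; the code points are those defined in the Unicode
Bidirectional Algorithm.

```python
# Explicit embedding controls defined by the Unicode Bidirectional Algorithm.
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed(text: str, rtl: bool) -> str:
    """Enclose text between RLE/LRE and PDF (hypothetical helper)."""
    return (RLE if rtl else LRE) + text + PDF


plain = "abc"
wrapped = embed(plain, rtl=True)

# The wrapped string usually renders like the plain one, but it is two
# code points longer and no longer compares equal to it.
print(len(plain), len(wrapped))
print(plain == wrapped)
```

The invisible controls become part of the stored string, which is why
they later interfere with comparison.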
As a result of such techniques, the text can include Unicode control
characters (UCC), which will certainly affect both searching and
sorting of the text.
The suggested approach is to ignore, during text sorting / searching,
those UCC that can be used for storing text directionality.
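
One possible way to apply this suggestion, sketched in Python, is to
remove the bidi control characters from a copy of each string before
comparing. The names `BIDI_CONTROLS`, `strip_bidi_controls` and
`contains_ignoring_bidi` are hypothetical, and the set of characters
below is an assumption about which controls to treat as ignorable; a
real implementation might choose a different set.

```python
# Bidi formatting controls: ALM, LRM, RLM, LRE, RLE, PDF, LRO, RLO,
# LRI, RLI, FSI, PDI. Mapping a code point to None makes
# str.translate delete it.
BIDI_CONTROLS = dict.fromkeys(
    map(ord, "\u061C\u200E\u200F\u202A\u202B\u202C\u202D\u202E"
             "\u2066\u2067\u2068\u2069")
)


def strip_bidi_controls(text: str) -> str:
    """Return a copy of text without bidi formatting controls."""
    return text.translate(BIDI_CONTROLS)


def contains_ignoring_bidi(haystack: str, needle: str) -> bool:
    """Substring search that ignores bidi formatting controls."""
    return strip_bidi_controls(needle) in strip_bidi_controls(haystack)


words = ["b", "\u202Ba\u202C"]  # second entry is "a" wrapped in RLE ... PDF

# A plain code point sort puts "b" first, because RLE (U+202B) sorts
# after every ASCII letter.
naive = sorted(words)

# Sorting on the stripped key restores the expected order: "a", then "b".
ignoring_controls = sorted(words, key=strip_bidi_controls)
```

The stored strings are left untouched, so the directionality
information is still available for display.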
**On Legacy Character Encodings**
One legacy encoding worth considering in this context is
https://en.wikipedia.org/wiki/EBCDIC. The majority of Bidi (Arabic /
Hebrew) data stored on mainframe systems is in this encoding (with a
minority stored in Unicode). The problem is that the way Bidi data is
stored on modern operating systems (e.g. Windows, Android, iOS) is
radically different from the way it is stored on mainframes: the two
use different bidi layouts. The bidi layout determines the relative
order of characters / text segments in the text buffer. Code page
conversion does not necessarily take this into account. Thus, when
data from a mainframe is converted to Unicode, it can still be in the
visual bidi layout used on mainframes (instead of the logical bidi
layout used on modern operating systems). Data in different bidi
layouts cannot be compared meaningfully; attempting to do so leads to
incorrect sort / search results.
The suggestion is to ensure that all text being sorted / searched is
in the same bidi layout.
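
To illustrate why the layouts must match, here is a deliberately
simplified Python sketch. It assumes a string that consists of a
single right-to-left run (Hebrew letters only), in which case the
visual layout is just the logical layout reversed. The helper
`visual_to_logical_pure_rtl` is hypothetical; text with embedded Latin
words, numbers or mirrored characters needs a real layout
transformation based on the Unicode Bidirectional Algorithm.

```python
def visual_to_logical_pure_rtl(visual: str) -> str:
    """Convert a visual-layout buffer to logical layout.

    Hypothetical helper: only valid when the whole string is one
    right-to-left run. Mixed-direction text is NOT handled.
    """
    return visual[::-1]


# The Hebrew word "shalom" in logical layout (the order it is typed and read).
logical = "\u05E9\u05DC\u05D5\u05DD"

# The same word as it might arrive from a system using visual layout:
# characters stored in left-to-right display order.
visual = "\u05DD\u05D5\u05DC\u05E9"

prefix = "\u05E9\u05DC"  # first two letters of the word, logical layout

# Comparing across layouts fails even though both buffers hold the same word.
same_word_found = logical == visual
prefix_found_in_visual = prefix in visual

# After bringing both into one layout, comparison and search behave as expected.
normalized = visual_to_logical_pure_rtl(visual)
prefix_found_after = prefix in normalized
```

Whichever layout is chosen as the common one, the conversion has to
happen before the sort or search, not after.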
See https://github.com/w3c/charmod-norm/issues/80
Further comments on this issue will NOT be notified to this list. If
you'd like to follow the discussion, please do so by subscribing to
the issue via the above link. Do not reply to this email.
Received on Saturday, 20 February 2016 18:19:21 UTC