[review feedback] Visual vs. logical ordering of text

These are consolidated comments from the IBM Bidi Globalization Center of 
Competency on the document stored at:
http://www.w3.org/International/tutorials/new-bidi-xhtml/qa-visual-vs-logical

General observations

Modern systems often keep back-end data, including legacy data created at 
some point on green screens, on a visually ordered system (such as a 
mainframe or iSeries). Such systems must support data flowing in both 
directions between the back end (visual ordering) and the web front end 
(logical ordering). 
Two things may happen when data passes between the back end and the front 
end: 
    a. Code page conversion
    b. Bidi layout transformation

The first is required because bidi data is represented in different code 
pages on different systems (e.g. EBCDIC on visual back-end systems and 
ASCII / Unicode on logical front-end systems).
The second is required because visual and logical systems relate the order 
in which bidi data is stored to the order in which it is displayed in 
different ways.
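
To illustrate step (a) in isolation, here is a minimal sketch using 
Python's built-in "cp424" codec (EBCDIC Hebrew). The function names and the 
sample word are assumptions made for this example only. The conversion 
changes byte values but leaves the order of characters untouched, which is 
why step (b) has to be handled separately.

```python
# Sketch of step (a), code page conversion, in isolation.
# "cp424" is Python's built-in EBCDIC Hebrew codec; the helper names and
# the sample word are hypothetical and used only for illustration.

def ebcdic_to_unicode(raw: bytes, codepage: str = "cp424") -> str:
    """Decode back-end bytes; the order of characters is not changed."""
    return raw.decode(codepage)


def unicode_to_ebcdic(text: str, codepage: str = "cp424") -> bytes:
    """Encode front-end text; raises UnicodeEncodeError if a character
    has no mapping in the target code page."""
    return text.encode(codepage)


sample = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem
raw = unicode_to_ebcdic(sample)      # one EBCDIC byte per Hebrew letter
restored = ebcdic_to_unicode(raw)    # same characters, same order
```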

The following data integrity issues should be taken into account from the 
code page conversion perspective: 
- Code page conversion of Arabic between Unicode and EBCDIC can compromise 
data integrity if it is not handled carefully, because some ligatures (such 
as the Lam Alef character) are stored as one character in EBCDIC and as two 
characters in Unicode.
- In addition, shaped Arabic EBCDIC data that has been converted to the 
isolated Unicode form might lose integrity when it is converted back to the 
EBCDIC code page if not handled properly. The same might happen with 
Arabic-Indic digits, which are stored in this format in the EBCDIC code 
page.
These issues are unique to the Arabic language.
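
The one-to-two expansion can be seen within Unicode itself. The sketch 
below is not an EBCDIC conversion: it uses the Unicode Arabic presentation 
forms as a stand-in for shaped code points, and NFKC normalization as a 
stand-in for the "unshaping" a converter might do. The helper name is an 
assumption.

```python
import unicodedata

# Stand-ins for shaped back-end code points (assumption: Unicode
# presentation forms are used here only to illustrate the problem).
LAM_ALEF_LIGATURE = "\ufefb"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
BEH_INITIAL = "\ufe91"        # ARABIC LETTER BEH INITIAL FORM


def to_base_letters(text: str) -> str:
    """Replace presentation forms by their base letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)


expanded = to_base_letters(LAM_ALEF_LIGATURE)  # one character becomes two
unshaped = to_base_letters(BEH_INITIAL)        # the shape information is lost

# Neither result equals its input, so converting back requires
# re-ligating and re-shaping to recover the original stored form.
needs_reshaping = expanded != LAM_ALEF_LIGATURE and unshaped != BEH_INITIAL
```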

The following data integrity issue should be taken into account from the 
bidi layout transformation perspective:
- Because the conversion that the Unicode Bidirectional Algorithm (UBA) 
performs between visual and logical ordering is in general not reversible, 
visual data should not be converted to the logical ordering scheme if its 
consistency is to be preserved. This matters when such data is edited in 
the web front end. To handle visual data properly in such cases, the UBA 
running on logical platforms should be disabled or overridden. A technique 
for achieving this with Unicode control characters (UCC) such as LRO is 
described in the section "overriding the algorithm" in 
http://www.w3.org/International/tutorials/new-bidi-xhtml/Overview-inline.
Modern Dojo-based toolkits come with controls that use this technique to 
provide a native experience for working with and editing visual data.
This data integrity issue is common to both Arabic and Hebrew. 
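
As a rough sketch of the override technique, visually ordered text can be 
wrapped in LRO (U+202D) and PDF (U+202C) for display and unwrapped again 
before it is written back, so that the stored characters are never 
reordered. The function names are hypothetical, and this is not the 
implementation used by any particular toolkit.

```python
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def wrap_visual(text: str) -> str:
    """Force left-to-right display of visually ordered text."""
    return f"{LRO}{text}{PDF}"


def unwrap_visual(text: str) -> str:
    """Strip the override pair before storing the text on the back end."""
    if text.startswith(LRO) and text.endswith(PDF):
        return text[1:-1]
    return text


stored = "\u05dd\u05d5\u05dc\u05e9 abc"  # visually ordered sample (assumption)
shown = wrap_visual(stored)
written_back = unwrap_visual(shown)       # identical to what was stored
```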


Section Quick Answer
> "... Visual ordering of text was a common way of representing Hebrew in 
HTML on old user agents that didn't support the Unicode bidirectional 
algorithm. Very little persists today. Characters making up the text were 
stored in the source code in the same order you would see them displayed 
on screen when looking from left to right.
(Visual ordering isn't really seen much for Arabic. Since the Arabic 
letters are all joined up there was a stronger motivation on the part of 
Arabic implementers to enable the logical ordering approach.)..."
Visual ordering of text is a common way of representing Arabic / Hebrew on 
systems that do not support the UBA, such as mainframe or iSeries. Those 
systems are still widely used today. On such systems, the characters making 
up the text are stored in the source code in the same order in which you 
would see them displayed on screen when looking from left to right.
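
For the simplest case, a run made up only of Hebrew letters, the 
difference between the two storage orders can be sketched as follows. The 
function name is an assumption, and plain reversal is only valid for a 
single right-to-left run without digits, Latin text or combining marks.

```python
# Logical order: the order in which the letters are typed and read.
logical = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem


def naive_visual(run: str) -> str:
    """Left-to-right screen order of a single right-to-left run.

    Assumption: the run contains only right-to-left letters; mixed text
    needs the full bidi layout transformation instead.
    """
    return run[::-1]


# Visual order: the leftmost character on screen is stored first.
visual = naive_visual(logical)
```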

>"... You should always create HTML (and any other type of markup) using 
logical ordering, and never use visual. ..."
Whenever possible, you should strive to create HTML (and any other type of 
markup) using logical ordering.

Section Visual ordering and its shortcomings
>"...To make visual ordering work, in addition to writing the text 
backwards, "
Not necessarily. While this is true for "green screens", the autopush 
feature in green screen emulators allows you to type bidi text in the 
natural order. 

Section Visual ordering and character encodings
Here is a list showing how the most popular character encodings used on 
visual platforms (such as iSeries) correlate with bidi layout 
characteristics such as the ordering scheme (which can be visual or 
logical).

CCSID: 420 (string type: 4, code page: 420, description: EBCDIC, original 
CCSID for Arabic data)
CCSID: 425 (string type: 5, code page: 425, description: EBCDIC with POSIX 
chars, like [] {} etc.)
CCSID: 424 (string type: 4, code page: 424, description: EBCDIC, original 
CCSID for Hebrew data)

If you agree to incorporate the list of CCSID details I can provide 
additional ones :-)))

The string type identifies the bidi layout properties that should be taken 
into account during bidi layout transformation:
string type 4 (Text Type = visual, Numeric Shaping = pass-through, 
Orientation = LTR, Text Shaping = shaped, Symmetric Swapping = off)
string type 5 (Text Type = implicit, Numeric Shaping = Arabic, 
Orientation = LTR, Text Shaping = unshaped, Symmetric Swapping = on)
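
One possible way to carry these properties through a conversion layer is a 
small lookup table. The sketch below only restates the values listed above. 
The class, table and function names are assumptions for illustration and 
do not correspond to an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BidiStringType:
    text_type: str            # "visual" or "implicit"
    numeric_shaping: str      # "pass-through" or "Arabic"
    orientation: str          # "LTR"
    text_shaping: str         # "shaped" or "unshaped"
    symmetric_swapping: bool  # True = on, False = off


# Values copied from the string type descriptions above.
STRING_TYPES = {
    4: BidiStringType("visual", "pass-through", "LTR", "shaped", False),
    5: BidiStringType("implicit", "Arabic", "LTR", "unshaped", True),
}

# CCSID to string type, as given in the CCSID list above.
CCSID_STRING_TYPE = {420: 4, 425: 5, 424: 4}


def layout_for_ccsid(ccsid: int) -> BidiStringType:
    """Look up the bidi layout properties associated with a CCSID."""
    return STRING_TYPES[CCSID_STRING_TYPE[ccsid]]


def needs_layout_transformation(source_ccsid: int, target_ccsid: int) -> bool:
    """True when the two CCSIDs differ in any bidi layout property."""
    return layout_for_ccsid(source_ccsid) != layout_for_ccsid(target_ccsid)
```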

If you agree to incorporate this list of string type I can provide 
additional data on string types 6-12 used on legacy systems :-)))

Additional information on bidi layout properties is as follows: 
Orientation: In bidirectional languages, some characters, such as English 
letters, are considered to have a strong left-to-right orientation. Other 
characters, such as Arabic characters, are considered strong 
right-to-left characters. Still other characters, such as punctuation 
marks and spaces, do not have a strong direction associated with them. 
Orientation can also be contextual. In that case, the global orientation is 
set according to the direction of the first significant (strong) 
character.
Numeric Shaping: In Arabic, it is common to use Arabic-Indic digits (often 
called "Hindi" numbers, U+0660 to U+0669) instead of European digits (often 
called "Arabic" numbers) such as "1" and "2".
Text Shaping: Specifies the shaping, that is, choosing (or composing) the 
correct shape of the input or output text.
Note: This value is important, in particular for languages where the 
shapes of the characters, when presented, correspond to code points that 
may be different from the code points of the characters stored for 
processing. In languages such as Arabic or Farsi, a character can have 
up to four different shapes (see Shapes of the Arabic Characters). In 
these languages the character is most frequently (but not always) stored 
and processed using a code point related to a basic shape. Often the basic 
shape chosen is the isolated shape.
An Arabic script character often has an initial form, a medial form, a 
final form, and an isolated form.
Symmetric Swapping: The Swapping descriptor specifies whether symmetric 
swapping is applied to the text. A list of symmetric swapping characters 
is given in the ISO/IEC 10646 standard. For example, without symmetric 
swapping the string "(1)" might be displayed as ")1(".
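
Three of the properties above can be sketched in a few lines of Python. 
All function and table names are assumptions, the swapping table is a 
small subset of the paired characters rather than the full list, and 
text shaping is left out because it depends on the neighbouring letters.

```python
import unicodedata


def first_strong_orientation(text: str, default: str = "LTR") -> str:
    """Contextual orientation: direction of the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default  # no strong character found


# Numeric shaping: European digits 0-9 to Arabic-Indic digits U+0660-U+0669.
TO_ARABIC_INDIC = {ord("0") + i: chr(0x0660 + i) for i in range(10)}


def shape_digits(text: str) -> str:
    """Replace European digits with Arabic-Indic digits."""
    return text.translate(TO_ARABIC_INDIC)


# Symmetric swapping: illustrative subset of the paired characters.
SWAP = str.maketrans("()[]{}<>", ")(][}{><")


def swap_symmetric(text: str) -> str:
    """Exchange each paired character with its mirror counterpart."""
    return text.translate(SWAP)
```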


Best Regards,

Tomer Mahlin
GCoC Bidi architect
Bidi Development Lab


Phone: +972-2-6491784 | Mobile: +972-54-3368122 E-mail: tomerm@il.ibm.com

IBM R&D Labs
Malcha Technology Park
Jerusalem 96951
 Israel

Received on Tuesday, 5 March 2013 09:56:27 UTC