- From: Tomer Mahlin <TOMERM@il.ibm.com>
- Date: Tue, 5 Mar 2013 07:00:50 +0200
- To: <www-international@w3.org>
- Cc: "Richard Ishida" <ishida@w3.org>
- Message-ID: <OFB6C89F99.6B5C8EB4-ONC2257B25.001B5091-42257B25.001BC49E@il.ibm.com>
These are consolidated comments from the IBM Bidi Globalization Center of
Competency on the document stored at:
http://www.w3.org/International/tutorials/new-bidi-xhtml/qa-visual-vs-logical
General observations
In modern systems where back-end storage, including legacy data (created
at some point using green screens), uses visual ordering (for example on
a mainframe or iSeries), it is necessary to support a bidirectional flow
of data between the back end (visual ordering) and the web front end
(logical ordering).
Two things might happen when data is passed between the back end and the
front end:
a. Code page conversion
b. Bidi layout transformation
The first is required because bidi data is represented on different
systems with different code pages (e.g. EBCDIC on visual back-end systems
and ASCII / Unicode on logical front-end systems).
The second is required because visual and logical systems correlate bidi
data storage and display differently.
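As a rough illustration of step (a) only, here is a minimal Python sketch.
It assumes Python's built-in "cp424" codec (EBCDIC Hebrew) as a stand-in
for whatever converter a real system uses; the sample word is illustrative.
Note that code page conversion changes the bytes but not the order of the
characters, which is why step (b) is a separate concern.

```python
# Sketch: code page conversion between Unicode and an EBCDIC code page.
# Assumption: Python's built-in "cp424" codec stands in for the converter.
text = "\u05e9\u05dc\u05d5\u05dd"  # a four-letter Hebrew word

ebcdic_bytes = text.encode("cp424")  # Unicode -> EBCDIC, one byte per letter
utf8_bytes = text.encode("utf-8")    # the same text in a Unicode encoding

round_trip = ebcdic_bytes.decode("cp424")  # EBCDIC -> Unicode

print(len(ebcdic_bytes), len(utf8_bytes))
print(round_trip == text)  # character order is untouched by this step
```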
The following data integrity issues should be taken into account from the
code page conversion perspective:
- Code page conversion of Arabic between Unicode and EBCDIC can compromise
data integrity if not handled carefully, because some ligatures (such as
Lam Alef) are stored as one character in EBCDIC and as two characters in
Unicode.
- In addition, shaped Arabic EBCDIC data that is converted to the isolated
Unicode form may lose integrity when converted back to the EBCDIC code
page if not handled properly. The same can happen with Arabic-Indic
digits, which are stored in this form in the EBCDIC code page.
These issues are unique to the Arabic language.
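The Lam Alef length change can be seen inside Unicode itself. The sketch
below is only an analogy: it uses Unicode compatibility normalization
(NFKC) as an assumed stand-in for a converter's unshaping step, and the
helper name unshape is hypothetical. It is not how any particular EBCDIC
converter is implemented.

```python
import unicodedata

# One code point for the ligature (Unicode presentation form).
LAM_ALEF_LIGATURE = "\ufefb"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

def unshape(text: str) -> str:
    """Hypothetical unshaping step; NFKC is an assumption, not a converter."""
    return unicodedata.normalize("NFKC", text)

expanded = unshape(LAM_ALEF_LIGATURE)

# One stored character becomes two, so lengths and offsets change.
print(len(LAM_ALEF_LIGATURE), len(expanded))
print([unicodedata.name(ch) for ch in expanded])
```

A converter going back to EBCDIC has to recognise the two-character
sequence and recombine it, otherwise the round trip does not reproduce
the original data.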
The following data integrity issue should be taken into account from the
bidi layout transformation perspective:
- Since UBA conversion between visual and logical ordering is in general
irreversible, preserving data consistency requires that visual data not be
converted to the logical ordering scheme. This is required when such data
is edited in the web front end. To handle visual data properly in such
cases, the UBA running on logical platforms should be disabled or
overridden. A technique for achieving this through Unicode control
characters (UCC, such as LRO) is described in the section "overriding the
algorithm" in
http://www.w3.org/International/tutorials/new-bidi-xhtml/Overview-inline.
Modern Dojo-based toolkits come with controls that use this technique to
provide a native experience for editing visual data.
This data integrity issue is common to both Arabic and Hebrew.
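A minimal sketch of the override technique, in Python: the visually
ordered string is wrapped in LRO ... PDF so that a UBA-aware renderer
displays the characters strictly left to right, in stored order. The
function names are hypothetical, and this is not the Dojo implementation.

```python
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def force_visual_ltr(visual_text: str) -> str:
    """Wrap visually ordered text so the UBA does not reorder it."""
    return f"{LRO}{visual_text}{PDF}"

def strip_override(text: str) -> str:
    """Remove the wrapper before writing the data back to the back end."""
    if text.startswith(LRO) and text.endswith(PDF):
        return text[1:-1]
    return text

stored = "\u05dd\u05d5\u05dc\u05e9"  # Hebrew word kept in visual order
for_display = force_visual_ltr(stored)
print(strip_override(for_display) == stored)
```

The stored characters themselves are never reordered, which is the point:
the data that goes back to the visual system is exactly what came from it.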
Section Quick Answer
> "... Visual ordering of text was a common way of representing Hebrew in
HTML on old user agents that didn't support the Unicode bidirectional
algorithm. Very little persists today. Characters making up the text were
stored in the source code in the same order you would see them displayed
on screen when looking from left to right.
(Visual ordering isn't really seen much for Arabic. Since the Arabic
letters are all joined up there was a stronger motivation on the part of
Arabic implementers to enable the logical ordering approach.)..."
Visual ordering of text is a common way of representing Arabic / Hebrew on
systems which don't support the UBA such as mainframe or iSeries. Those
systems are still widely used today. On such systems, characters making up
the text are stored in the source code in the same order you would see
them displayed on screen when looking from left to right.
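To make the difference in storage order concrete, here is a small Python
sketch for the simplest case only: a single run of right-to-left letters.
The function name is hypothetical and this is not the UBA; text containing
digits, Latin words or brackets needs a real bidi layout transformation.

```python
def logical_to_visual_pure_rtl(logical: str) -> str:
    """Reverse one pure RTL run. Assumption: no digits, Latin text, brackets."""
    return logical[::-1]

# Logical order: the order in which the letters are typed and read.
logical = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem

# Visual order: the order of the glyphs on screen, read from left to right.
visual = logical_to_visual_pure_rtl(logical)

print([f"U+{ord(ch):04X}" for ch in logical])
print([f"U+{ord(ch):04X}" for ch in visual])
```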
>"... You should always create HTML (and any other type of markup) using
logical ordering, and never use visual. ..."
Whenever possible, you should strive to create HTML (and any other type of
markup) using logical ordering.
Section Visual ordering and its shortcomings
>"...To make visual ordering work, in addition to writing the text
backwards, "
Not necessarily. While this is true for "green screens", the autopush
feature in green screen emulators allows you to type bidi text in the
natural order.
Section Visual ordering and character encodings
Here is a list showing the correlation between the most popular character
encodings used on visual platforms (such as iSeries) and the corresponding
bidi layout characteristics, such as the ordering scheme (which can be
visual or logical).
CCSID: 420 (string type: 4, code page: 420, description: EBCDIC (original
CCSID for Arabic data))
CCSID: 425 (string type: 5, code page: 425, description: EBCDIC with POSIX
chars, like [] {} etc.)
CCSID: 424 (string type: 4, code page: 424, description: EBCDIC (original
CCSID for Hebrew data))
If you agree to incorporate the list of CCSID details I can provide
additional ones :-)))
The string type identifies the bidi layout properties that should be taken
into account during bidi layout transformation:
string type 4 (Text Type = visual, numeric-shaping = pass-through,
Orientation= LTR, Text Shaping = shaped, Symmetric Swapping = off)
string type 5 (Text Type = implicit, numeric-shaping = Arabic,
Orientation= LTR, Text Shaping = unshaped, Symmetric Swapping = on)
If you agree to incorporate this list of string types I can provide
additional data on string types 6-12 used on legacy systems :-)))
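For readers who prefer to see this as data, the CCSID and string type
lists above could be captured roughly as follows. This is only a sketch:
the class, field and function names are hypothetical, and only the values
given in this message (string types 4 and 5, CCSIDs 420, 424 and 425) are
included.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BidiStringType:
    """Bidi layout properties of a string type (field names are assumed)."""
    text_type: str
    numeric_shaping: str
    orientation: str
    text_shaping: str
    symmetric_swapping: bool

# Values copied from the lists in this message; types 6-12 are not covered.
STRING_TYPES = {
    4: BidiStringType("visual", "pass-through", "LTR", "shaped", False),
    5: BidiStringType("implicit", "Arabic", "LTR", "unshaped", True),
}

CCSID_TO_STRING_TYPE = {420: 4, 425: 5, 424: 4}

def layout_for_ccsid(ccsid: int) -> BidiStringType:
    """Look up the layout properties to use when transforming data."""
    return STRING_TYPES[CCSID_TO_STRING_TYPE[ccsid]]

print(layout_for_ccsid(424))
```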
Additional information on bidi layout properties is as follows:
Orientation: In bidirectional languages, some characters, such as English
letters, are considered to have a strong left-to-right orientation. Other
characters, such as the Arabic characters, are considered strong
right-to-left characters. And other characters, such as punctuation marks,
spaces, and so on, do not have a strong direction associated with them.
These are also contextual. In this situation, the global orientation is
set according to the direction of the first significant (strong)
character.
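The "first strong character" rule can be sketched in Python with the
standard unicodedata module. The function name and the default used when
no strong character is found are assumptions made for illustration.

```python
import unicodedata

def contextual_orientation(text: str, default: str = "LTR") -> str:
    """Return the orientation implied by the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right
            return "LTR"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "RTL"
        # Digits, spaces and punctuation are not strong: keep looking.
    return default  # assumption: fall back when there is no strong character

print(contextual_orientation("123 abc"))
print(contextual_orientation("(\u05d0) abc"))
```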
Numeric Shaping: In Arabic, it is common to use Hindi (Arabic-Indic)
digits instead of Arabic (European) digits. "1", "2", etc. are the Arabic
form of the digits.
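A small Python sketch of numeric shaping as a digit-for-digit mapping
between the two sets (U+0030..U+0039 and U+0660..U+0669). The function
name is hypothetical; "Arabic" and "pass-through" are the values used in
the string type list, while "Hindi" is assumed here as the name of the
third option.

```python
EUROPEAN_DIGITS = "0123456789"
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))

TO_ARABIC_INDIC = str.maketrans(EUROPEAN_DIGITS, ARABIC_INDIC_DIGITS)
TO_EUROPEAN = str.maketrans(ARABIC_INDIC_DIGITS, EUROPEAN_DIGITS)

def shape_digits(text: str, numeric_shaping: str) -> str:
    """Apply numeric shaping; the value names are partly assumed."""
    if numeric_shaping == "Hindi":       # use U+0660..U+0669
        return text.translate(TO_ARABIC_INDIC)
    if numeric_shaping == "Arabic":      # use "0".."9"
        return text.translate(TO_EUROPEAN)
    return text                          # "pass-through": leave digits as stored

print(shape_digits("Room 12", "Hindi"))
```

With pass-through, digits go back to the back end exactly as they were
stored, which matters for the round-trip concerns raised earlier.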
Text Shaping: Specifies the shaping: that is, choosing (or composing) the
correct shape of the input or output text.
Note: This value is important, in particular for languages where the
shapes of the characters, when presented, correspond to code points that
may be different from the code points of the characters stored for
processing. In languages such as Arabic or Farsi, the character can have
up to four different shapes (see Shapes of the Arabic Characters). In
these languages the character is most frequently (but not always) stored
and processed using a code point related to a basic shape. Often the basic
shape chosen is the isolated shape.
An Arabic script character often has an initial form, a middle form, a
final form, and an isolated form.
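The four shapes can be inspected in Unicode, where each has its own
presentation-form code point that maps back to one basic letter. The
Python sketch below uses the letter Beh as an example; using NFKC to get
from a shape to the basic letter is an illustration, not a description of
how any particular shaping engine works.

```python
import unicodedata

# Presentation forms of ARABIC LETTER BEH (U+0628).
BEH_FORMS = {
    "isolated": "\ufe8f",
    "final": "\ufe90",
    "initial": "\ufe91",
    "medial": "\ufe92",
}
BASE_BEH = "\u0628"

for form, shaped in BEH_FORMS.items():
    base = unicodedata.normalize("NFKC", shaped)
    print(form, f"U+{ord(shaped):04X}", "->", f"U+{ord(base):04X}")
```

Four shaped code points collapse to one stored code point, so going from
unshaped back to shaped data requires looking at the neighbouring letters.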
Symmetrical Swapping: The Swapping descriptor specifies whether symmetric
swapping is applied to the text. A list of symmetric swapping characters
is given in the ISO/IEC 10646 standard. For example, the string "(1)"
without symmetric swapping might become ")1(".
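The "(1)" example can be reproduced with a short Python sketch. It assumes
the segment is reordered by simple reversal, and it uses a small
hand-written table of pairs rather than the full list from ISO/IEC 10646;
the names are hypothetical.

```python
# Assumption: a tiny subset of the symmetric swapping characters.
SWAP_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}

def reverse_run(text: str, symmetric_swapping: bool) -> str:
    """Reverse a run of text, optionally swapping paired characters."""
    reversed_text = text[::-1]
    if not symmetric_swapping:
        return reversed_text
    return "".join(SWAP_PAIRS.get(ch, ch) for ch in reversed_text)

print(reverse_run("(1)", symmetric_swapping=False))
print(reverse_run("(1)", symmetric_swapping=True))
```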
Best Regards,
Tomer Mahlin
GCoC Bidi architect
Bidi Development Lab
Phone: +972-2-6491784 | Mobile: +972-54-3368122 E-mail: tomerm@il.ibm.com
IBM R&D Labs
Malcha Technology Park
Jerusalem 96951
Israel
Received on Tuesday, 5 March 2013 09:56:27 UTC