Re: New article for REVIEW: Upgrading from language-specific legacy encoding to Unicode encoding from Simon Montagu on 2005-08-24 (www-international@w3.org from July to September 2005)

From: Simon Montagu <smontagu@smontagu.org>
Date: Wed, 24 Aug 2005 23:00:43 +0200
To: Mark Davis <mark.davis@icu-project.org>
Cc: Frank Yung-Fong Tang <franktang@gmail.com>, Jony Rosenne <rosennej@qsm.co.il>, www-international@w3.org, Markus Scherer <markus.scherer@us.ibm.com>
Message-ID: <430CDFFB.3080700@smontagu.org>
As well as the cases that Mark mentions where one visual sequence could 
map to more than one logical sequence, there are visual sequences that 
have no logical equivalent (unless control characters are inserted), e.g.

CBA123

In Gecko we work round this problem by splitting the text into runs so 
that each run contains characters with the same inherent directionality. 
When displaying a Visual Hebrew document on platforms where the 
rendering layer expects logical Hebrew, we then simply reverse the 
Hebrew runs before sending them to the platform API. The platform then 
reverses them back to the original visual order.
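
The run-splitting idea described above can be sketched roughly as follows. This is an illustrative Python sketch, not Gecko's actual code: the function names (is_rtl, split_runs, reverse_rtl_runs) are hypothetical, and it makes the simplifying assumption that only characters of bidi class R or AL count as Hebrew-direction, so digits, spaces, and punctuation stay in the non-reversed runs. A real implementation would need to treat neutrals and numbers more carefully.

```python
import unicodedata

# Assumption: only strong right-to-left classes are treated as "Hebrew" runs.
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch):
    """Return True if the character has strong right-to-left directionality."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def split_runs(text):
    """Split text into maximal runs of characters sharing the same RTL flag."""
    runs = []
    for ch in text:
        rtl = is_rtl(ch)
        if runs and runs[-1][0] == rtl:
            runs[-1][1].append(ch)
        else:
            runs.append((rtl, [ch]))
    return [(rtl, "".join(chars)) for rtl, chars in runs]


def reverse_rtl_runs(visual_line):
    """Reverse each RTL run in place and leave every other run untouched."""
    return "".join(
        run[::-1] if rtl else run for rtl, run in split_runs(visual_line)
    )


# The visual sequence CBA123 from above, written with real Hebrew letters
# (gimel, bet, alef followed by the digits 123).
visual = "\u05d2\u05d1\u05d0123"
to_platform = reverse_rtl_runs(visual)
```

Because the runs themselves never change position, applying reverse_rtl_runs twice returns the original string, which mirrors the platform reversing each run back to its visual order.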

Simon

Mark Davis wrote:
> I think what Jony is referring to is that there are multiple ways to go 
> from visual to logical. Each possibility can be consistent, in that
> 
>    toVisual(toLogical(X)) = X
> 
> however, they may not each be expected, and some combinations may 
> require insertion of LRM or RLM, and/or knowledge of the bidi 
> environment (http://www.unicode.org/reports/tr9/#Higher-Level_Protocols) 
> used in getting toVisual(). Some simple examples:
> 
> Visual: abBA
> could result from:
> Logical: abAB
> or
> Logical: ABab
> 
> Visual: BAab
> could result from:
> Logical: <RLM>abAB
> or
> Logical: <LRM>ABab
> 
> Mark
> 
> Frank Yung-Fong Tang wrote:
> 
>>
>>
>> 2005/8/24, Jony Rosenne <rosennej@qsm.co.il>:
>>
>>
>>     Where the text is long enough, a separate document linked to from
>>     the main
>>     document is in order.
>>
>>
>> agree.
>>
>>     For Hebrew, the situation is a little simpler: In the general case
>>     it is not
>>     possible to convert visual to logical automatically.
>>
>> Hum??? How can it be... Simon: did we do the visual Hebrew to logical 
>> Hebrew conversion in Gecko before we pipe the ISO-8859-8 info to the 
>> Mac ATSUI? It surely is a hard process, but if that is not possible, 
>> how can we deal with visual form in an environment which only supports 
>> logical input? (Like ATSUI or WorldScript II on MacOS)
>>
>>     Jony
>>
>>     > -----Original Message-----
>>     > From: Tex Texin [mailto:tex@xencraft.com]
>>     > Sent: Wednesday, August 24, 2005 1:58 PM
>>     > To: Frank Yung-Fong Tang
>>     > Cc: Jony Rosenne; www-international@w3.org
>>     > Subject: Re: New article for REVIEW: Upgrading from
>>     > language-specific legacy encoding to Unicode encoding
>>     >
>>     >
>>     > I was going to make more or less the same comment, which is
>>     > that conversion
>>     > from legacy encodings to unicode is a difficult but necessary
>>     subject.
>>     > It is large so should be a separate faq or faqs, and should
>>     cover many
>>     > encodings, not just bidi.
>>     >
>>     > Any minute now, Richard is going to pipe up suggesting Jony
>>     > submit a faq for
>>     > Hebrew and Frank one for double-byte encoding conversions, so
>>     > I'll preempt
>>     > him and suggest that as well. ;-)
>>     >
>>     > Although we could use a treatise on these issues, I wonder if
>>     > it would be
>>     > better to identify libraries or tools that do the job right
>>     > and give users
>>     > appropriate choices. I muck around with iconv, ICU, perl,
>>     > etc. and it is
>>     > very hard to know which tools will do the entire job
>>     > correctly, and which do
>>     > the minimum, or are several versions behind.
>>     >
>>     > For example, a convertor written for Unicode 2.0 would not
>>     > take advantage of
>>     > the characters in Unicode 4.x.
>>     > It is correct in some sense and incorrect in other ways. Also, a
>>     pure
>>     > encoding convertor would not take into account the needs of
>>     > the Web, and
>>     > perhaps issues of conversion to the bidi markup.
>>     >
>>     > And which tools offer a choice when it comes to converting
>>     > backslash to yen,
>>     > wan, etc. when used as currency?
>>     >
>>     > Many users are confused by which conversions to use. e.g. When
>>     to use
>>     > Windows-1252 instead of iso 8859-1, or when to use big5-hkscs
>>     > instead of
>>     > big-5, since often data is mislabeled?
>>     >
>>     > I think the tools view or roadmap may be more important than
>>     > the character
>>     > encoding details.
>>     >
>>     > But yes, it is a topic definitely needing expansion.
>>     > --
>>     > -------------------------------------------------------------
>>     > Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
>>     > Xen Master                          http://www.i18nGuy.com
>>     >
>>     > XenCraft                          http://www.XenCraft.com
>>     > Making e-Business Work Around the World
>>     > -------------------------------------------------------------
>>     >
>>     >
>>     >
>>
>>
>>
>>
>>
>> -- 
>> Frank Yung-Fong Tang   譚永鋒
>> Šýšţém Årçĥîţéçţ
>>
>> Day: 703-265-6347                         
>> http://people.netscape.com/ftang
>> Skype: FrankYungFongTang           Yahoo IM: FrankYungFongTan
>> AIM ID: ytang0648                         MSN IM: FrankYungFongTang@hotmail.com
>>                          
> 
> 
>
Received on Wednesday, 24 August 2005 19:58:26 UTC