Re: Proposal for isolation characters in Unicode and the unicode-bidi:isolate and unicode-bidi:plaintext definitions from Aharon (Vladimir) Lanin on 2012-05-15 (public-i18n-bidi@w3.org from April to June 2012)

From: Aharon (Vladimir) Lanin <aharon@google.com>
Date: Tue, 15 May 2012 16:21:43 +0200
To: Mohamed Mohie <MOHIEM@eg.ibm.com>
Cc: Martin J. Dürst <duerst@it.aoyama.ac.jp>, public-i18n-bidi@w3.org
Message-ID: <CA+FsOYZD-xEcYNw9CKdNpiTYpBQs2ycR9tRCNom24iz97PAmKw@mail.gmail.com>

Let's say that a database holds information about books, including their
titles. Let's say that there is a book in an RTL language whose RTL title
is meant to be displayed as

ELPMIS EDAM c++ DECNAVDA

In logical order using existing Unicode characters, it could be represented
as

ADVANCED c++[LRM] MADE SIMPLE

or as

ADVANCED [LRE]c++[PDF] MADE SIMPLE

or even as

ADVANCED [LRE]c++[PDF][RLM] MADE SIMPLE

All will be displayed as intended in an RTL context.

However, the database is going to be used by a simple-minded application
whose authors have never heard of bidi and will simply plug the values they
get out of the database into the text they generate. If the application is
LTR, the book title will be garbled (the same way for all the versions
above):

DECNAVDA c++ ELPMIS EDAM

On the other hand, if the book title uses the proposed isolation
characters, i.e.

ADVANCED [FSI]c++[PDF] MADE SIMPLE

it comes out correct both in LTR and RTL contexts.

Thus, using isolation is more robust than using embedding and/or LRM/RLM.

Of course, this will not work for all book titles. For example,

ADVANCED [FSI]c++[PDF]

will be garbled as

DECNAVDA c++

in an LTR context (just like "ADVANCED c++[LRM]" and the other current
approaches).

Furthermore, the same title will be garbled even in an RTL context when
followed by a number, e.g. " (29 MAY 2008)":

(2008 YAM c++ (29 DECNAVDA

instead of

(2008 YAM 29) c++ DECNAVDA

However, the simple-minded application can be easily fixed to deal with
this if the new characters are available to it. All it has to do is wrap
each book title in FSI and PDF. That is,

[FSI]ADVANCED [FSI]c++[PDF][PDF]

will be displayed correctly in both LTR and RTL, whether it is followed by
a number or not.
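
As a minimal sketch of that fix (in Python; the function names are
hypothetical), the application needs only one direction-agnostic wrapper.
The code points are an assumption relative to this proposal: FSI had no
code point when this was written and was later assigned U+2068, and the
isolate terminator that was eventually standardized is PDI (U+2069) rather
than PDF (U+202C), so the terminator is left as a parameter.

```python
# Assumed code points (not part of the proposal text itself).
FSI = "\u2068"  # FIRST STRONG ISOLATE, as later assigned in Unicode
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING, the terminator proposed here
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, the terminator later standardized


def isolate(text: str, terminator: str = PDF) -> str:
    """Wrap text in a first-strong isolate.

    The caller does not need to know the direction of the text or of the
    surrounding output; the same wrapper is used in every case.
    """
    return f"{FSI}{text}{terminator}"


def title_with_date(title: str, date: str) -> str:
    """Hypothetical application code: plug a stored title into output."""
    return f"{isolate(title)} ({date})"


# A stored title that already isolates its LTR fragment is simply wrapped again.
stored_title = "ADVANCED " + isolate("c++")
line = title_with_date(stored_title, "29 MAY 2008")
```

Passing terminator=PDI gives the form that current bidi implementations
recognize.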

Wrapping each book title in FSI and PDF is much, much easier than wrapping
it in either LRE or RLE at one end and PDF and either LRM or RLM at the
other. The choice between LRE and RLE depends on the title's
directionality, which is usually not directly available to the application.
And the choice between LRM and RLM depends on the directionality of the
application's overall output (i.e. locale), which may not be available in
the code layer inserting a book title into the application's output. These
difficulties usually prevent all but the most advanced applications from
displaying opposite-directionality text data correctly, and the hope is that
the new characters will lower this barrier tremendously.
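
For contrast, here is a rough sketch (Python again, hypothetical names) of
what the embedding-plus-mark approach asks of the application. It needs two
extra inputs: the direction of the title, which I estimate here with a
first-strong scan purely as an assumption, and the direction of the
surrounding output, which the caller has to supply somehow.

```python
import unicodedata

LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"  # embedding controls
LRM, RLM = "\u200e", "\u200f"                 # directional marks


def first_strong_is_rtl(text: str) -> bool:
    """Guess direction from the first strongly directional character.

    This heuristic is an assumption of the sketch; the application may
    have no reliable way to learn the title's direction at all.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return True
        if bidi_class == "L":
            return False
    return False  # no strong character found: treat as LTR


def embed_legacy(text: str, context_is_rtl: bool) -> str:
    """Wrap text using LRE/RLE + PDF, followed by LRM or RLM."""
    opener = RLE if first_strong_is_rtl(text) else LRE
    # The trailing mark must match the surrounding output, not the text.
    mark = RLM if context_is_rtl else LRM
    return f"{opener}{text}{PDF}{mark}"
```

The context_is_rtl argument is exactly the piece of information that the
layer inserting the title often does not have.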

Aharon

On Tue, May 15, 2012 at 1:21 PM, Mohamed Mohie <MOHIEM@eg.ibm.com> wrote:

> Hello Aharon,
> It's not clear to me what problems these additional characters can solve
> which we can't solve in the current UBA by combining LRE/RLE and inserting
> LRM/RLM?
>
> Thanks And Best regards,
> Mohamed Mohie , PMP®
> _______________________________________________________
> Manager of Arabic Competence and Globalization Center (ACGC)
> GCoC BIDI , Advisory Software Engineer, Project Manager, M.Sc.
> Cairo Technology Development Center (CTDC)
> IBM Egypt-
> email : mohiem@eg.ibm.com
>
>
>
>
>
> From:   "Aharon (Vladimir) Lanin" <aharon@google.com>
> To:     Martin J. Dürst <duerst@it.aoyama.ac.jp>
> Cc:     public-i18n-bidi@w3.org
> Date:   15/05/2012 11:09 AM
> Subject:        Re: Proposal for isolation characters in Unicode and the
>            unicode-bidi:isolate and unicode-bidi:plaintext definitions
>
>
>
> [-www-style]
>
> I guess public-i18n-bidi is an ok place to discuss the Unicode proposal.
> But would it not be better to do so on some Unicode list, at least in
> addition to here?
>
>
>  It may be worth considering to create a new character to close these
>  embeddings. Otherwise, older algorithms will close LRE/RLE/LRO/RLO
>  embeddings/overrides prematurely.
>
> Good point.
>
>
>  Another question: What's the relationship between this proposal and the
>  new bidi control character that was proposed (I think by Apple) around
>  last November's UTC?
>
> I guess you are referring to http://www.unicode.org/review/pri205/ ("LEVEL
> DIRECTION MARK (LDM) behaves like a direction mark which dynamically takes
> on the resolved direction associated with the current embedding level")
>
> Using the current Unicode feature set, the way to deal with an
> opposite-direction inline insert is to declare its direction with LRE|RLE +
> PDF around it (to ensure the correct ordering inside the insert),
> immediately followed by an LRM when the embedding level around the phrase
> is even (LTR) or an RLM when it is odd (RTL), to prevent a number or an
> unrelated opposite-direction phrase following the insert from "sticking" to
> it. The principal difficulty in implementing this is that often the code
> layer doing the insertion has no idea what the embedding level at the
> insertion point is. The LDM would address this need; IMO it is the most
> important use case for it.
>
> Under the new proposal, the way to deal with the opposite-direction phrases
> is to put them in an isolate. LRM and RLM - and thus LDM - are not
> necessary. Furthermore, this way to deal with opposite-direction inline
> inserts is more robust, because it works even when the insert is surrounded
> by a phrase whose direction is opposite to the embedding level, but whose
> direction is not explicitly declared. Of course, it would be better if the
> direction of every opposite-direction phrase were declared, but often that
> is not the way that bidi text is constructed. In such cases, an LDM (or
> LRM|RLM) disrupts the phrase surrounding the insert.
>
> I believe that the use cases cited for the LDM can also be achieved with
> isolates. For example, "An Arabic numeric date of the form dd/MM/yyyy in
> which the fields should flow left-to-right (e.g. 09/16/2011) in a
> left-right context (i.e. the date and perhaps some other Arabic text are in
> a mainly Latin-script paragraph), but should flow right-to-left (e.g.
> 2011/16/09) in a right-left context (e.g. a primarily Arabic-script
> paragraph)" can be achieved by putting each of the numbers (day, month,
> year) in a separate isolate, e.g. FSI09PDF/FSI16PDF/FSI2011PDF.
>
> However, these are two independent proposals that do not actually conflict,
> and you might want to get the opinion of the LDM's proposers :-)
>
> Aharon
>
> On Tue, May 15, 2012 at 10:24 AM, Aharon (Vladimir) Lanin <
> aharon@google.com> wrote:
>  I will reply substantively after taking www-style off the recipients. I
>  don't think that the CSS list is the right place to discuss the details
>  of the Unicode proposal.
>
>  Aharon
>
>
>  On Tue, May 15, 2012 at 10:10 AM, "Martin J. Dürst" <
>  duerst@it.aoyama.ac.jp> wrote:
>   On 2012/05/15 5:06, Aharon (Vladimir) Lanin wrote:
>     Last week, I wrote up and Mark Davis submitted to the UTC a proposal (
>     http://goo.gl/K6qtV) for adding bidi isolation to Unicode. Here is the
>     basic proposal:
>
>     --- start quote ---
>     Define three new Unicode formatting code points:
>     LRI: marks the beginning of a left-to-right isolate.
>     RLI: marks the beginning of a right-to-left isolate.
>     FSI: marks the beginning of a first-strong isolate.
>
>     Each would be matched with a PDF.
>
>   It may be worth considering to create a new character to close these
>   embeddings. Otherwise, older algorithms will close LRE/RLE/LRO/RLO
>   embeddings/overrides prematurely.
>
>   Another question: What's the relationship between this proposal and the
>   new bidi control character that was proposed (I think by Apple) around
>   last November's UTC?
>
>   Regards,   Martin.
>
>
Received on Tuesday, 15 May 2012 14:22:34 UTC