RE: IRIs and bidi: Addition regarding higher-level protocols from Martin Duerst on 2004-02-12 (public-iri@w3.org from February 2004)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 12 Feb 2004 16:32:38 -0500
To: "Michel Suignard" <michelsu@windows.microsoft.com>
Cc: <public-iri@w3.org>, "Mark Davis" <mark.davis@jtcsv.com>, bidi@unicode.org
Message-Id: <4.2.0.58.J.20040212115749.04297e00@localhost>

Hello Michel,

Many thanks for your text. I have taken a different approach. The new text
now reads:

<<<<<<<<
    When rendered, bidirectional IRIs MUST be rendered using the Unicode
    Bidirectional Algorithm [UNIV4], [UNI9].  Bidirectional IRIs MUST be
    rendered in the same way as they would be rendered if they were in a
    left-to-right embedding, i.e.  as if they were preceded by U+202A,
    LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP
    DIRECTIONAL FORMATTING (PDF).  Setting the embedding direction can
    also be done in a higher-order protocol (e.g.  the dir='ltr'
    attribute in HTML).

    There is no requirement to actually use the above embedding if the
    display is still the same without the embedding.  For example, a
    bidirectional IRI in a text with left-to-right base directionality
    (such as used for English or Cyrillic) that is preceded and followed
    by whitespace and strong left-to-right characters does not need an
    embedding.  Also, a bidirectional relative IRI that only contains
    strong right-to-left characters and weak characters and that starts
    and ends with a strong right-to-left character and appears in a text
    with right-to-left base directionality (such as used for Arabic or
    Hebrew) and is preceded and followed by whitespace and strong
    characters does not need an embedding.

    In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM) may be
    sufficient to force the correct display behavior.  However, the
    details of the Unicode Bidirectional algorithm are not always easy to
    understand.  Implementers are strongly advised to err on the side of
    caution and to use embedding in all cases where they are not
    completely sure that the display behavior is unaffected without the
    embedding.

    The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits
    higher-level protocols to influence bidirectional rendering.  Such
    changes by higher-level protocols MUST NOT be used if they change the
    rendering of IRIs.

    The bidirectional formatting characters that may be used before or
    after the IRI to assure correct display are themselves not part of
    the IRI.  IRIs MUST NOT contain bidirectional formatting characters
    (LRM, RLM, LRE, RLE, LRO, RLO, and PDF).  They affect the visual
    rendering of the IRI, but do not themselves appear visually.  It
    would therefore not be possible to correctly input an IRI with such
    characters.
<<<<<<<<
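
To illustrate the first and last paragraphs of the new text, here is a
minimal Python sketch (not part of the draft). It wraps an IRI in
LRE...PDF for display and rejects an IRI that itself contains one of the
seven bidi formatting characters listed above. The function names
contains_bidi_formatting and wrap_for_display are hypothetical, chosen
only for this example.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# The seven formatting characters the draft text forbids inside an IRI.
BIDI_FORMATTING = {
    "\u200E",  # LRM
    "\u200F",  # RLM
    "\u202A",  # LRE
    "\u202B",  # RLE
    "\u202D",  # LRO
    "\u202E",  # RLO
    "\u202C",  # PDF
}


def contains_bidi_formatting(iri: str) -> bool:
    """Return True if the IRI contains any bidi formatting character."""
    return any(ch in BIDI_FORMATTING for ch in iri)


def wrap_for_display(iri: str) -> str:
    """Surround an IRI with LRE...PDF (hypothetical helper).

    The two added characters are display scaffolding only; they are
    not part of the IRI.
    """
    if contains_bidi_formatting(iri):
        raise ValueError("IRI must not contain bidi formatting characters")
    return LRE + iri + PDF
```

This always embeds, which is the cautious choice the third paragraph
recommends; it makes no attempt to detect the cases where the embedding
could be omitted.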


The old text read:

 >>>>>>>>
    When rendered, bidirectional IRIs MUST be rendered using the Unicode
    Bidirectional Algorithm [UNIV4], [UNI9].  Bidirectional IRIs MUST be
    rendered with an overall left-to-right (ltr) direction.  The Unicode
    Bidirectional Algorithm ([UNI9], Section 4.3) permits higher-level
    protocols to influence bidirectional rendering.  Such changes by
    higher-level protocols MUST NOT be used if they change the rendering
    of IRIs.

    In text with a left-to-right base directionality or embedding (such
    as used for English or Cyrillic), the Unicode Bidirectional Algorithm
    will automatically use an overall ltr direction for the IRI.  In text
    with a rtl base directionality or embedding (such as used for Arabic
    or Hebrew), setting a different embedding direction for the IRI is
    needed.  Setting the embedding direction can be done in a higher-
    order protocol (e.g.  the dir='ltr' attribute in HTML).  If this is
    not available (e.g.  in plain text), setting the embedding is done
    with Unicode bidi formatting codes, i.e.  U+202A, LEFT-TO-RIGHT
    EMBEDDING (LRE) before the IRI, and U+202C, POP DIRECTIONAL
    FORMATTING (PDF) after the IRI, both not being part of the IRI
    itself.

    IRIs MUST NOT contain bidirectional formatting characters (LRM, RLM,
    LRE, RLE, LRO, RLO, and PDF).  They affect the visual rendering of
    the IRI, but do not themselves appear visually.  It would therefore
    not be possible to correctly input an IRI with such characters.
 >>>>>>>>


There are several changes, in particular:

- Making clear that the required display behavior is that of an ltr
   embedding (not just ltr base directionality).
- Tightening the case(s) that don't actually need the embedding
   to avoid the cases that were wrongly included, as found by Michael.
- Describing a case where no embedding is necessary in a purely
   rtl context (what Jony was looking for).

The rest is mostly just moving things around a bit. Please check and
tell me if I have missed something.


At 17:04 04/02/11 -0800, Michel Suignard wrote:
>Martin, here is my new proposed text (in quotes) for replacement of the
>2nd paragraph of clause 4.1:
>
><<
>When rendered, bidirectional IRIs MUST be rendered using the Unicode
>Bidirectional Algorithm [UNIV4] [UNI9] with an overall left-to-right
>(ltr) direction.
>To achieve this, the IRI is embedded left-to-right in
>all the following cases:
>1. If the current embedding level before the IRI is odd (right-to-left)
>2. If the last character with a strong directionality before the IRI is
>right-to-left
>3. If the first character with a strong directionality after the IRI is
>right-to-left.

I think these three conditions would cover all the necessary cases,
but they would also force embedding in Jony's case, which is not
necessary and which I wanted to avoid.
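
For concreteness, the three conditions can be sketched in Python as a
single predicate. This is only an illustration under assumptions: the
names first_strong and needs_ltr_embedding are hypothetical, the
embedding level is passed in by the caller, and "strong" is taken to mean
the bidi classes L, R and AL as reported by unicodedata.bidirectional.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
STRONG_RTL = {"R", "AL"}


def first_strong(chars):
    """Return the bidi class of the first strong character, or None."""
    for ch in chars:
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG:
            return cls
    return None


def needs_ltr_embedding(before: str, after: str, embedding_level: int = 0) -> bool:
    """Apply the three proposed conditions to the text around an IRI."""
    # 1. The current embedding level before the IRI is odd (right-to-left).
    if embedding_level % 2 == 1:
        return True
    # 2. The last strong character before the IRI is right-to-left.
    if first_strong(reversed(before)) in STRONG_RTL:
        return True
    # 3. The first strong character after the IRI is right-to-left.
    if first_strong(after) in STRONG_RTL:
        return True
    return False
```

With embedding_level=1 the predicate returns True whatever the IRI
contains, which is why these conditions also force an embedding in the
purely rtl case.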


>No additional bidirectional rendering change by higher-level protocols
>is allowed.
>
>Note: Embedding the IRI left-to-right can be achieved by embedding the
>text with LRE...PDF. If the maximum allowed embedding level is exceeded
>(above 62), the IRI overall left-to-right direction may not be enforced.
> >>

I prefer not to mention the 62 levels case. It is part of the bidi
algorithm, and the limit is set so high that it shouldn't affect
anything but pathological cases anyway.




>The small diagram (to be seen in monospaced chars) shows the desired
>result
>
>-String before-|  IRI  |-String after--
>               L    ON   L
>(For the string before and after, the IRI behaves as bidi 'ON')

I'm not actually sure that that's possible with an embedding.
For example in rule W1 in the bidi algorithm, we have

sor NSM -> sor L
(assuming sor is L)

A sor of L could result from a closing of an ltr embedding.
So if I understand the way to calculate sor/eor correctly,
the IRI would appear as L to the surroundings.
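
A small Python sketch of that reading, assuming (as I understand the
bidi algorithm) that sor and eor are taken from the higher of the two
embedding levels on either side of a run boundary, L if that level is
even and R if it is odd. The function name boundary_type is hypothetical.

```python
def boundary_type(level_a: int, level_b: int) -> str:
    """sor/eor type at a boundary between runs at the two given levels.

    Assumption: the type follows the higher of the two levels,
    L for an even level and R for an odd one.
    """
    return "R" if max(level_a, level_b) % 2 else "L"


# An ltr embedding (level 2) closing inside an rtl paragraph (level 1):
# the following level-1 run gets sor = L.
sor_after_iri = boundary_type(2, 1)
```

Under that assumption the run following the embedded IRI sees L at its
start, not ON.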


>(For the
>IRI itself, string before and after behave as bidi 'L')

That I think is correct.


>BTW I am interpreting clause W2 of the Unicode Bidi algorithm concerning
>the strong type enumeration as including as well the embedding
>characters (at least the LRE) as it is necessary in the logic expressed
>above.

Yes. That's expressed by the sor, which would be L in the case of
starting an ltr embedding.


Regards,   Martin.


>I have tried one of the sample bidi algorithm implementations (Asmus
>Freytag's version) and it behaves that way.
>
>Michel
Received on Thursday, 12 February 2004 17:10:10 UTC