
Re: [inline bidi update] - Some comments

From: Aharon (Vladimir) Lanin <aharon@google.com>
Date: Mon, 24 Feb 2014 12:35:24 +0200
Message-ID: <CA+FsOYYXTRpGJ+ryoRgfjbNGUtJFrdHpFkJ17DWTt04vxjf9Sg@mail.gmail.com>
To: Lina Kemmel <LKEMMEL@il.ibm.com>
Cc: "www-international@w3.org" <www-international@w3.org>
Thank you for your insightful comments!

See below.

On Tue, Feb 4, 2014 at 3:09 PM, Lina Kemmel <LKEMMEL@il.ibm.com> wrote:

> Hello Richard, Aharon et al,
> Please find below some comments on the article.
> 1. "If the text to be marked up is tightly wrapped by a non-inline element
> you would usually add the dir attribute to that element. In some cases
> this can lead to the block of text being aligned on the page in a way that
> is not desirable. To avoid this, you can add an inline element immediately
> inside the tags of the existing markup ..."
> Comment: Alternatively, mismatching direction and alignment in a
> block-level element can be achieved by specifying both dir and align
> attributes, for example: <p dir=rtl align=left>ABC...</p>.
> This would cause the paragraph content to have right-to-left base
> direction and be aligned to the left.

The align attribute is already deprecated in HTML4 and should not be used.
Alignment can of course be fixed using CSS (text-align), but it has issues
of its own. To assign the correct text-align value, one has to know the
inherited text-align (e.g. one does not want to override 'center' with
'left' or 'right') and directionality. It is easier to use an inline
element as recommended in the article.

> 2. "if the tightly-wrapped phrase in the previous step is followed inline
> (possibly after some intervening neutral characters) by a number or a
> logically separate opposite-direction phrase, then add a directional mark
> (RLM or LRM) immediately after the markup of that phrase. " [referring to
> HTML4]
> Comment: It can be necessary to add a directional mark also before the
> markup of the phrase to be isolated.
> For example, in an LTR paragraph, an RTL phrase to be isolated is injected
> from a database with directional markup added, but the preceding RTL
> phrase doesn't contain such markup. If the relative order of the 2
> successive RTL phrases should be preserved in display (to follow the LTR
> base text direction), an LRM character should be inserted before the
> injected phrase.
> <p>the concatenation of tokens is: RTL-TEXT * <span dir=rtl>
> RTL-INJECTED-TEXT</span></p>
> What you'd see without LRM is:
> the concatenation of tokens is: TXET-DETCEJNI-LTR * TXET-LTR
> What you'd expect to see:
> the concatenation of tokens is: TXET-LTR * TXET-DETCEJNI-LTR
> This is accomplished by adding an LRM before the injected phrase:
> <p>the concatenation of tokens is: RTL-TEXT * &lrm;<span dir=rtl>
> RTL-INJECTED-TEXT</span></p>

The assumption behind the recommendation to put a mark after the
opposite-direction phrase (and not before it) is that *every* phrase gets
treated this way. Once that is the case, the mark before the phrase is
unnecessary, since the preceding opposite-direction phrase has a mark after
it.
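
As an illustration only, here is a minimal Python sketch of the "mark after
every opposite-direction phrase" rule. The helper names are hypothetical and
are not taken from the article, and guessing the phrase direction from its
first strong character is an assumption; real code may already know the
direction from its data source.

```python
import html
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def first_strong_direction(text):
    """Return 'ltr' or 'rtl' from the first strong character, else None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def markup_phrase(phrase, context_dir):
    """Hypothetical helper: wrap an opposite-direction phrase in a span
    with dir, and append the mark that matches the surrounding context."""
    phrase_dir = first_strong_direction(phrase)
    escaped = html.escape(phrase)
    if phrase_dir is None or phrase_dir == context_dir:
        return escaped  # neutral or same direction: no markup added
    mark = LRM if context_dir == "ltr" else RLM
    return '<span dir="%s">%s</span>%s' % (phrase_dir, escaped, mark)
```

Because every opposite-direction phrase gets its own trailing mark, a second
such phrase is already separated from the first one without needing a mark
in front of it.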

Nevertheless, Lina's suggestion does have some merit, since an RTL phrase
in an LTR context can theoretically be preceded by a native-Arabic number,
and will "stick" to it unless an LRM comes between them. (Note, however,
that an opposite-direction phrase will *not* "stick" to a preceding
"European" number, even without a mark between them. It will "stick" to a
"European" number following it, but the mark after it prevents that.) Thus,
to deal with the native-Arabic case, we either have to put a mark before
opposite-direction phrases or to treat native-Arabic numbers as
opposite-direction phrases in an LTR context.
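
The difference between the two kinds of numbers can be checked against the
Unicode character database. The following is a small sketch using Python's
standard unicodedata module; the sample characters are my own choice for
illustration.

```python
import unicodedata

# Sample characters chosen for illustration only.
samples = {
    "European digit 3": "3",
    "Arabic-Indic digit 3": "\u0663",
    "Hebrew letter alef": "\u05d0",
    "Arabic letter alef": "\u0627",
    "LRM": "\u200e",
}

# Map each sample to its Unicode bidirectional class.
bidi_classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}

for name, bidi_class in bidi_classes.items():
    print(name, bidi_class)
```

European digits are class EN and Arabic-Indic digits are class AN, so the
bidi algorithm resolves them by different rules, which is consistent with
the different "sticking" behavior described above. LRM is a strong L
character, which is why it can separate an RTL phrase from a neighboring
number.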

I only recently realized that the native-Arabic number case is problematic
in this respect. It has not come up because it is very rare for
native-Arabic numbers to be used in an LTR context.

> ============
> 3. General comment on changing the dir semantics in the HTML standard
> itself. In HTML5 dir actually duplicates BDI (which seems to be
> redundant), and there is no markup to get back to the old behavior
> (LRE/RLE ... PDF equivalents).

BDI is pretty much redundant, except as an easy-to-use shorthand. It was
introduced when we were trying to keep backward compatibility.

After much anguish over a space of two years, we eventually decided that
backward compatibility had to be sacrificed because:
- It is more important to get the right behavior going forwards
- It is impossible to keep backward compatibility without having to use
awkward syntax to get the right behavior. (And BDI is in fact awkward in
various circumstances.) We tried three or four different ways to do that -
but none of them worked.
- IE8 already changed the behavior of dir, without any spec telling it to
do so, to something resembling isolation. This is still the case through
IE11. Surprisingly, the change did not result in much breakage. Thus, we
felt that the price of losing backward compatibility was not too high.

One can get the old behavior using CSS, either by rules:

[dir='ltr'],[dir='rtl'] {unicode-bidi:embed}
[dir='auto'],bdi[dir] {unicode-bidi:isolate}
bdo[dir] {unicode-bidi:bidi-override}

or inline directly on the element: <span dir="..."

> The use case as above, but when the
> concatenated fragments are expected to flow from right to left:
> What you'd expect to see:
> the concatenation of tokens is: TXET-DETCEJNI-LTR * TXET-LTR

When one needs two or more injected opposite-direction phrases to flow in
the opposite direction overall, one should wrap them in an element whose
dir attribute explicitly declares that direction. Relying on the phrases to
stick to each other automatically to get that overall order, without
declaring it explicitly, has several problems that argue against ever using
that approach:
- The overall ordering between the two is an important part of the page's
design and should be reflected in explicit mark-up
- The approach completely breaks down when there are three or more phrases
and they may have different directionalities.
- The approach does not work in IE8+.

This whole spec issue is not talked about here because this article is not
the place to do it. It was discussed in various places and all these
factors were considered before the spec was changed.

> ============
> 4. "dynamic use cases..."
> Comment: Again, a comment on the standard itself. A non-negligible
> "dynamic" case is editable text. Currently, the standard doesn't address
> inline formatting (bidirectional embeds, isolates, overrides) in editable
> text.

An editor built around contenteditable could introduce the mark-up talked
about in this article into the text while it is being entered, either in
response to specific user requests to do so, or by default by trying to
guess the user's intentions, but that is indeed way beyond the scope of
this article.

Of course, one could argue that the contenteditable spec itself should
contain features that would make it easier to implement such an editor, but
the contenteditable spec has bigger problems than bidi, and little work has
been done to figure out what exactly such features might be. I should add
that I am not aware of any editor that allows the user to explicitly set
and change inline directionality. Word sets it by default for neutral
characters according to the directionality of the keyboard language used to
enter the character, but does not allow the user to set it or change it
explicitly. I believe that the reason that no editor has this feature is
that no one has figured out a way to do it that does not wind up being too
confusing to use.

> 5. "There are some situations where you may not be able to use the markup
> described in the previous section. In HTML these include the title element
> and any attribute value.
> In these situations you have to use the invisible Unicode characters that
> produce the same results..."
> Comment: For perfect isolation, one should enclose an embedded phrase in 2
> pairs of characters (unless RLI, LRI, FSI, PDI are supported). The first
> pair of characters consists of one of U+200E LEFT-TO-RIGHT MARK (LRM) or
> U+200F RIGHT-TO-LEFT MARK (RLM) [choose the one consistent with the base
> text direction] AND one of U+202B RIGHT-TO-LEFT EMBEDDING (RLE) or U+202A
> LEFT-TO-RIGHT EMBEDDING (LRE) [choose the one to match the desired
> embedded phrase direction]. This corresponds to the markup <span
> dir="rtl"> or <span dir="ltr">. The second pair of characters consists of
> U+202C POP DIRECTIONAL FORMATTING (PDF) AND one of U+200E LEFT-TO-RIGHT
> MARK (LRM) or U+200F RIGHT-TO-LEFT MARK (RLM) again. That being said,
> LRM/RLM may not be mandatory in certain contexts (which applies also to
> LRE/RLE or corresponding markup BTW).

You are quite right: the article should also recommend using the LRM or RLM
(as it did in the HTML4 markup approach). Once again, I believe that an
LRM/RLM after (and only after) every opposite-direction phrase is
sufficient in the vast majority of cases (the exception being when native
Arabic numbers can appear in an LTR context).
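
For places where markup cannot be used (the title element, attribute
values), the "embedding plus a mark after" approach could be sketched in
Python as below. The function name and its parameters are hypothetical, and
the sketch assumes the caller already knows both the phrase direction and
the base direction.

```python
LRM, RLM = "\u200e", "\u200f"                 # directional marks
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"  # embeddings and their terminator


def embed_for_plain_text(phrase, phrase_dir, base_dir):
    """Hypothetical helper: wrap a phrase in LRE/RLE ... PDF and follow it
    with the mark that matches the base direction of the context."""
    opener = RLE if phrase_dir == "rtl" else LRE
    mark = LRM if base_dir == "ltr" else RLM
    return opener + phrase + PDF + mark
```

For example, embed_for_plain_text(name, "rtl", "ltr") yields RLE + name +
PDF + LRM, which can then be placed in a title or attribute value. Adding a
mark before the opening character as well, as Lina suggests, would cover the
native-Arabic number case.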

> Regards,
> Lina Kemmel
>   Bidi architect
>   IBM
Received on Monday, 24 February 2014 10:36:12 UTC
