bdi definition (Additional requirements for Bidi in HTML, sections 2.1, 3.1, 3.3)

This is a somewhat long (but I hope also somewhat amusing) attempt to deal
with a number of interrelated issues that have arisen in connection with the
bdi (BiDirectional Isolate) attribute proposed by section 2.1 of "Additional
Requirements for Bidi in HTML"
(<http://www.w3.org/TR/html-bidi/#bidi-isolation>).

(The name bdi is likely to be replaced with something more meaningful. This
is a separate issue that I am ignoring here.)


--- Recap

The bdi attribute is currently defined as making an inline element
directionally isolated from its surroundings by making it behave as if it
were surrounded with strong-directional characters of the last explicit
embedding level within which it appears. For example:

<div dir=ltr>
<span dir="rtl" bdi="yes">PURPLE PIZZA</span> - 3 reviews
</div>

would be displayed the same as

<div dir=ltr>
&lrm;<span dir="rtl">PURPLE PIZZA</span>&lrm; - 3 reviews
</div>

i.e. as

AZZIP ELPRUP - 3 reviews

and not as "3 - AZZIP ELPRUP reviews", which is the case currently, without
bdi.

The proposed definition also includes balancing any missing and extra PDF
characters in the content.
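
As a rough illustration of the two parts of this definition (the imaginary
strong characters and the PDF balancing), here is a minimal Python sketch.
The function names are made up for illustration, and the balancing strategy
(drop a PDF that closes nothing, append a PDF for every unclosed
LRE/RLE/LRO/RLO) is only one plausible reading of the proposal, not its
wording.

```python
LRM, RLM = "\u200e", "\u200f"            # strong LTR / RTL marks
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"
LRO, RLO = "\u202d", "\u202e"
OPENERS = {LRE, RLE, LRO, RLO}


def balance_pdf(text):
    """Assumed strategy: drop stray PDFs, close openers left unclosed."""
    out, depth = [], 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            if depth == 0:
                continue                 # extra PDF: nothing inside to close
            depth -= 1
        out.append(ch)
    return "".join(out) + PDF * depth    # missing PDFs are appended


def separate(text, parent_dir):
    """Plain-text stand-in for bdi: marks of the parent's direction around it."""
    mark = LRM if parent_dir == "ltr" else RLM
    return mark + balance_pdf(text) + mark


# The PURPLE PIZZA example: the content ends up between two LRMs.
wrapped = separate("PURPLE PIZZA", "ltr")
print([hex(ord(c)) for c in (wrapped[0], wrapped[-1])])
```

This only produces the character sequence that the definition says the
element should behave like; the actual reordering is still left to the
Unicode Bidi Algorithm.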

Now, the core issue:


--- Separation and isolation

The current definition directionally separates the bdi element not only from
the text around it, but also the text before it from the text after it. The
latter makes a difference when the text before and the text after the bdi
element both have a direction (implicit or explicit) opposite to that of the
last explicit embedding level - e.g. the direction of the parent element they
all share - and neither has bdi. The bdi element then prevents the text
before and the text after it from combining into a single implicit
directional phrase.

In the past, Mati and I discussed, and fantasai later independently
suggested, an alternate definition that avoids directionally isolating the
text before the bdi from the text after it. Instead of surrounding the bdi
element with imaginary strong-directional characters, it puts the text
inside the element in a separate bidi paragraph, thus isolating it from its
surroundings, and then treats the whole bdi element as a neutral character
from the point of view of the surrounding text.

This (or something very close to it) is what already happens today in all
major browsers with an <input> element: its value is displayed unaffected by
the text around it, and the <input> is treated by its surrounding text as if
it were neutral. Thus, "MAN <input value='bites'> CAT" is displayed as "TAC
[bites   ] NAM" whether it is in a <div dir=ltr> or a <div dir=rtl>, i.e.
the "MAN" is allowed to stick to the "CAT" right across the <input>. Under
the new definition of bdi, "MAN <span dir=ltr bdi>bites</span> CAT" would
also be displayed as "TAC bites NAM" whether in a <div dir=ltr> or in a <div
dir=rtl>. Under the original one, it would come out as "NAM bites TAC" in the
<div dir=ltr>.

I would like to label the new definition as *isolation* and the original
definition as *separation*.
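
For anyone who wants to experiment with the two definitions, here is a toy
Python model. It is a sketch under heavy assumptions, not an implementation
of the UBA: lowercase letters count as LTR, UPPERCASE as RTL, everything
else as neutral; digits, explicit embeddings and mirroring are not
modelled; and the names visual() and render() are invented here. A bdi
element is written as a (dir, text) tuple.

```python
LRM, RLM = "\u200e", "\u200f"      # invisible strong LTR / RTL marks
PUA = 0xE000                       # private-use chars stand in for elements


def bidi_class(ch):
    """Toy classes: lowercase = L, UPPERCASE = R, anything else neutral."""
    if ch == LRM or ch.islower():
        return "L"
    if ch == RLM or ch.isupper():
        return "R"
    return "N"


def visual(text, base):
    """Display order of one paragraph; base is 'ltr' or 'rtl'."""
    base_cls = "L" if base == "ltr" else "R"
    cls = [bidi_class(c) for c in text]
    n, i = len(cls), 0
    while i < n:                   # resolve runs of neutrals (UBA N1/N2)
        if cls[i] != "N":
            i += 1
            continue
        j = i
        while j < n and cls[j] == "N":
            j += 1
        before = cls[i - 1] if i > 0 else base_cls
        after = cls[j] if j < n else base_cls
        cls[i:j] = [before if before == after else base_cls] * (j - i)
        i = j
    if base == "ltr":
        levels = [0 if c == "L" else 1 for c in cls]
    else:
        levels = [1 if c == "R" else 2 for c in cls]
    order = list(range(n))
    for lvl in range(max(levels, default=0), 0, -1):   # UBA L2 reversal
        i = 0
        while i < n:
            j = i
            while j < n and levels[order[j]] >= lvl:
                j += 1
            if j > i:
                order[i:j] = order[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(text[k] for k in order if text[k] not in (LRM, RLM))


def render(parts, base, mode):
    """parts: plain strings, or (dir, text) tuples standing for bdi elements."""
    mark = LRM if base == "ltr" else RLM
    outer, inner = "", {}
    for k, part in enumerate(parts):
        if isinstance(part, str):
            outer += part
            continue
        direction, content = part
        stand_in = chr(PUA + k)
        inner[stand_in] = visual(content, direction or base)
        # isolation: one neutral; separation: strong marks on both sides
        outer += stand_in if mode == "isolate" else mark + stand_in + mark
    return "".join(inner.get(ch, ch) for ch in visual(outer, base))


parts = ["MAN ", ("ltr", "bites"), " CAT"]
print(render(parts, "ltr", "isolate"))    # TAC bites NAM
print(render(parts, "ltr", "separate"))   # NAM bites TAC
```

In both modes the element's own content is laid out on its own; the only
difference in this model is whether the surrounding text sees a neutral or
a pair of strong marks in its place.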


--- Current browser usage

It turns out that <input> and <textarea> elements are not the only place
where the latest versions of all major browsers do isolation, or something
very much like it. (I have tested Firefox 3.6.3, IE 8, Safari 4.0.5, Chrome
5.0, and Opera 10.53.) There are also the following:

* style that takes the element out of the flow, e.g. float:left|right and
position:absolute.
* display:inline-block

Firefox and Opera also treat block elements generally (display:block
elements, to be precise) with isolation. However, Mozilla engineers have
already agreed in this forum that this is a bug. A full UBA paragraph break
as in the other browsers is the correct way to go.

Separation, on the other hand, has almost no precedent in today's browsers.
The only exception is an embedded block element with display:inline, which
until recently all browsers treated as a normal inline element, with no
separation or isolation. Firefox recently discovered that this is not
according to spec, and changed it to use separation. I am not sure if using
separation vs isolation was a conscious choice or what considerations went
into it.

*** Mozilla engineers: could you shed some light? ***

It is quite clear that isolation is a very natural choice for <input>,
<textarea>, and floating and absolutely-positioned elements. Nothing else,
including separation, makes much sense for them. But then again, they are a
different phenomenon than the bdi attribute: *we would not want to allow the
user to turn isolation off for them with bdi=no.*

With display:inline-block text appearing in a separate block box, it
probably does not make much sense to allow the user to turn isolation off
for it either. And the theoretical question of whether separation might work
somewhat better than isolation for it is moot in the absence of a clear need
for change: we would not want to disturb backward compatibility (or the
current browser interoperability).


--- Should bdi do isolation?

Thus, the question arises: given the browsers’ current preference for
isolation, perhaps we should use isolation for bdi too?

Some reasons for and against that are pretty obvious:

* Pro: isolation is a simpler, more intuitive, and more easily stated
definition than separation.
* Pro: isolation for bdi would make its behavior consistent with the cases
where the browser already uses isolation, making browser behavior that much
easier to understand and predict.
* Pro: isolation avoids the possibly difficult-to-implement case of a bdi
element coming between an LRE/RLE/LRO/RLO and its matching PDF.
* Con: if bdi does not do separation, section 3.1 of the proposal (the <br>
conundrum) no longer works.
* Con: isolation does not have a plain-text equivalent in the Unicode Bidi
Algorithm. You can't "isolate" a string using Unicode formatting characters.
* Con: as a result of the preceding item, isolation is probably
significantly harder to implement, and may carry significantly more
processing overhead when dealing with ordinary inline text.

***Reality check***: can anyone who has actually implemented bidi in browser
text processing weigh in on the extent to which the last item is true?

Of course, we could even allow the author to choose between separation and
isolation, by changing bdi’s value repertoire to something like
none|bdi|isolate|separate, where bdi would be a synonym for "separate". We
could even throw in something like "paragraph" or "break", to indicate a
full UBA paragraph break. The differences between their behavior are so
fine, however, that this does not seem very desirable.

Let us examine the effect of those fine differences on the way bdi would
work in text, trying to find any arguments for one or the other.


--- Isolation vs separation in text

Let's start with

<div dir=ltr>
read "DEAR <span dir=ltr bdi>john</span> AND SUSAN" today!
</div>

Under separation it would be displayed as:

read "RAED john NASUS DNA" today!

This is not very good. Under isolation, however, we would get:

read "NASUS DNA john RAED" today!

This certainly makes more sense. And the effect does not depend on the bdi
attribute being combined with dir. Under separation,

<div dir=ltr>
read "DEAR <span bdi>JOHN</span> AND SUSAN" today!
</div>

would be displayed as

read "RAED NHOJ NASUS DNA" today!

This is even worse than before, but isolation would still fix it.

So, do we have an argument for isolation?

Not necessarily. Opposite-direction phrases, such as the whole "DEAR ...
SUSAN" quote in our examples above, should really be surrounded in an
element that declares their direction. If they are not, as our quote isn't,
they are often displayed garbled. The garbling is most severe when they
contain opposite-direction inserts like our "john" above. If the whole "DEAR
... SUSAN" quote were surrounded in a <span dir=rtl>, as it should be, it
would of course come out as intended, with or without the bdi on "john" or
"JOHN".

Also, when an opposite-direction phrase contains arbitrary-direction bdi
inserts, we are talking about more than one level of logical embedding. This
is not a very common occurrence – as the rather forced character of our
example above testifies.

And it is certainly possible to make up other examples where one *does* want
to separate the text before the bdi element from the text following it,
e.g.:

<div dir=ltr>
i spoke to JOHN. <span dir=rtl bdi>SUSAN</span>, MIKE and ollie spoke to him
too.
</div>

Under separation, this comes out as:

i spoke to NHOJ. NASUS, EKIM and ollie spoke to him too.

This seems as good as it's going to get.

Under isolation, on the other hand, we get the very misleading

i spoke to EKIM ,NASUS .NHOJ and ollie spoke to him too.

It would be even more misleading if instead of "<span dir=rtl
bdi>SUSAN</span>", we had "<span dir=ltr bdi>susan</span>". Under isolation,
it would come out as:

i spoke to EKIM ,susan .NHOJ and ollie spoke to him too.

So, do we have a strong argument for separation?

No, it is also flimsy. "JOHN" and "MIKE" in our example need bdi no less
than "SUSAN". Without it, we can expect them to garble their surroundings.
With it, the example comes out as intended, whether we use separation or
isolation. Why would the author use bdi on one insert, but not the other
two? Also, having the coincidence of separate opposite-direction phrases
around the bdi element does not seem like a common occurrence either.

Nevertheless, this argument is perhaps less flimsy than the one for
isolation above. In web apps, different parts of the document are produced
by different layers of code. One layer may be using bdi; another might not.
In fact, the layer not using bdi may not be easily capable of doing so,
perhaps because it is limited to plain text.

It is worthwhile pointing out that there is no third kind of case. For
isolation and separation to differ, one needs opposite-direction text on
both sides of the bdi. It either makes up one logical phrase, or it doesn't;
there is no third choice.
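
The claim can be checked by brute force in a simplified model. The sketch
below is an assumption-laden reduction, not the UBA itself: it looks only
at the direction of the nearest strong text on each side of the bdi element
and at the base direction, and asks which direction the element's slot ends
up with. The function name is invented for illustration.

```python
from itertools import product


def slot_direction(before, after, base, mode):
    """Direction the surrounding text gives to the bdi element's slot.

    before/after: 'L' or 'R', the nearest strong text on each side (a side
    with no strong text counts as the base direction); base: direction of
    the last explicit embedding level.
    """
    if mode == "separate":
        return base                  # strong marks of the base direction
    # isolation: the element is a neutral, resolved as in UBA rules N1/N2
    return before if before == after else base


differ = [
    (before, after, base)
    for before, after, base in product("LR", repeat=3)
    if slot_direction(before, after, base, "isolate")
    != slot_direction(before, after, base, "separate")
]
print(differ)   # [('L', 'L', 'R'), ('R', 'R', 'L')]
```

Of the eight combinations, the only two in which the models disagree are
those with text of the same direction on both sides and a base direction
opposite to it.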

If anything, we seem to have a weak argument for separation over isolation.


--- bdi by default

A number of cases have been proposed where an element should have bdi *by
default*, usually since it makes no sense to let it affect and be affected
by what surrounds it. Here is a list:

* dir=auto elements (section 2.2)
* <a> (najib)
* <br> (section 3.1)
* block elements with inline display (fantasai and others)
* display:inline-block elements (section 3.3)

Would these work better with "separate" or "isolate"?

* dir=auto: weak preference for separation

The dir=auto case is no different than the explicit <span dir=ltr|rtl bdi>
cases considered above, where we have identified a weak preference for
separation.

* <a>: do not change current behavior

As with the explicit <span dir=ltr|rtl bdi> cases considered above, implicit
bdi on <a> would work better under isolation when the <a> is in the middle
of a coherent, undeclared opposite-direction phrase, but better under
separation when the <a> is both preceded and followed by opposite-direction
text that does not form a single logical phrase and does not use bdi on
either side.

In my opinion, however, such considerations are moot, since <a> should *not*
become bdi by default. There is simply too much danger that it will break
existing documents. This will happen whenever the link is part of an
undeclared opposite-direction phrase that either begins or ends with it,
e.g.:

<div dir=rtl>"click <a>here</a>" IS NOT THE WAY TO DO LINKS.</div>

Currently, it is displayed as intended:

SKNIL OD OT YAW EHT TON SI "click *here*"

With bdi turned on for <a> by default, however, it would come out garbled
regardless of whether we use isolation or separation:

SKNIL OD OT YAW EHT TON SI "*here* click"

One could try to argue that turning bdi on by default for <a> could also
fix some current documents, but it does not seem likely: if the document is
currently displayed garbled, the author would probably fix it. (This does
not apply to another case where we did suggest changing current behavior,
i.e. applying the direction of the parent of the <title> element to it,
because few authors have any idea how to fix its display and the current
behavior is unreliable anyway.)

* <br>: separation

As stated in 3.1, we want <br> to offer directional separation by default,
while allowing for an option to disable it. The proposed solution is to make
it bdi by default - but this only works if bdi uses separation, not
isolation.

However, in the absence of separation, can we deal with <br> by defining it
as a full UBA paragraph break? After all, in practice, there is very little
difference between separation and a full UBA paragraph break.

(One difference is that separation does not terminate the effects of LRE,
RLE, LRO, and RLO. However, the use of these characters is discouraged
wherever mark-up can be used. Another difference is that for an inline
element like <br> that can appear nested in any number of other inline
elements, each with its own dir, separation is much easier to define and
implement than a full UBA paragraph break. However, this does not seem to be
a big consideration either, since a reasonable implementation for a UBA
paragraph break inside an inline element has been described in 3.1, and
seems to be the right thing to do for <br> in <pre> anyway, and there is no
avoiding using it for inline elements with display:block, as mentioned in
3.3.)

Unfortunately, defining <br> as a UBA paragraph break goes against past W3C
and Unicode Consortium decisions. Worse, it would not allow a way to get a
non-separating <br> when necessary.

Another possibility is to add another element like <br>, with one including
a UBA paragraph break, and the other behaving according to the current spec.
Clearly, this is not a very attractive solution either...

Thus, for <br>, we really do want bdi to do separation.

* block elements with inline display: separation

First a bit of history. The HTML 4 spec says (
http://www.w3.org/TR/REC-html40/struct/dirlang.html#style-bidi):

"When a block element that does not have a dir attribute is transformed to
the style of an inline element by a style sheet, the resulting presentation
should be equivalent, in terms of bidirectional formatting, to the
formatting obtained by explicitly adding a dir attribute (assigned the
inherited value) to the transformed element."

Currently, the only browser that implements this is Firefox. The others
treat it as any other display:inline element, with no separation or
isolation. Clearly, then, it would be beneficial to have its behavior be
subject to bdi, since both separation and no separation are viable
behaviors.

As pointed out by Martin Dürst, the reason the spec was formulated that way
is that there are block elements that provide handy formatting that is not
available in any inline element. Sometimes, however, the author might want
that effect on a single line. An example would be to use <ol
style="display:inline"> to get an inline numbered list: "1) apple 2) orange
3) pear". To get that effect in the presence of some opposite-direction
text, one needs the same bidi behavior as one had for the block without
display:inline, or as close as possible to it. The current definition
attempts to do that, but clearly does not go far enough.

The original behavior for a block element is full UBA paragraph breaks.
Separation is clearly much closer to that than isolation. And Firefox does
use it – even though that is not currently in accordance with the spec.

Backward compatibility is not an issue because no current browser behavior
matches the spec anyway.

We therefore conclude that defining bdi to use separation is preferable for
display:inline block elements, for which bdi would be on by default.

* display:inline-block elements: do not change current behavior

Unaware that display:inline-block already uses isolation in all the
browsers, section 3.3 of the proposal suggested making it bdi by default.

As noted above, changing it to use separation would be problematic because it
would break backward compatibility (and, at least for the short term,
browser interoperability). However, since display:inline-block text appears
in a separate block box, it probably does not make sense to allow the user
to turn isolation completely off for it anyway, so we do not really want to
make it subject to bdi. Thus, bdi can be explicitly defined to apply
exclusively to display:inline elements (and perhaps to display:run-in
elements when they behave like display:inline ones).

Thus, with <br> and display:inline on block elements strongly arguing for
separation, and with no strong usage argument for isolation, it is my
opinion that bdi should use separation.


--- The fallout

In light of the above, I would like to suggest the following open issue
resolutions:

2.1.c: No change. The definition of bdi=yes|bdi will stay as specified in
the proposal.

2.1.d: No change. Neither <a> nor any other elements in addition to the <br>
specified in the proposal will have bdi=yes by default.

3.3.a:

   1. Inline elements with display:block style will be treated like ordinary
   block elements, i.e. serve as UBA paragraph breaks between the text preceding
   and following them.
   2. Block elements with display:inline style will have bdi=yes by default.



3.3.b: The bdi attribute will have no effect on elements with display that
is neither inline nor run-in acting as inline.

3.3.c: The bdi attribute will have no effect on elements with
float:left|right. They will continue to be treated as separate UBA
paragraphs removed from their context, as they are today.

3.3.d: The bdi attribute will have no effect on elements with
position:absolute. They will continue to be treated as separate UBA
paragraphs removed from their context, as they are today.


Aharon

Received on Sunday, 6 June 2010 15:51:26 UTC