Re: bidi proposal from Aharon (Vladimir) Lanin on 2010-03-17 (public-i18n-bidi@w3.org from January to March 2010)

From: Aharon (Vladimir) Lanin <aharon@google.com>
Date: Wed, 17 Mar 2010 16:05:53 +0200
To: public-i18n-bidi@w3.org
Message-ID: <6b45e1b51003170705r558ebe3am5644eb5c1500b062@mail.gmail.com>
Finally getting to repost my reply. Sorry for the delay.


> bdi=yes
> # The element, even when empty, is to be displayed as if it were surrounded
> # with strong-directional characters of the last explicit embedding level
> # within which it appears.
>
> Why not treat the element as U+FFFC instead? If the intent is isolation so
> that it doesn't affect surrounding text, wouldn't treating it as a neutral
> would make more sense than treating it as a strong character?

Mati and I considered this approach, but were less happy with it for two
reasons:
1. It may prove difficult to implement.
2. The strong character approach not only isolates the element from its
surroundings, and the surroundings from the element, but also isolates what
comes before the element from what comes after the element. This is a useful
property; for example, it is what makes it possible to make <br> a bidi
separator by making it <br bdi="yes"> by default.

> dir=auto
> # Make simple direction estimation functionality available in the browser by
> # allowing the dir attribute to take on new values indicating that the user
> # agent is responsible for estimating the direction of the element's contents.
> # One such dir attribute value would specify using the word-count algorithm,
> # defined and discussed in Appendix A. Another would specify the first-strong
> # algorithm, as defined by the UBA.
>
> The Mozilla devs I talked to are skeptical that authors would know which
> algorithm to choose.

This is indeed an issue, and the primary reason that we have left it open
whether to expose two different algorithms.

> Also, scanning the entire text has performance
> implications, especially for large elements

True, but:
- This is not done by default for anything; the page explicitly requested
it.
- We are considering putting a limit on the number of characters or perhaps
the number of words scanned (see "possible" tweaks at the end of the appendix).
- Even if the whole text is scanned, the amount of processing is O(n); the
browser will then spend at least O(n) time preparing the text for display by
the underlying platform, probably with a larger constant factor.

> and for when there are DOM
> mutations involved.

The estimation only needs to be done when the browser decides to render the
page.

> We suggest considering the following:
>  - Of the first 64 characters after and including the first strong
>    character, if any (or some low percentage) are strong RTL,
>    consider the element's computed base direction to be RTL, else
>    LTR.

This is basically the any-RTL algorithm, and I think it is generally less
useful than either first-strong or word-count. It fails on casual LTR text
"peppered" with some RTL words (e.g. a chat between expats from RTL
countries), as well as on scholarly LTR text that uses some RTL words for
precision (e.g. a discussion on biblical topics).
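
To make that failure mode concrete, here is a rough Python sketch of the
suggested heuristic. It is an illustration under assumptions, not anyone's
actual implementation: the function name any_rtl_direction and its signature
are hypothetical, only the "if any" variant is shown (not the "low
percentage" one), and a "character" is taken to be a Python code point.

```python
import unicodedata

# Bidi classes treated as "strong" here: L is LTR; R and AL are RTL.
STRONG_RTL = {"R", "AL"}
STRONG = STRONG_RTL | {"L"}


def any_rtl_direction(text, window=64):
    """Sketch of the suggested heuristic (hypothetical name and signature).

    Look at `window` characters starting at the first strong character and
    return "rtl" if any of them is strong RTL, else "ltr".
    """
    start = next(
        (i for i, ch in enumerate(text)
         if unicodedata.bidirectional(ch) in STRONG),
        None,
    )
    if start is None:
        return "ltr"  # no strong character at all
    scanned = text[start:start + window]
    if any(unicodedata.bidirectional(ch) in STRONG_RTL for ch in scanned):
        return "rtl"
    return "ltr"


# Mostly-English chat line with a single Hebrew word mixed in.
chat = "I told him \u05e9\u05dc\u05d5\u05dd and then we all went home"
print(any_rtl_direction(chat))
print(any_rtl_direction("plain English text"))
```

With this sketch, the chat line is estimated as RTL because of its single
Hebrew word, even though it is overwhelmingly LTR.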

Coming back to the issue of user confusion between two different estimation
algorithms, there is, of course, the option of offering just the word-count
or just the first-strong (or just any-rtl, which, as I said above, I oppose).
I know that several of us would be quite unhappy with just first-strong. And
I also know that there at least used to be (but I am not 100% certain there
still are) several of us who would be quite unhappy with just word-count.

I have suggested another alternative in the past, but it did not merit any
replies at all :-(. Perhaps I should give it another try. Basically, the
idea is to unify first-strong and word-count under the guise of giving
the page author control over how much of the content is scanned:

- The syntax for the dir value is "ltr|rtl|auto[0-9]*" or some more
palatable version.
- All the auto values always use word-count - but stop after scanning the
first n strongly-directional words. Thus, by using a number, the page author
specifies how thorough a job the estimation should do.
- As a result, "auto1" is almost identical to first-strong! The
exception is that "weak ltr" values, e.g. "(212) 123 4567", wind up being
treated as LTR, which is a good thing. Thus, we wind up exposing
first-strong as well as word-count.
- Plain "auto" is a synonym for either "auto1" or some likely word-county
value, e.g. "auto64" - TBD.
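
The autoN idea might look roughly like this in Python. This is only a sketch
under stated assumptions: resolve_dir and word_direction are hypothetical
names; whitespace-delimited words classified by their first strong character,
combined with a simple majority vote, stand in for the real word-count
algorithm of Appendix A (which is not reproduced here); and the default limit
of 64 for plain "auto" is just one of the TBD options.

```python
import re
import unicodedata


def word_direction(word):
    """Return "ltr" or "rtl" from the word's first strong character,
    or None if the word has no strong character (e.g. digits only)."""
    for ch in word:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    return None


def resolve_dir(value, text, default_limit=64):
    """Resolve a dir value of the form ltr|rtl|auto[0-9]* (hypothetical API).

    For autoN, scan only the first N strongly-directional words and take a
    simple majority (an assumed stand-in for the word-count algorithm).
    """
    if value in ("ltr", "rtl"):
        return value
    match = re.fullmatch(r"auto([0-9]*)", value)
    if match is None:
        raise ValueError(f"unsupported dir value: {value!r}")
    limit = int(match.group(1)) if match.group(1) else default_limit

    ltr = rtl = 0
    for word in text.split():
        if ltr + rtl >= limit:
            break  # scanned as many strong words as the author asked for
        direction = word_direction(word)
        if direction == "ltr":
            ltr += 1
        elif direction == "rtl":
            rtl += 1
    # No strong words at all (e.g. a phone number) falls through to LTR.
    return "rtl" if rtl > ltr else "ltr"


text = "\u05e9\u05dc\u05d5\u05dd hello world again"
print(resolve_dir("auto1", text))
print(resolve_dir("auto", text))
print(resolve_dir("auto1", "(212) 123 4567"))
```

Here "auto1" follows the first strong word (Hebrew, so RTL), plain "auto"
sees three LTR words against one RTL word and picks LTR, and the phone number
has no strong words and comes out LTR.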


> # One possibility for such a specification would be with a new HTML attribute:
> # hflip="no|yes|ltr|rtl".
>
> Is there a use case for the 'yes' value?

I originally suggested it just for completeness, but it turns out to be
generally useful. A completely LTR and bidi-unaware page may need to use
both left- and right-arrows or some other pair of mirror images, and using
hflip=yes beats maintaining a separate image.

> # Just as with <br>, in Firefox and Opera, an embedded block element provides
> # no bidi separation between the text preceding and following it, while IE and
> # WebKit treat it as a UBA paragraph break. ... The text before and after a
> # block element is said to form "anonymous blocks", and it is well accepted
> # that blocks should constitute UBA paragraphs.
>
> Yes, I would consider this a bug in Firefox and Opera: their behavior is
> clearly violating the CSS spec.
>
> A related problem is block elements that are rendered as display: inline.
> The CSSWG has an issue filed on having these default to unicode-bidi: embed
> in our sample HTML4 style sheet. (Looking at your document, they might also
> need bdi=yes.)

If I remember correctly, all browsers until very recently treated text in a
<div style="display:inline"> as if it were in a <span>. I now see that
Firefox 3.6 has indeed broken ranks and is now treating it as if it were a
<span style="unicode-bidi:embed">. I am not sure what the rationale for this
is. IMO, using unicode-bidi:embed without specifying direction, i.e.
defaulting to the parent's direction, is a technique in search of a purpose.
I think I liked the old behavior better.

I agree that it would be best to come up with an explicit spec of what the
expected bidi behavior is in all these strange combinations of element type
and display type.

>
> ====== Editorial comments ======
>
> # HTML, the UBA is
>
> s/HTML/In HTML/ ?
>

Yup, already done.


> # This is because text displayed in the wrong direction is often garbled.
>
> s/wrong direction/wrong base direction/

Good suggestion.

> # and "MAKE html WORK FOR YOU" is displayed in LTR as
> #
> # EKAM html UOY ROF KROW
> #
> # instead of the intended
> #
> # UOY ROF KROW html EKAM
>
> I suggest having the indented output be MAKE html WORK FOR YOU,
> and rearranging the previous two strings as required. I think
> that might get the point across a little better. :) But I suppose
> the disadvantage is that the source order is no longer in
> logical order.

Probably the best solution is to make this an LTR string, i.e.:

and "please tell JOHN to go to school" is displayed in RTL as

to go to school NHOJ please tell

instead of the intended

please tell NHOJ to go to school

> In 2.1:
>
> # The UBA's rendering of a piece of text depends not only on the
> # explicitly declared direction in which it appears (e.g. the dir
> # attribute value on the parent element)
>
> s/explicitly declared direction/base directional context/ or somesuch

It should probably be just "base direction".

> s/the dir/as set by the dir/

Righteous.

> # The bidi formatting characters LRO, RLO, LRE, RLE, and PDF have
> # particularly strong influence on what surrounds them.
>
> This is somewhat overstated and vague. I would suggest something like
>
> | The bidi formatting characters LRO, RLO, LRE, RLE, and PDF can
> | fine-tune the bidi algorithm by either overriding the implicit
> | directionality of characters (LRO, RLO) or creating an embedded
> | base directional context (LRE, RLE)

I'd rather stay vague and not get into the specific effects of these
characters, which would come at the price of too much detail on a relatively
minor point. I am not sure why you think that the current formulation is
overstated, though. I think that these characters indeed do have a
particularly strong influence. For example, in an LTR context, putting an
RLE before "please tell JOHN to go to school" changes its display to "to go
to school NHOJ please tell"; putting a simple strong RTL character, e.g.
RLM, there has no effect at all. But in all examples where RLM does have an
effect, RLE will have at least as much effect. RLO, of course, speaks for
itself.

> # Most documents
>
> s/Most/Many/ ?

Sounds good.

> # Arbitrary-direction entities also don't cause a problem when they
> # are displayed as a separate block element (which is treated as a
> # separate "paragraph" in UBA terms).
>
> Append
>  | and the base direction is correctly marked up with the HTML 'dir'
>  | attribute.


How about:

Arbitrary-direction entities also don't cause a problem when they are
displayed as a separate block element (which is treated as a separate
"paragraph" in UBA terms), with a dir attribute declaring their correct base
direction.

> # <span dir="rtl"> only explicitly states the direction
>
> s/direction/base direction/

Yes

On Fri, Mar 5, 2010 at 11:21 PM, fantasai <fantasai.lists@inkedblade.net> wrote:

> Richard Ishida asked me to resend my original comments on the draft to the
> new public mailing list. Forwarded messages below.
>
> Aharon, if you repost your reply, I will repost my follow-up as well.
>
> Thanks,
>
> ~fantasai
>
> -------- First Message --------
> Subject: Re: A Proposal for HTML Improvements for Bidi: Please review
> Date: Wed, 17 Feb 2010 13:46:46 -0800
> From: fantasai
>
> On 02/17/2010 12:47 PM, Richard Ishida wrote:
> > Dear i18n WG members,
> >
> > Aharon Lanin of Google has been preparing a proposal for additions to HTML that will
> > address practical issues when dealing with bidi text - particularly when handling
> > text that is inserted into a page from a database, etc.  He has had some feedback on
> > the text from bidi experts, and has prepared a new version of the proposal (however,
> > section 1.2 still needs some work - hopefully ready by the end of this week).
> >
> > This initial draft is available at http://www.w3.org/International/wiki/BidiProposal
>
> From a quick scan, it looks like a very thorough and intelligent discussion of
> problems with the existing BiDi infrastructure in HTML and CSS. My only comment
> at the moment is that a number of these problems should be solved at the CSS
> level, in addition to (or in some cases, instead of) the HTML level.
>
> The CSSWG has at least one open BiDi issue on CSS2.1. I'm happy to take back any
> other changes we need to make to the CSS2.1 specs as a result of these discussions.
> (At least two of the proposals require new features, such as additional values to
> 'direction' and 'unicode-bidi'. These might need to be filed for CSS3 Text Layout
> instead.)
>
> ...
>
> ~fantasai
>
>
>
> -------- Second Message --------
> Subject: Re: Bidi proposal draft
> Date: Wed, 24 Feb 2010 02:59:23 -0800
> From: fantasai
>
> On 02/22/2010 02:22 AM, Richard Ishida wrote:
> > Hi Aharon,
> >
> > I have migrated the bidi proposal wiki text to the format needed to
> > publish as a Working Draft.  There may be a couple of additional things
> > to do or mistakes, but hopefully most of the work is now done.
> >
> > See http://www.w3.org/International/docs/html-bidi-requirements/
>
> ====== Substantive comments ======
>
>
> bdi=yes
> # The element, even when empty, is to be displayed as if it were surrounded
> # with strong-directional characters of the last explicit embedding level
> # within which it appears.
>
> Why not treat the element as U+FFFC instead? If the intent is isolation so
> that it doesn't affect surrounding text, wouldn't treating it as a neutral
> would make more sense than treating it as a strong character?
>
> dir=auto
> # Make simple direction estimation functionality available in the browser by
> # allowing the dir attribute to take on new values indicating that the user
> # agent is responsible for estimating the direction of the element's contents.
> # One such dir attribute value would specify using the word-count algorithm,
> # defined and discussed in Appendix A. Another would specify the first-strong
> # algorithm, as defined by the UBA.
>
> The Mozilla devs I talked to are skeptical that authors would know which
> algorithm to choose. Also, scanning the entire text has performance
> implications, especially for large elements and for when there are DOM
> mutations involved. We suggest considering the following:
>  - Of the first 64 characters after and including the first strong
>    character, if any (or some low percentage) are strong RTL,
>    consider the element's computed base direction to be RTL, else
>    LTR.
>
> # One possibility for such a specification would be with a new HTML attribute:
> # hflip="no|yes|ltr|rtl".
>
> Is there a use case for the 'yes' value?
>
> # Just as with <br>, in Firefox and Opera, an embedded block element provides
> # no bidi separation between the text preceding and following it, while IE and
> # WebKit treat it as a UBA paragraph break. ... The text before and after a
> # block element is said to form "anonymous blocks", and it is well accepted
> # that blocks should constitute UBA paragraphs.
>
> Yes, I would consider this a bug in Firefox and Opera: their behavior is
> clearly violating the CSS spec.
>
> A related problem is block elements that are rendered as display: inline.
> The CSSWG has an issue filed on having these default to unicode-bidi: embed
> in our sample HTML4 style sheet. (Looking at your document, they might also
> need bdi=yes.)
>
> ====== Editorial comments ======
>
> # HTML, the UBA is
>
> s/HTML/In HTML/ ?
>
> # This is because text displayed in the wrong direction is often garbled.
>
> s/wrong direction/wrong base direction/
>
> # and "MAKE html WORK FOR YOU" is displayed in LTR as
> #
> # EKAM html UOY ROF KROW
> #
> # instead of the intended
> #
> # UOY ROF KROW html EKAM
>
> I suggest having the indented output be MAKE html WORK FOR YOU,
> and rearranging the previous two strings as required. I think
> that might get the point across a little better. :) But I suppose
> the disadvantage is that the source order is no longer in
> logical order.
>
> In 2.1:
>
> # The UBA's rendering of a piece of text depends not only on the
> # explicitly declared direction in which it appears (e.g. the dir
> # attribute value on the parent element)
>
> s/explicitly declared direction/base directional context/ or somesuch
> s/the dir/as set by the dir/
>
> # The bidi formatting characters LRO, RLO, LRE, RLE, and PDF have
> # particularly strong influence on what surrounds them.
>
> This is somewhat overstated and vague. I would suggest something like
>
> | The bidi formatting characters LRO, RLO, LRE, RLE, and PDF can
> | fine-tune the bidi algorithm by either overriding the implicit
> | directionality of characters (LRO, RLO) or creating an embedded
> | base directional context (LRE, RLE)
>
> # Most documents
>
> s/Most/Many/ ?
>
> # Arbitrary-direction entities also don't cause a problem when they
> # are displayed as a separate block element (which is treated as a
> # separate "paragraph" in UBA terms).
>
> Append
>  | and the base direction is correctly marked up with the HTML 'dir'
>  | attribute.
>
> # <span dir="rtl"> only explicitly states the direction
>
> s/direction/base direction/
>
> ~fantasai
>
>
Received on Wednesday, 17 March 2010 15:58:17 UTC