Re: bidi proposal from fantasai on 2010-03-17 (public-i18n-bidi@w3.org from January to March 2010)

From: fantasai <fantasai.lists@inkedblade.net>
Date: Wed, 17 Mar 2010 14:32:50 -0700
To: "Aharon (Vladimir) Lanin" <aharon@google.com>, public-i18n-bidi@w3.org
Message-ID: <4BA14A82.3030604@inkedblade.net>

On 03/17/2010 07:05 AM, Aharon (Vladimir) Lanin wrote:
> Finally getting to repost my reply. Sorry for the delay.

Reposting my follow-up. :)

On 02/24/2010 06:08 AM, Aharon (Vladimir) Lanin wrote:
 >
 > >  Also, scanning the entire text has performance
 > >  implications, especially for large elements
 >
 >
 > True, but:
 > - This is not done by default for anything; the page explicitly
 >   requested it.

I wouldn't be surprised if some people use dir="auto" for a much
wider scope than you are expecting.

 > - We are considering putting a limit on the number of characters or
 >   perhaps the number of words scanned (see "possible" tweaks at end of
 >   appendix)
 > - Even if the whole text is scanned, the amount of processing is O(n);
 >   the browser will then spend at least O(n) time preparing the text for
 >   display by the underlying platform, probably with a larger factor.

The CSSWG is considering selectors, e.g. :rtl and :ltr, that let authors
style elements differently based on their HTML-resolved base direction.
Currently, selectors do not need to scan the entire contents of
the element, and doing so would be a significant performance hit.

 > >  and for when there are DOM mutations involved.
 >
 > The estimation only needs to be done when the browser decides to render
 > the page.

The browser has to render the page live. If JavaScript is manipulating
the DOM as the user interacts with the page, then the browser is
deciding to render the page very frequently.

 >     We suggest considering the following:
 >       - Of the first 64 characters after and including the first strong
 >         character, if any (or some low percentage) are strong RTL,
 >         consider the element's computed base direction to be RTL, else
 >         LTR.
 >
 >
 > This is basically the any-RTL algorithm, and I think is generally less
 > useful than either first-strong or word-count. It fails on casual LTR
 > text "peppered" with some RTL words (e.g. a chat between expats from RTL
 > countries), as well as on scholarly LTR text that uses some RTL words
 > for precision (e.g. a discussion on biblical topics).
 >
 > ...
 >
 > I have suggested another alternative in the past, but it did not merit any
 > replies at all :-(. Perhaps I should give it another try. Basically, the
 > idea is to unify first-strong and word-count under the disguise of
 > giving the page author control over how much of the content is scanned:
 >
 > - The syntax for the dir value is "ltr|rtl|auto[0-9]*" or some more
 > palatable version.
 > - All the auto values always use word-count - but stop after scanning
 > the first n strongly-directional words. Thus, by using a number, the
 > page author specifies how thorough a job the estimation should do.
 > - As a result, "auto1" is almost exactly identical to first-strong! The
 > exception is that "weak ltr" values, e.g. "(212) 123 4567", wind up
 > being treated as LTR, which is a good thing. Thus, we wind up exposing
 > first-strong as well as word-count.
 > - Plain "auto" is a synonym for either "auto1" or some likely
 > word-county value, e.g. "auto64" - TBD.

That's an interesting approach. I'll take it back and ask the Mozilla
folks what they think.

My only comment atm is "define word". I think it would be much less
ambiguous to count by character.
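
For concreteness, here is a rough Python sketch of the two heuristics
quoted above: the 64-character any-RTL check and the "auto<n>" word
count. It is only an illustration under my own assumptions, not anything
from the proposal text: a "word" is taken to be a whitespace-delimited
token classified by its first strong character, RTL wins only on a strict
majority, and the function names (any_rtl, auto_n) are made up here.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes
LTR_CLASSES = {"L"}        # strong left-to-right bidi class


def strong_direction(ch):
    """Return 'rtl' or 'ltr' for a strong character, else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in RTL_CLASSES:
        return "rtl"
    if bidi_class in LTR_CLASSES:
        return "ltr"
    return None


def any_rtl(text, window=64):
    """Any-RTL: scan `window` characters starting at the first strong one."""
    for i, ch in enumerate(text):
        if strong_direction(ch) is not None:
            scanned = text[i:i + window]
            break
    else:
        return "ltr"  # no strong character anywhere
    if any(strong_direction(c) == "rtl" for c in scanned):
        return "rtl"
    return "ltr"


def word_direction(word):
    """Classify a word by its first strong character (an assumption)."""
    for ch in word:
        direction = strong_direction(ch)
        if direction is not None:
            return direction
    return None  # e.g. digits and punctuation only


def auto_n(text, n=None):
    """Word count over the first n strongly-directional words.

    n=None scans everything. Ties and text with no strong words fall
    back to LTR (also an assumption).
    """
    counts = {"ltr": 0, "rtl": 0}
    seen = 0
    for word in text.split():  # assumption: words are whitespace-delimited
        direction = word_direction(word)
        if direction is None:
            continue
        counts[direction] += 1
        seen += 1
        if n is not None and seen >= n:
            break
    return "rtl" if counts["rtl"] > counts["ltr"] else "ltr"


peppered = "we met at the \u05e9\u05d5\u05e7 yesterday and then walked home"
print(any_rtl(peppered))           # one RTL word inside the window flips it
print(auto_n(peppered, 1))         # behaves like first-strong
print(auto_n(peppered, 64))        # majority of strong words
print(auto_n("(212) 123 4567", 1))  # no strong words at all
```

Counting by character instead would mean dropping the text.split() step
and tallying strong_direction() per character, which avoids having to
define "word" in the first place.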

 > >  A related problem is block elements that are rendered as display: inline.
 > > The CSSWG has an issue filed on having these default to unicode-bidi: embed
 > > in our sample HTML4 style sheet. (Looking at your document, they might also
 > > need bdi=yes.)
 >
 > If I remember correctly, all browsers until very recently treated text
 > in a <div style="display:inline"> as if it were in a <span>. I now see
 > that Firefox 3.6 has indeed broken ranks and is now treating it as if it
 > were a <span style="unicode-bidi:embed">. I am not sure what the
 > rationale for this is. IMO, using unicode-bidi:embed without specifying
 > direction, i.e. defaulting to the parent's direction, is a technique in
 > search of a purpose. I think I liked the old behavior better.

The old behavior would cause things like

   abcd.
   <div style="display:inline">efg HIJK.</div>
   LMNOP qrst.

which is normally rendered as
   abcd.
   efg KJIH.
   PONML qrst.
to be rendered as
   abcd. efg PONML .KJIH qrst.

Since the author chose a block-level element, it's reasonable to assume
that they meant for that string of text to stay together as one unit.

Applying "unicode-bidi: embed" to all HTML block-level elements was
actually in an earlier version of the CSS2 sample style sheet for HTML.
I suspect it was removed because it's redundant with "display: block" --
but it becomes relevant when the display type is changed. The CSSWG
currently has an open issue about adding it back.

 > I agree that it would be best to come up with an explicit spec of what
 > the expected bidi behavior is in all these strange combinations of
 > element type and display type.

Yes.

 >     # The bidi formatting characters LRO, RLO, LRE, RLE, and PDF have
 >     # particularly strong influence on what surrounds them.
 >
 >     This is somewhat overstated and vague. I would suggest something like
 >
 >     | The bidi formatting characters LRO, RLO, LRE, RLE, and PDF can
 >     | fine-tune the bidi algorithm by either overriding the implicit
 >     | directionality of characters (LRO, RLO) or creating an embedded
 >     | base directional context (LRE, RLE)
 >
 >
 > I'd rather stay vague and not get into the specific effects of these
 > characters, which has the price of too much detail on a relatively minor
 > point. I am not sure why you think that the current formulation is
 > overstated, though. I think that these characters indeed do have a
 > particularly strong influence. For example, in an LTR context, putting
 > an RLE before "please tell JOHN to go to school" changes its display to
 > "to go to school NHOJ please tell"; putting a simple strong RTL
 > character, e.g. RLM, has no effect at all. But in all examples where RLM
 > does have an effect, RLE will have at least as much effect. RLO, of
 > course, speaks for itself.

They have a strong effect on the text in between the initial control code
and its PDF. But their effect on surrounding characters is no different
from a typical strongly-directional character.

If the details of these characters are not important to this section,
then I suggest moving the paragraph to somewhere more relevant: I think
a brief explanation like the above would help with understanding other
parts of this document. A description of the RLM and LRM characters
would be more helpful for this particular issue.
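
As a small aid for that kind of description, here is a Python sketch
that lists the characters in question with their code points and shows
the bidi class Unicode assigns to each. It does not implement the UBA or
any reordering; the helper names (embed, override, bidi_classes) are my
own and purely illustrative.

```python
import unicodedata

# Explicit embedding and override controls, each closed by PDF.
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"
LRO, RLO = "\u202d", "\u202e"
# Invisible marks that act like ordinary strong characters.
LRM, RLM = "\u200e", "\u200f"


def embed(text, direction):
    """Wrap text in LRE/RLE ... PDF to open an embedded directional context."""
    opener = RLE if direction == "rtl" else LRE
    return opener + text + PDF


def override(text, direction):
    """Wrap text in LRO/RLO ... PDF to force the direction of its characters."""
    opener = RLO if direction == "rtl" else LRO
    return opener + text + PDF


def bidi_classes(text):
    """Return the Unicode bidi class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


print(bidi_classes(embed("abc", "rtl")))
print(bidi_classes(override("abc", "rtl")))
print(bidi_classes(RLM + "abc" + LRM))
```

The embedding and override controls have bidi classes of their own,
while RLM and LRM are classed as plain R and L, i.e. as typical
strongly-directional characters.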

 > > # Arbitrary-direction entities also don't cause a problem when they
 > >  # are displayed as a separate block element (which is treated as a
 > >  # separate "paragraph" in UBA terms).
 > >
 > >  Append
 > >    | and the base direction is correctly marked up with the HTML 'dir'
 > >    | attribute.
 >
 > How about:
 >
 > Arbitrary-direction entities also don't cause a problem when they are
 > displayed as a separate block element (which is treated as a separate
 > "paragraph" in UBA terms), with a dir attribute declaring their correct
 > base direction.

Only if you remove the comma -- it is a qualifying prepositional phrase,
not an incidental description.

~fantasai
Received on Wednesday, 17 March 2010 21:33:33 UTC