[Bug 10808] i18n comment 2 : new dir attribute value: auto, and a new attribute: autodirmethod

http://www.w3.org/Bugs/Public/show_bug.cgi?id=10808

--- Comment #17 from Aryeh Gregor <Simetrical+w3cbug@gmail.com> 2010-10-19 18:33:34 UTC ---
(In reply to comment #14)
> The only use case given in this bug so far is the one in comment 3, which as
> far as I can tell is the same as the use cases given in bug 10807. If there are
> other use cases to consider here, such as the ones in comment 12, then please
> describe them, ideally with URLs pointing to real Web pages showing those use
> cases, so that I can study them. It's impossible to evaluate proposals without
> concrete use cases.

The use-cases are entirely different.

Bug 10807 is about wanting isolation: when multiple logically distinct strings
that might differ in direction are part of the same UBA paragraph, the UBA
needs to be told that they're logically isolated so that part of one and part
of another don't get mixed together into one run.  E.g., with uppercase
standing in for Hebrew letters:

Logical:        my favorite hebrew letters are A, B, and C
Correct visual: my favorite hebrew letters are A, B, and C
Actual visual:  my favorite hebrew letters are B, A, and C

This bug has nothing to do with isolation.  We're talking only about blocks
here, and blocks are always isolated from one another.  What we want here is
some way to auto-detect the direction of a block.  E.g., if there's a textarea
where users might type in either English or Hebrew, then if the user starts
typing in Hebrew, it should automatically switch to RTL so that the cursor
doesn't jump around crazily as you type.  But it shouldn't switch to RTL when
the user types in English.

(I encourage you to try this out.  Go to
data:text/html,<textarea dir=rtl></textarea>
and type a few sentences in English.  That's what you get
when you try to type in Hebrew on any LTR site, i.e., practically any site. 
But this isn't just textareas, it also applies to any block content of unknown
direction.)


Here's my sketch of a proposal for fixing this.  Add a new value for dir,
dir=auto.  This is logically equivalent to saying that the element doesn't have
a known direction, and the direction should be determined automatically.  In
terms of CSS, it should translate to
[dir=auto] { direction: auto; unicode-bidi: embed; }.

The CSS "direction: auto" would be defined something like this.  For each UBA
paragraph, namely each "sequence of inline boxes uninterrupted by a forced line
break or block boundary" (quote from CSS 2.1), if the containing block's
computed value of direction is "auto", that paragraph has its direction
determined heuristically.  The heuristic might be as follows:

1) If the content is modifiable by the user, like <input> or <textarea>, decide
direction based on the first strong-directionality character entered.

2) Otherwise, look at the first X Unicode code points, and if at least Y% are
strong RTL, it's RTL; else, LTR.  In practice, X might be infinity if that's
okay with implementers, and Y probably something like 30.  (X = infinity might
cause jumping if the content is loaded incrementally, but in practice that's
unlikely, as Aharon notes.)
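
To make the two rules concrete, here is a minimal Python sketch.  The function
names, the "ltr" fallback for text with no strong characters, and reading "Y%"
as a share of all examined code points (rather than of strong characters only)
are assumptions for illustration, not part of the proposal.

```python
import unicodedata

# Unicode bidi classes treated as strong: "R"/"AL" are RTL, "L" is LTR.
STRONG_RTL = {"R", "AL"}
STRONG_LTR = {"L"}


def first_strong_direction(text, default="ltr"):
    """Rule 1: use the direction of the first strong character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in STRONG_RTL:
            return "rtl"
        if bidi in STRONG_LTR:
            return "ltr"
    return default  # assumption: no strong characters means LTR


def threshold_direction(text, max_code_points=None, min_rtl_percent=30):
    """Rule 2: RTL if at least min_rtl_percent of the first
    max_code_points code points are strong RTL (None = no limit)."""
    sample = text if max_code_points is None else text[:max_code_points]
    if not sample:
        return "ltr"
    rtl = sum(1 for ch in sample if unicodedata.bidirectional(ch) in STRONG_RTL)
    return "rtl" if rtl * 100 >= min_rtl_percent * len(sample) else "ltr"


def auto_direction(text, editable):
    """Pick rule 1 for user-modifiable content, rule 2 otherwise."""
    if editable:
        return first_strong_direction(text)
    return threshold_direction(text)
```

Here max_code_points plays the role of X and min_rtl_percent the role of Y.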

Note that if multiple UBA paragraphs are contained in a single dir=auto
element, like with textarea or pre, they might have different directions.  This
is the same as if they started with an appropriate control character, so should
be no big problem.
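
A rough Python sketch of that equivalence follows.  It assumes the "appropriate
control character" is LRM (U+200E) or RLM (U+200F), that paragraphs are
separated by "\n", and it uses first-strong as a placeholder heuristic; all
function names are made up for illustration.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R


def first_strong(line):
    """Return "rtl" or "ltr" for the first strong character, else None."""
    for ch in line:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return None


def paragraph_directions(text, fallback="ltr"):
    """Estimate a direction separately for each line-separated paragraph."""
    return [first_strong(line) or fallback for line in text.split("\n")]


def mark_paragraphs(text, fallback="ltr"):
    """Prefix each paragraph with the mark matching its estimated direction,
    so that a plain first-strong paragraph rule yields the same result."""
    lines = text.split("\n")
    directions = paragraph_directions(text, fallback)
    return "\n".join(
        (RLM if direction == "rtl" else LRM) + line
        for line, direction in zip(lines, directions)
    )
```

Any other per-paragraph heuristic could be substituted for first_strong without
changing the shape of this.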

As to whether this should be part of CSS or HTML -- if direction: rtl/ltr
remains conforming, then so should this.  If controlling directionality from
CSS is really always a bad thing, then have CSS make the property
non-conforming, and move the processing model to HTML.  In the latter case,
HTML might still define the property in terms of CSS, but specify that certain
properties or values are to be ignored outside of UA stylesheets, or something
like that.


(In reply to comment #16)
> 1. Estimating the direction of each UBA paragraph separately has a price.

Namely?

> 2. The use cases are limited to <textarea> and <pre>.

True, if those are the only HTML elements that can contain multiple UBA
paragraphs, but there's no reason not to specify that behavior across the board
for simplicity.

> Let's take a specific example:
> 
> <div dir=auto>
>   some ltr text.
>   <div>
>     SOME RTL TEXT.
>   </div>
>   SOME MORE RTL TEXT.
> </div>
> 
> There are three UBA paragraphs here: the text before the internal div, the text
> inside it, and the text after it. What you want is to have the first displayed
> in LTR, and the others in RTL, and are puzzled why dir=auto is defined to give
> them all the same direction (for autodirmethod values other than plaintext).

In my proposal, both divs have a computed direction value of "auto", so all
three UBA paragraphs are in a containing block whose computed direction value
is "auto".  Therefore the first will be LTR, the second RTL, the third RTL
(leaving aside the question of what heuristic to use).  IMO, this is the
expected and correct behavior.

> Now, the use cases. It is indeed possible to have multi-paragraph plain text
> that can only be rendered well by assigning each of its UBA paragraphs its own
> direction (as explicitly suggested by the UBA). However, such plain text is
> limited to <textarea> and <pre> elements. <textarea> does not allow mark-up at
> all, so the problem described above does not apply to it; <pre> is allowed to
> contain some mark-up, but being pre-formatted, it is not expected to contain
> the layout-modifying mark-up of the sort that bothers us. This is the use case
> for autodirmethod=plaintext, which does per-paragraph estimation like you want,
> but is not expected to handle well direction-dependent CSS within it.

Why shouldn't it handle direction-dependent CSS within it well?

> On the other hand, I do not see a use case for the dir=auto in the example
> above to automatically apply independently to the internal div. If the author
> wants auto-estimation on the internal div, let him put dir=auto on the internal
> div. For example, if you are embedding a piece of complicated HTML that you did
> not author in your page, and you do not know the direction in which this piece
> of HTML is supposed to be displayed, put a <div dir=auto> around that piece of
> HTML. If inside it there are smaller pieces that have a different direction, it
> was the job of the HTML's original author to indicate this within the HTML,
> e.g.  with dir=auto elements around those smaller pieces.

So are you saying that if I want all of my direction to be automatically
determined, then I have to repeat dir=auto on every single block element
instead of just specifying it once on html or body?  That doesn't make sense at
all to me.  What I'd like to see is people putting dir=auto on the root
elements of all their pages, so that everything magically works as expected in
almost all cases (and you can explicitly override directionality in
exceptions).

Inserting HTML from an unknown source where the whole chunk must have the same
directionality but the overall directionality is unknown is not at all an
important use-case, IMO.  When would this come up in practice?

> The reason they exist is not to make it easier for the platform, but because
> different approaches give better results for different kinds of content.

Are authors better situated to figure out which is appropriate when, or browser
implementers?  I suspect the latter.  Authors should not have to understand
Unicode bidi to use dir=auto -- they should be able to slap it on their pages
and have things work right across the board.  Ideally this should be the
platform default, in fact -- the only reason to do otherwise is legacy
compatibility, if that.

> First-strong has a serious flaw: RTL text very often contains LTR words and
> phrases (e.g. acronyms and brand names) and even fairly often starts with them,
> e.g. "html IS A WONDERFUL PLATFORM". I therefore tend to prefer any-rtl for
> most cases. However, in an input box, first-strong does have the advantage of
> being easier for the user to surmise and control. Thus, I would say, if you
> have content you are obtaining via an input box, use first-strong (both on the
> input box and the elements that are then used to display those values). But if
> you are  displaying text of unknown origin, any-rtl is a better bet.

Why is first-strong better even on the element used to display the value?  Why
not use first-strong when the user inputs the text, but any-rtl (or some
variant, maybe Y% RTL in the first X characters, as in my heuristic above) when
the text is subsequently
displayed?  Surely first-strong is very unlikely to produce more correct
results than an any-rtl variant in practice, if the whole beginning of the
contents is available.
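
To illustrate, here is a small Python sketch comparing the three methods on two
sample strings.  It assumes any-rtl means "RTL if any strong RTL character
appears anywhere", uses the 30% figure from my heuristic above, and counts the
percentage over all code points; the sample strings and function names are
invented for the comparison.

```python
import unicodedata


def is_strong_rtl(ch):
    return unicodedata.bidirectional(ch) in ("R", "AL")


def is_strong_ltr(ch):
    return unicodedata.bidirectional(ch) == "L"


def first_strong(text):
    """Direction of the first strong character; LTR if there is none."""
    for ch in text:
        if is_strong_rtl(ch):
            return "rtl"
        if is_strong_ltr(ch):
            return "ltr"
    return "ltr"


def any_rtl(text):
    """RTL if any strong RTL character occurs at all."""
    return "rtl" if any(is_strong_rtl(ch) for ch in text) else "ltr"


def percent_rtl(text, min_percent=30):
    """RTL if at least min_percent of all code points are strong RTL."""
    if not text:
        return "ltr"
    rtl = sum(1 for ch in text if is_strong_rtl(ch))
    return "rtl" if rtl * 100 >= min_percent * len(text) else "ltr"


def compare(text):
    return {
        "first-strong": first_strong(text),
        "any-rtl": any_rtl(text),
        "percent-rtl": percent_rtl(text),
    }


# A Hebrew sentence that starts with a Latin word, like the quoted example.
RTL_STARTING_WITH_LATIN = (
    "html \u05d4\u05d9\u05d0 \u05e4\u05dc\u05d8\u05e4\u05d5\u05e8\u05de\u05d4"
    " \u05e0\u05d4\u05d3\u05e8\u05ea"
)
# An English sentence that quotes a single Hebrew word.
LTR_QUOTING_HEBREW = "The Hebrew word \u05e9\u05dc\u05d5\u05dd means peace"

if __name__ == "__main__":
    for sample in (RTL_STARTING_WITH_LATIN, LTR_QUOTING_HEBREW):
        print(compare(sample))
```

On the first sample only first-strong picks LTR; on the second only any-rtl
picks RTL.  The percentage variant gets both right here, though two samples
prove nothing about real-world content.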

> BTW, flips are
> also still possible but unlikely for first-strong, since the element could
> start with an arbitrary amount of neutral content.

True.

> Better estimation algorithms can and will be invented. The reason we are
> currently only dealing with first-strong, any-rtl, and plaintext is that they
> are well-known, tried, and easily defined and implemented. If and when a much
> better algorithm is invented and proven, we want to be able to support it. That
> does not mean that existing content that was created with and works for an
> older estimation method should be potentially broken by applying the new
> estimation algorithm to it without being asked to do so. This is exactly why we
> have autodirmethod. We can extend the repertory of its values without making
> them the default for existing content.

I don't think we need to worry about future-proofing much.  We can always add
new dir values at a future date, for example, or new attributes, or whatever,
in the unlikely event that someone comes up with a brilliant new algorithm. 
However, I don't think authors should be asked to deal with the complexity of
choosing different autodirmethods for different types of content, if we can do
a good enough job heuristically.  Does the heuristic I describe above sound
like it would fail a significant fraction of the time on real-world content?

> I tend to agree, but not everyone does. A discussion worth having, although it
> would have been better if it had already taken place in public-i18n-bidi before
> the bugs were filed on HTML5.

I'd say the contrary, that it's better to have these things widely discussed as
early as possible.  i18n experts should come up with use-cases, and then they
should work with web experts (browser implementers, spec editors, etc.) from
day one on the solutions.  i18n experts coming up with entire proposed
solutions and only then presenting them to web experts will result in a lot of
them getting shot down and rewritten from scratch, as has in fact happened on a
number of these bugs.

Received on Tuesday, 19 October 2010 18:33:43 UTC