Fwd: A Proposal for HTML Improvements for Bidi, Part 1: Bidi Aspects of Existing HTML Features

Seems to be useful input for the HTML WG too (forwarding the “rich” version
to avoid loss of some links):



---------- Forwarded message ----------
From: Aharon (Vladimir) Lanin <aharon@google.com>
Date: Mon, Oct 26, 2009 at 10:50 PM
Subject: A Proposal for HTML Improvements for Bidi, Part 1: Bidi Aspects of
Existing HTML Features
To: www-international@w3.org


The following is the first part of a proposal for small improvements in HTML
handling that should help make it easier to author quality bidi HTML pages
and web applications. It is based on the issues that have repeatedly come up
during efforts to add bidi support to various Google products that my team
has been charged with aiding over the past two years.

The current version of this proposal is also available at
http://docs.google.com/Doc?id=dd6f586t_19dg4pkqqc

Aharon Lanin
Google Israel

*A Proposal for HTML Improvements for Bidi*
*Part 1: Standardizing Bidi Aspects of Existing HTML Features*

Preliminaries:

   - UBA: the Unicode Bidi Algorithm <http://unicode.org/reports/tr9/>.
   - LTR: left-to-right
   - RTL: right-to-left
   - Text displayed in the wrong directionality is often garbled. For
   example, the LTR value "10 Main St." is displayed in RTL as ".Main St 10".

*1.1. <br>, <hr>, and embedded block elements should "reset" bidi state*

*Background*

The UBA's sections 3.3.1 and 3.3.2 require that the bidi state be completely
reset at a "paragraph break". This means that strongly-directional text
(e.g. Latin or Hebrew letters) and explicit bidi formatting characters (e.g.
LRE and RLE) in one paragraph have no effect on the formatting of the text
in the next paragraph and vice-versa. However, this requirement leaves the
definition of a "paragraph" up to the implementation. Most plain-text
environments implementing the UBA (e.g. Microsoft Word, Windows Notepad,
GNOME gedit, OSX textedit etc.) treat newline and other line-breaking
characters as a paragraph break for UBA purposes.

*The Problem*

In HTML, it is well accepted that a block element constitutes a UBA
paragraph. However, there is no uniformity in the treatment of <br>,  <hr>,
and embedded block elements (e.g. <div></div>) in this respect.

Firefox and Opera completely ignore them. As a result, when rendering "1.
His Hebrew name is אאא.<br>2. בבב is a friend of his." (in an LTR context),
they treat the "אאא. 2. בבב" as a single RTL run, and thus put the "2" on
the *right* of the "בבב", with the result looking like:


The result is unreadable. This behavior is even stranger when instead of a
<br>, you have a tall <div>, so the effect of the "אאא." is felt again
somewhere far down the page. Failing to treat <br>, etc. as a UBA paragraph
break goes against the spirit of the UBA - but not its letter.

Similarly, in "You can use RLO to make English text go
&#x202E;RIGHT-TO-LEFT.<br>But you don't have to.", the unterminated RLO is
allowed to exert its influence beyond the <br>, reversing the characters in
the next line too:


IE and WebKit, on the other hand, treat <br>, <hr>, and embedded block
elements as Unicode Bidi Algorithm paragraph breaks, making it easier to
author bidi HTML documents.

However, WebKit currently goes too far, with a <br> terminating the effects
of all directionality levels, including that specified using HTML or CSS on
the ancestor inline elements, e.g. <span dir=...>. As a result, it displays
"<div dir=rtl><span dir=ltr>1. Hello!<br>2. Goodbye!</span></div>" with the
second line in RTL:


While this does conform to the literal definition of what a paragraph break
is supposed to do according to the UBA, it goes against the spirit of HTML.
Attempts to fix WebKit in this regard are stymied by the lack of a mandated
specification and are reduced to guessing what exactly it should do.

IE, on the other hand, seems to terminate the effect of any LRE, RLE, LRO,
and RLO formatting codes, but not the effect of the dir attributes of
ancestor elements, which seems like a reasonable approach. What exactly it
does, however, is undocumented.

*The Proposed Solution*

The HTML specification should state that any sort of line break - e.g.
<br>, <hr>, and embedded block elements - should be treated as a UBA
paragraph break. However, the directionality embedding levels stemming from
the direction specified on ancestor elements via mark-up (dir attribute,
<bdo> element) or CSS up to the closest ancestor block element should then
be re-opened in the same order at the start of the new paragraph. This
re-opening of embedding levels is allowed by the UBA's section 4.3, HL3.
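
To illustrate the intent, here is a minimal JavaScript sketch of the
re-opening step. It is not taken from any browser's code: the function name
reopenEmbeddings, its arguments, and the use of LRE / RLE + PDF to stand in
for the embedding levels of ancestor inline elements are all assumptions made
for illustration. <bdo> ancestors (which would map to LRO / RLO) are left out
for brevity.

```javascript
// Unicode explicit bidi formatting characters.
const LRE = "\u202A"; // LEFT-TO-RIGHT EMBEDDING
const RLE = "\u202B"; // RIGHT-TO-LEFT EMBEDDING
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING

// Hypothetical helper (not a browser API).
// `segments`: the pieces of inline content separated by <br> etc.
// `ancestorDirs`: dir values of the inline ancestors up to the closest
// block element, outermost first.
function reopenEmbeddings(segments, ancestorDirs) {
  // Open one embedding per ancestor, in document order.
  const open = ancestorDirs.map((d) => (d === "rtl" ? RLE : LRE)).join("");
  // Close them all again at the end of the segment.
  const close = PDF.repeat(ancestorDirs.length);
  // Each segment is its own UBA paragraph with the same levels re-opened.
  return segments.map((s) => open + s + close);
}
```

For the WebKit example above, reopenEmbeddings(["1. Hello!", "2. Goodbye!"],
["ltr"]) wraps each of the two lines in its own LRE + PDF pair, so the second
line stays LTR inside the RTL <div>.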

*1.2. newline and other line-breaking characters should "reset" bidi state
in <textarea>, <pre> and script dialog text.*

*Background*

As in 1.1 above.

*The Problem*

IE and WebKit treat newlines as a UBA paragraph break in all these
contexts. Firefox, however, does not treat it as such in any of them, while
Opera treats it as such in <textarea> and dialog text, but not in <pre>. As
a result, Firefox and Opera display "<pre>1. His Hebrew name is אאא.&#x0A;2.
בבב is a friend of his.</pre>" as


*The Proposed Solution*

The HTML specification should state that any sort of line-breaking character
- e.g.  &#x0A;, &#x0D; - in <textarea>, <pre>, and script dialog text should
be treated as a UBA paragraph break. However, in <pre>, the directionality
embedding levels stemming from the direction specified on ancestor elements
via mark-up (dir attribute, <bdo> element) or CSS up to the closest ancestor
block element should then be re-opened in the same order at the start of the
new paragraph. This re-opening of embedding levels is allowed by the UBA's
section 4.3, HL3.
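
The "reset" can be pictured with a small JavaScript sketch that tracks the
depth of unterminated LRE, RLE, LRO, and RLO characters and starts again from
zero at every line-breaking character. The function name
embeddingDepthPerParagraph is hypothetical, and the sketch ignores
strongly-directional text and the UBA's level limits; it only shows where
explicit formatting stops having an effect.

```javascript
// LRE, RLE, LRO, RLO: the characters that open an embedding or override.
const OPENERS = new Set(["\u202A", "\u202B", "\u202D", "\u202E"]);
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING

// Hypothetical helper: for each paragraph of `text`, return how many
// embeddings/overrides are still open when the paragraph ends.
function embeddingDepthPerParagraph(text) {
  return text.split(/\r\n|\r|\n/).map((paragraph) => {
    let depth = 0; // state is reset at every paragraph break
    for (const ch of paragraph) {
      if (OPENERS.has(ch)) depth += 1;
      else if (ch === PDF && depth > 0) depth -= 1;
    }
    return depth;
  });
}
```

For "go \u202ERIGHT-TO-LEFT.\nBut you don't have to." this returns [1, 0]:
the RLO is left open at the end of the first line, but the second line starts
with nothing open.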

*1.3. <title> and script dialogs should use the page's directionality*

*Background*

The W3C recommends <http://www.w3.org/TR/i18n-html-tech-bidi/#ri20030112.214820604> that
in HTML, the directionality of text be declared using the dir
attribute, avoiding the use of Unicode formatting characters LRE, RLE, and
PDF except where the dir attribute is inapplicable.

*The Problem*

One would expect that the page's directionality set using <html dir=...>
would apply to the page's <title>, as well as to the text of the page's
script dialogs (alert(), confirm(), etc.). Unfortunately, however, this is
not the case in any major browser - not Firefox, Chrome, Safari, or Opera.
IE6 and IE7 used to apply <html dir=...> to dialog script text, but this is
no longer the case in IE8. The directionality context all these browsers use
for <title> and dialog text is either the OS or the browser chrome's default
directionality, which neither the server nor page scripts can even
determine, let alone control.

Since a value displayed in the wrong directionality can come out garbled,
RTL pages wind up having to wrap their RTL <title> and dialog text in RLE +
PDF characters. On the other hand, LTR pages dare not wrap their LTR <title>
and dialog text in LRE + PDF characters for correct display on RTL systems,
since most computers in the world are running an LTR OS without RTL script
support turned on, and thus display LRE and PDF as rectangles. Furthermore,
these formatting characters are little-known, lack named entities, and are
generally undesirable in HTML documents.

*The Proposed Solution*

The HTML specification should state that dialog text will be displayed in
the <html> element's directionality, and the <title> in its own
directionality, whether set directly on the <title> element itself or
inherited from an ancestor. It is desirable to allow the dir attribute on
<title> itself for cases where the title happens to be in a different
language than the overall page and thus may not match the page's overall
directionality, but it is not nearly as important as at least applying the
<html>'s directionality.

It is easy enough for a browser to implement this, since it knows the
default directionality context in which the text will be displayed. If and
only if this differs from the desired directionality, the browser needs to
wrap (each paragraph of) the text in question in RLE + PDF when RTL is
desired and LRE + PDF when LTR is desired.
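
As a rough JavaScript sketch of that rule (the function name wrapForContext
and its parameters are hypothetical, and splitting paragraphs on "\n" alone
is a simplifying assumption):

```javascript
const LRE = "\u202A"; // LEFT-TO-RIGHT EMBEDDING
const RLE = "\u202B"; // RIGHT-TO-LEFT EMBEDDING
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING

// Hypothetical helper: prepare <title> or dialog text for display.
// `desiredDir`: the directionality the page asked for ("ltr" or "rtl").
// `contextDir`: the default directionality of the OS / browser chrome.
function wrapForContext(text, desiredDir, contextDir) {
  // Wrap if and only if the context differs from what is desired.
  if (desiredDir === contextDir) return text;
  const open = desiredDir === "rtl" ? RLE : LRE;
  // Each paragraph gets its own wrapper, since bidi state resets
  // at paragraph breaks.
  return text
    .split("\n")
    .map((paragraph) => open + paragraph + PDF)
    .join("\n");
}
```

Because the wrapping happens inside the browser and only when needed, an LTR
page shown on an LTR system never carries the formatting characters at all.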

*1.4. title and alt attributes should use the element's directionality*

*Background*

As in 1.3 above.

*The Problem*

Currently all major browsers (IE, FF, Chrome, Safari, Opera) display
tooltips stemming from the title and alt attributes in the directionality of
the element where they appear, but this does not appear to be formally
specified anywhere. Furthermore, this consensus seems fragile because in
principle, the directionality of an element and the text of its tooltip do
not have to coincide. Here is a reasonable counterexample: a Hebrew web page
displaying an English address with a Hebrew tooltip meaning "address" would
use "<span id=address title=כתובת dir=ltr>10 Main St.</span>".

Until recently, Chrome displayed tooltips in the OS / browser's default
directionality. When fixing this bug, the initial inclination was to apply
only the page's directionality, not the element's, due to the "in principle"
consideration above.

Apparently not trusting browser behavior, the W3C suggests
<http://www.w3.org/TR/i18n-html-tech-bidi/#tech-tooltips-etc> that
tooltip directionality may have to be set using LRE | RLE + PDF. This
is actually quite difficult to do properly, since wrapping an LTR tooltip in
LRE + PDF just in case the browser winds up displaying it in an RTL context
will result in the LRE and PDF displaying as rectangles on LTR OS's without
RTL support enabled, i.e. the vast majority of computers.

*The Proposed Solution*

The HTML specification should state that title and alt attribute text will
be displayed in the element's directionality.

Although counterexamples such as the one above can be found, tooltip text
usually has the same directionality as the element's text, in the not very
frequent cases where the element has text at all. For such
counterexamples, there is a simple workaround in the form of putting the
tooltip on an extra element wrapping the original one, e.g. "<span
title=כתובת><span id=address dir=ltr>10 Main St.</span></span>".

The alternatives are even less desirable. Having the tooltip use only the
page's directionality increases the need to use LRE | RLE + PDF. And
defining new alt_dir and title_dir attributes seems wasteful.

*1.5. <option> should support the dir attribute, and be displayed that way
in both the dropdown and after being chosen*

*Background*

As in 1.3 above.

*The Problem*

In a single <select>, the values of different options may have different
directionalities. Currently, however, out of all major browsers, only FF
supports the dir attribute on <option>, and does so poorly: once the value
is chosen, it is displayed in the <select>'s directionality.

IE and Opera display all options in the <select>'s directionality.

Safari automatically estimates the directionality of each option and
displays it as such both in the dropdown and after it has been chosen
regardless of the <select>'s directionality (which is only used to place the
down-arrow button and to align the values). This is all very nice, but
directionality estimation algorithms do make mistakes, so it would be good
to be able to specify the actual dir value for a given <option> - and Safari
does not support that.

Chrome does not support the dir attribute on <option> and is on its way to
doing what Safari does.

As a result, the only practical way to specify <option> value directionality
is using LRE | RLE + PDF, which is cumbersome.

*The Proposed Solution*

The HTML specification should state that setting an explicit directionality
on <option> should determine the way it is displayed in both the dropdown
and after being chosen. Using auto-estimated directionality is allowed when
the <option> element does not have an explicitly specified directionality.
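
The precedence can be sketched in JavaScript as follows. The function names
are hypothetical, and the estimator is a deliberately crude "first strong
character" heuristic with approximate character ranges; it is not Safari's
algorithm, which I have not examined.

```javascript
// Rough approximations: Hebrew, Arabic and neighbouring RTL blocks plus
// their presentation forms count as RTL; Latin letters count as LTR.
const RTL_CHAR = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFC]/;
const LTR_CHAR = /[A-Za-z\u00C0-\u024F]/;

// Hypothetical estimator: direction of the first strong character found.
function estimateDirection(text) {
  for (const ch of text) {
    if (RTL_CHAR.test(ch)) return "rtl";
    if (LTR_CHAR.test(ch)) return "ltr";
  }
  return null; // no strong character, e.g. digits only
}

// Hypothetical resolver: an explicit dir on the <option> always wins;
// estimation is used only when no valid dir was specified, falling back
// to the <select>'s direction when estimation finds nothing.
function resolveOptionDirection(explicitDir, text, selectDir) {
  if (explicitDir === "ltr" || explicitDir === "rtl") return explicitDir;
  return estimateDirection(text) ?? selectDir;
}
```

The same resolved value would be used both in the dropdown and for the chosen
value, which is the part Firefox currently gets wrong.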

*1.6. <input type="text"> and <textarea> should support compatible "set
direction" functionality*

*Background*

Garbling by incorrect directionality applies to text being entered by the
user in an input control, too. In fact, entering text of directionality
opposite to the input is an unpleasant experience even if the full text does
not wind up being garbled, due to the cursor jumping around during data
entry and difficulty in selecting text. Some means for the user to set the
directionality of the input, and for page scripts to be informed of this
choice so the text's intended directionality can be stored is thus highly
desirable.

*The Problem*

All major browsers provide some way for the user to set the directionality
of each <input type="text"> and <textarea> element, e.g. via "hot keys".
However, the way this functionality interacts with page scripts varies
drastically between browsers.

IE: The "hot keys" are CTRL + LEFT SHIFT for LTR and CTRL + RIGHT SHIFT for
RTL. (These key combinations are also adopted for this purpose by most
Microsoft products, e.g. Windows dialogs, notepad and Word.) They set the
value of the element's dir attribute, which is then available to scripts.
They trigger the onpropertychange event, at which time the dir value is
already changed. They trigger onkeyup, but *before* the dir value has been
changed, so setTimeout(0) has to be used to get the updated dir value. They
do not trigger onkeypress.

FF: The "hot key" is CTRL + SHIFT + X, which cycles through LTR and RTL. It
does *not* set the value of the element's dir attribute, and is thus
invisible to scripts.

Opera: The "hot keys" are CTRL + LEFT SHIFT for LTR and CTRL + RIGHT SHIFT
for RTL, as in IE. They do *not* set the value of the element's dir
attribute, and are thus invisible to scripts.

Chrome: The "hot keys" are CTRL + LEFT SHIFT for LTR and CTRL + RIGHT SHIFT
for RTL. They set the value of the element's dir attribute, which is then
available to scripts. They trigger the onkeyup event, at which time the dir
value is already changed. They do not trigger onkeypress or oninput. They do
not trigger onpropertychange, since this event exists only in IE.

Safari: Right-click on the <input> or <textarea> provides a "Set paragraph
direction" submenu. It is unclear whether "hot keys" can be configured.
Using "Set paragraph direction" sets the value of the element's dir
attribute, which is then available to scripts. However, it does not trigger
onkeyup, onkeypress, or oninput. It also doesn't trigger
onpropertychange, since this event exists only in IE.
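
Given these differences, a page script that wants to learn about a direction
change is reduced to something like the following JavaScript sketch, which
reads dir only after the keyup handler has returned, as IE requires. The
names makeDirWatcher, onChange, and defer are hypothetical, and defer is
injectable only so that the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: returns a keyup handler that reports dir changes.
// `element`: anything with a `dir` property (an <input> or <textarea>).
// `onChange`: callback receiving the new dir value.
// `defer`: how to postpone the read; defaults to setTimeout(fn, 0).
function makeDirWatcher(element, onChange, defer = (fn) => setTimeout(fn, 0)) {
  let lastDir = element.dir;
  return function handleKeyUp() {
    // In IE, dir has not been updated yet when keyup fires, so the
    // comparison is postponed until the current event has been handled.
    defer(() => {
      if (element.dir !== lastDir) {
        lastDir = element.dir;
        onChange(lastDir);
      }
    });
  };
}
```

In a page this might be attached with something like input.onkeyup =
makeDirWatcher(input, storeDirection), where storeDirection is whatever the
application uses to record the value. It cannot help in Firefox and Opera,
where dir is never set, or with Safari's context menu, which fires no key
events.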


*The Proposed Solution*

The HTML specification should state that some way to set the direction of
<input type=text> and <textarea> elements should be exposed to the user, and
using it will:

   - set the element's dir attribute accordingly
   - trigger onkeyup *after* the dir attribute has been set
   - trigger oninput; even though no actual input took place, the user did
   change the recommended interpretation of the input already collected
   - trigger onkeypress? I am not sure. But one way or the other, it should
   be specified.

Furthermore, it should be stated that on an OS that has a widespread
convention for setting direction (such as CTRL + LEFT SHIFT for LTR and CTRL
+ RIGHT SHIFT for RTL on Windows), the user agent will support that
convention (although it may provide other methods too).

*1.7. auto-completion should remember and use the directionality of each
value*

*Background*

Some browsers implement auto-completion, a feature whereby values previously
entered into an element like <input type=text> are remembered and under
certain conditions presented to the user in a dropdown. When the user
selects one of the items in the dropdown, this value is assigned to the
element. At different times, the user may enter values of different
directionality for the same input. The directionality of a value is set
either directly by the user through a "set direction" command exposed by the
browser (e.g. "hot keys", see 1.6 above) or by letting page scripts
automatically set the input's dir attribute after estimating the
directionality of the value on the fly.

*The Problem*

Browsers do not remember the directionality of previously-entered values.
Some display them in the dropdown in the OS or browser default
directionality. Some display them in the input's current directionality.
Finally, some display each value in its own estimated directionality. Each
of these will result in some values being displayed incorrectly; even the
last approach will sometimes fail because estimation algorithms do make
mistakes, and this may not have been the directionality originally set by
the user or page scripts.

After the user chooses a value from the dropdown, the value is usually
displayed in the input's current directionality, which may or may not be
correct for it.

*The Proposed Solution*

The HTML specification should state that if a user agent implements
auto-completion, it should store the last-used directionality for each
value. This may be the original directionality of the element, or may have
been set by the user for that value via directionality "hot keys", or may
have been set for that value by page scripts. When a value is displayed in
an auto-completion dropdown, it should be displayed in the directionality
stored for it. When a value is chosen by the user, the element's dir value
should be set to the directionality stored for it.
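
A minimal JavaScript sketch of such a store follows. The class name
DirectionalHistory and its methods are hypothetical and do not correspond to
any browser's actual auto-completion code; prefix matching is assumed only to
keep the example short.

```javascript
// Hypothetical auto-completion history that keeps the last-used
// directionality alongside each remembered value.
class DirectionalHistory {
  constructor() {
    this.dirByValue = new Map(); // value -> "ltr" | "rtl"
  }

  // Record a submitted value with the directionality in effect at the time,
  // whether it came from the element, the user's hot keys, or page scripts.
  remember(value, dir) {
    this.dirByValue.delete(value); // re-insert so newest entries come last
    this.dirByValue.set(value, dir);
  }

  // Entries for the dropdown; each carries the dir to display it in.
  suggestions(prefix) {
    const out = [];
    for (const [value, dir] of this.dirByValue) {
      if (value.startsWith(prefix)) out.push({ value, dir });
    }
    return out;
  }

  // The user picked a value: assign it and restore its stored dir.
  choose(element, value) {
    element.value = value;
    const dir = this.dirByValue.get(value);
    if (dir) element.dir = dir;
  }
}
```

The point of the sketch is that the directionality travels with the value
from the moment it is stored, so nothing has to be re-estimated later.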



-- 
Jens Meiert
http://meiert.com/en/

Received on Tuesday, 27 October 2009 20:21:30 UTC