Re: Rework of Bidi inline article

Date: Tue, 22 Nov 2011 16:17:05 +0200
Message-ID: <CA+FsOYZ=dZ1troJeU+_BW4J6TmEYfpcaTDsg5XcfC4QMNQ+cyw@mail.gmail.com>
To: Richard Ishida <ishida@w3.org>
Cc: Matitiahu Allouche <matial@il.ibm.com>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Sorry that it took me a while to answer. However, most of the time was
spent on formulating my suggestions below.

You are correct that people leave a gap and write the number from the
biggest digit to the smallest, since that is the order in which they have
the number in their head. You are also correct about the order in which
math is written in Hebrew and (supposed to be) written in Arabic. I do not
know Arabic and have never tried writing math right-to-left, so I don't
have a clue about the hand movements there. I would expect that they are
indeed a challenge.

Re the article, I think that the sections looking at solutions for the five
problematic cases ("Neutral ...", "Weak ...", "Nesting ...", "Adjacent
...", and "Handling unknown") taken together are quite long and tiring. I
believe that this stems from several issues:

1. The sections look like an arbitrary collection of cases, where for
each you try a bunch of techniques for fixing, and for some one technique
works best, and for some, another. The user is left with the impression
that in order to figure out how to deal with the case giving him
trouble, he will have to figure out which of these cases his most resembles
- a difficult mission for most users. And what if his case resembles more
than one of these cases? And what if the user is faced with plopping an
arbitrary, unpredictable piece of text into his page? I believe that what
is necessary is a clear statement that it is the occurrence of
*opposite-direction phrases* that causes all problems, with a concise
statement of how to
handle an opposite-direction phrase (whatever it may be) to make sure that
no problems arise. If this were so, the various cases would be just
examples of applying the general approach - and could be safely skipped by
a reader that understood the general approach. (There are, of course, two
general approaches: one for HTML4, and one for HTML5.)
2. Applying each of the HTML4 mark-up, HTML5 mark-up, and LRM/RLM
techniques to each of the cases is needlessly repetitive. Let's say that a
given technique works for one set of cases, and does not work for another.
Within each set, it works (or does not work) exactly the same way for all
cases, so mentioning it again and again for each case becomes repetitive.
3. The case definitions are needlessly fuzzy. For example, "weak
directional characters that appear at the wrong side of a directional run"
is a conflation of two very different cases: one where an
opposite-direction phrase starts with a number (that is part of it), and
one where an opposite-direction phrase is followed by a number (that is not
part of it). In HTML4, the two have completely different solutions.
4. There may be simply too many cases / examples.

So, here is my attempt at an alternative way of presenting the material,
starting from the beginning of "Where the algorithm needs help". I am
reusing your copy in many places, but watch out where I may have made
changes.

The bidi algorithm will handle text perfectly well in many situations, and
often no special markup or other device is needed other than to set the
overall direction for the document. However, the more a document mixes text
of both directions, the higher the chances that some of it will be
displayed not as intended. When this happens, extra mark-up or other
devices have to be added to the document to untangle the bidirectional text.

We will examine specific examples of what can go wrong, why it goes wrong,
and what fixes it in the sections below. Nevertheless, it is important to
realize that basically, the problems all occur when a text (e.g. a
document) in one direction has to include a phrase in the opposite
direction. Common examples of such "phrases" include quotations, formatted
names, such as brand names, acronyms, part numbers, site names, article
titles, place names, etc. Whenever an opposite-direction phrase occurs,
things can go wrong. That is, something will go wrong if the text includes,
without any special "wrapping", an opposite-direction phrase that:

- begins or ends with neutral characters
- begins with a number
- is followed by a number
- is followed by another, logically separate opposite-direction phrase
- contains one or more nested phrases whose direction is opposite to *it*

Although this list seems daunting, there is no need to determine which, if
any, of these cases applies to a particular phrase. There are canonical
ways of "wrapping" opposite-direction phrases that will prevent problems in
all of the cases above, and do no harm when none of them apply. We now
describe how such wrapping is done in the current generation of browsers,
and in HTML5.

Wrapping opposite-direction phrases in HTML4

The dir attribute

In principle, the right thing to do for *every* opposite-direction phrase
is to set its base direction by using the dir attribute on an element
tightly wrapping the phrase. (By "tightly wrapping", we mean that the
element contains the entire opposite-direction phrase, and nothing but the
opposite-direction phrase.) When none of the cases above apply, this will
not have any visible effect. But when one of them does apply, the dir
attribute is the right solution.

We can see dir in action in the following example, which tries (in the LTR
context of this page) to say "an introduction to C++" in Arabic, which
should look like "C++ مدخل إلى":

... C++: مدخل إلى C++

<span dir="rtl">... C++</span>: ++C مدخل إلى

<span dir="rtl">... <span dir="ltr">C++</span></span>: C++ مدخل إلى

The first attempt fails with the last word of the phrase, "C++", appearing
in the wrong place. This is because our RTL phrase is of opposite direction
to the (LTR) context, and contains a nested phrase of the original LTR
direction ("C++") inside it. The bidi algorithm, of course, has no way of
knowing that the "C++" is part of the RTL phrase, not of the LTR context,
and thus displays it as the latter: to the right of the Arabic words
instead of to their left. To fix this, we need to wrap the whole phrase in
a <span dir=rtl>.

That is our second attempt, and it still fails with the "C++" coming out as
"++C" instead. This happens because the "C++" is an LTR phrase ending in
neutral characters being displayed in the context of our RTL phrase. The
bidi algorithm has no way of knowing that the plus signs are part of the
LTR phrase, not of the RTL context, and thus displays them as part of the
context: to the left of the "C" instead of to its right.

Our third attempt finally succeeds. It wraps the overall RTL phrase in a
<span dir="rtl">, and the LTR phrase nested inside it in its own <span
dir="ltr">.

LRM/RLM

In addition to the dir attribute, the visual order in which text is
displayed can also be modified by using two invisible Unicode control
characters: LRM (LEFT-TO-RIGHT MARK, U+200E, &lrm; as a named entity), and
RLM (RIGHT-TO-LEFT MARK, U+200F, &rlm;). Each has the strong type indicated
by its name, but is invisible, like an invisible A and an invisible א.

One use of LRM and RLM is to *extend* a directional run through neutral or
weak characters at the start or end of an opposite-direction phrase, by
putting a mark of the same direction as the phrase on the other side of the
neutral or weak characters. For example, in our Arabic "Introduction to
C++" example above, instead of wrapping the "C++" in a <span dir="ltr">, we
could add an &lrm; after the second plus:

<span dir="rtl">... C++&lrm;</span>: C++ مدخل إلى

Being strongly LTR, the LRM extended the LTR run through the neutral pluses.
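
To make the "strong type" point concrete, here is a minimal sketch in
Python that looks up bidi classes with the standard unicodedata module.
The helper name bidi_class is only for illustration, and the choice of
Python is an assumption of this sketch, not something the article uses.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


# The marks carry the same strong types as a visible letter of that direction.
print(bidi_class(LRM), bidi_class("A"))       # L L
print(bidi_class(RLM), bidi_class("\u05d0"))  # R R   (U+05D0 is Hebrew alef)
# The plus sign has no strong direction of its own.
print(bidi_class("+"))                        # ES
```

The lookup reports "+" as ES, which is a weak type rather than a strictly
neutral one; the extension technique described above applies to neutral
and weak characters alike.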

Used this way, however, LRM and RLM are a bit like gotos in programming
languages: a quick hack that, unlike the dir attribute, says nothing about
the structure of the text. And they simply cannot be used to deal with an
opposite-direction phrase that happens to contain a nested phrase in the
original direction, like our complete "Introduction to C++" example above.
That may seem like an esoteric case, but it is surprisingly common when
displaying RTL data in an LTR page, because the use of LTR words (like
"C++") is not uncommon in RTL text. So, if you don't want to analyze
whether LRM and RLM can replace the use of the dir attribute in
*your* case, just use the dir attribute.

Nevertheless, it turns out that LRM and RLM do have an essential function
dealing with opposite-direction phrases in HTML4: *separating* an
opposite-direction phrase from a number or from a separate
opposite-direction phrase that happens to follow it, by putting between
them a mark of the same direction as the *context*. When used this way, LRM
and RLM do not replace the use of the dir attribute, but augment it.

<example for number, e.g. the restaurant example>
<example for two separate phrases, e.g. your use case 6>

Putting it all together in HTML4

To summarize, in HTML4, to make sure that an opposite-direction phrase is
displayed correctly, up to two steps are necessary:

1. Tightly wrap the opposite-direction phrase in an element that uses
the dir attribute to set the direction of the phrase. This is not always
necessary, but never does any harm.

2. If the opposite-direction phrase is followed (possibly after some
intervening neutral characters) by a number or a logically separate
opposite-direction phrase, separate the two with a directional mark
matching the direction of the context. If you do not want to check whether
this is actually the case, you can add a directional mark matching the
context's direction after every opposite-direction phrase.
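
For pages generated by code, the two steps can be sketched as a small
Python helper. This is only an illustration under stated assumptions: the
name wrap_html4 is hypothetical, the phrase is treated as plain text (a
nested phrase would still need its own wrapping), and the directional
mark is added after every opposite-direction phrase rather than checking
what follows it.

```python
from html import escape

# Named entities for the directional marks, keyed by direction.
MARKS = {"ltr": "&lrm;", "rtl": "&rlm;"}


def wrap_html4(phrase: str, phrase_dir: str, context_dir: str) -> str:
    """Wrap a plain-text phrase for insertion into HTML4 markup.

    Step 1: tightly wrap the phrase in <span dir=...>.
    Step 2: if the phrase runs opposite to the context, follow it with a
    mark matching the direction of the context.
    """
    if phrase_dir not in MARKS or context_dir not in MARKS:
        raise ValueError("directions must be 'ltr' or 'rtl'")
    wrapped = f'<span dir="{phrase_dir}">{escape(phrase)}</span>'
    if phrase_dir != context_dir:
        wrapped += MARKS[context_dir]
    return wrapped


# An Arabic country name dropped into an LTR sentence.
print(wrap_html4("\u0645\u0635\u0631", "rtl", "ltr"))
```

Escaping the phrase with html.escape is a design choice of the sketch, so
that data containing "<" or "&" cannot break the wrapping element.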

Wrapping opposite-direction phrases in HTML5

Note! This section describes features that are being introduced by HTML5.
At the time of writing, these features are not yet widely supported in
browsers, but the expectation is that they will be supported soon. In the
meantime, you should use these with extreme caution.

<bdi>

HTML5 introduces a new element, <bdi>, expressly for the purpose of
wrapping opposite-direction phrases. It is just like a <span>,
but directionally *isolates* its content from the surrounding text; bdi
stands for "bi-directional isolate". The effect of the isolation is that
you do not need to use LRM and RLM to separate an opposite-direction phrase
wrapped in <bdi> from a number or a logically separate opposite-direction
phrase that happens to follow it. Since it is actually quite rare *not* to
want to isolate an embedded phrase from its surroundings, <bdi> (when the
browser supports it) should be used instead of a <span> for bidi-wrapping,
while the use of LRMs and RLMs can be completely avoided.

Please note that <bdi> also comes with the dir attribute set to the new
"auto" value by default (see below).

dir="auto"

HTML5 also addresses another need: text dropped into a page, say from a
database, when you don't know its base direction. Before HTML5, you could
only set the dir attribute to "ltr" or "rtl", and had to somehow determine
which of them was appropriate yourself. HTML5 provides a new value for the
dir attribute: "auto". The "auto" value tells the browser to look at the
first strongly typed character in the element. If it's a right-to-left
typed character such as a Hebrew or Arabic letter, the element will get a
direction of "rtl". If it's, say, a Latin character, the direction will be
"ltr".

There are corner cases where this may not give the desired outcome, but it
should usually produce the desired result.

Note that the browser ignores any neutral or weak characters at the
beginning of the text when looking for the first strong character. It also
ignores anything inside a bdi element or an element with a dir attribute of
its own, including auto.
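
The first-strong rule is easy to approximate outside the browser. The
following Python sketch is an assumption-laden illustration, not the
browser's implementation: the function name is hypothetical, it works on
plain text only (so it does not skip nested bdi or dir-bearing elements),
and falling back to "ltr" when no strong character is found is a choice
made for this sketch.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Guess a base direction from the first strongly typed character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":            # strong left-to-right, e.g. Latin letters
            return "ltr"
        if bidi in ("R", "AL"):    # strong right-to-left, Hebrew and Arabic
            return "rtl"
        # Anything else (digits, punctuation, spaces) is skipped.
    return default


print(first_strong_direction("123 \u05d0\u05d1"))  # rtl: digits and space skipped
print(first_strong_direction("HTML \u05d5-CSS"))   # ltr: first strong char is "H"
```

The second call shows the kind of corner case mentioned above: a phrase
that begins with a Latin word is classified as LTR even if the rest of it
is Hebrew.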

Furthermore, dir=auto on any element also directionally isolates that
element's content from its surroundings as if it were a <bdi>. Thus, if you already
have an element like <a> or <cite> wrapping a phrase of unknown direction,
all your bidi wrapping needs are accomplished by adding a dir="auto" on the
existing element.

Not to be outdone, the bdi element behaves as if it has dir=auto by default
(i.e. unless an explicit dir="ltr" or dir="rtl" is specified).

The choice of whether to attach dir="auto" on an existing element or to
wrap the phrase in a <bdi> depends on whether you already have an element
tightly wrapping the potentially opposite-direction phrase, and whether you
happen to know the phrase's direction (or can guess at it better than the
browser's dir="auto" logic).

dir="auto" on <textarea> and <pre>

When used on the <textarea> and <pre> elements, dir="auto" does its
direction estimation for each paragraph of text in the element separately.
If one paragraph starts with an RTL character, and another with an LTR
character, the first will be displayed RTL, and the second LTR. This
follows the Unicode standard for plain text, in the elements usually used
to enter and display plain text content. When displaying plain text in a
different element, e.g. a <div> with the "white-space" style set to "pre"
or "pre-wrap", the same effect can be achieved by setting its
"unicode-bidi" style to "plaintext". [I am not sure if the last sentence is
appropriate in an article that mostly ignores CSS]
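
The per-paragraph behaviour can be sketched in Python as follows. The
sketch assumes that a "paragraph" is simply a line ended by a newline,
that a paragraph without any strong character falls back to LTR, and that
the helper names are hypothetical; it only estimates directions and does
no reordering.

```python
import unicodedata
from typing import List, Tuple


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Guess a base direction from the first strongly typed character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    return default


def paragraph_directions(text: str) -> List[Tuple[str, str]]:
    """Estimate the direction of each newline-separated paragraph separately."""
    return [(first_strong_direction(p), p) for p in text.split("\n")]


sample = "\u05e9\u05dc\u05d5\u05dd\nhello"
for direction, paragraph in paragraph_directions(sample):
    print(direction, repr(paragraph))
```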

Putting it all together for HTML5

To summarize, in HTML5, to make sure that a phrase that may have the
opposite direction is displayed correctly, do the following:

1. If you know the phrase's direction (or have a better way of
determining it than the method used by dir=auto), wrap the phrase in <bdi
dir="ltr"> or <bdi dir="rtl">, as appropriate. Do this even if the phrase
has the same direction as the context, just in case it happens to end in
strongly typed characters of the opposite direction, and happens to be
followed by a number or a separate opposite-direction phrase.

2. Otherwise, if the phrase is already tightly wrapped by an element,
add dir="auto" to that element.

3. Otherwise, wrap the phrase in <bdi>. Without an explicit dir value,
dir="auto" is implied.
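
For generated pages, the HTML5 choices can be sketched as two small
Python helpers. Both names are hypothetical and the phrase is assumed to
be plain text: wrap_html5 covers the <bdi> cases (known or unknown
direction), and wrap_existing covers dir="auto" on an element that
already tightly wraps the phrase, as described earlier for elements such
as <a> or <cite>.

```python
from html import escape
from typing import Optional


def wrap_html5(phrase: str, direction: Optional[str] = None) -> str:
    """Wrap a plain-text phrase in <bdi>, with an explicit dir if known."""
    if direction is None:
        # No dir attribute: bdi then behaves as if it had dir="auto".
        return f"<bdi>{escape(phrase)}</bdi>"
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr', 'rtl' or None")
    return f'<bdi dir="{direction}">{escape(phrase)}</bdi>'


def wrap_existing(tag: str, phrase: str) -> str:
    """Put dir="auto" on an element that already tightly wraps the phrase."""
    return f'<{tag} dir="auto">{escape(phrase)}</{tag}>'


print(wrap_html5("C++", "ltr"))
print(wrap_html5("\u0645\u0635\u0631"))
print(wrap_existing("cite", "\u0645\u0635\u0631"))
```

Unlike the HTML4 case, no directional mark is appended, since the
isolation provided by <bdi> and dir="auto" makes it unnecessary.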

you think are most illustrative. Each would be entitled by a simple name
describing the example, e.g. "The MAC address", not by complicated
typology. Each would give the recommended solution in HTML4 and HTML5. If
you feel it is necessary, point out the examples that can be fixed by
LRM/RLM alone, when discussing the HTML4 solution. Do not do that for
HTML5, and I am not sure it is worth doing at all.

Aharon

On Tue, Nov 15, 2011 at 5:01 PM, Richard Ishida <ishida@w3.org> wrote:

> Hi Aharon, Mati,
>
> I have just done a first draft pass over the document
> http://www.w3.org/International/tutorials/new-bidi-xhtml/Overview-inline.en.php
> (in particular from here down:
> http://www.w3.org/International/tutorials/new-bidi-xhtml/Overview-inline.en.php#where
> ).
>
> Bearing in mind that this is a first pass, would you mind scanning it and
> letting me know whether you think I'm on the right track?  Hopefully it
> responds to the structure related comments below. (I'm still planning to
> revisit some of the more detailed comments.)
>
> btw, I haven't yet decided what to do with the section entitled "More
> examples".
>
> Thanks!
> RI
>
>
> PS: Any thoughts on this: https://plus.google.com/**
>
>
>
> On 28/10/2011 21:09, Richard Ishida wrote:
>
>> I have begun a substantial reorganization and rewrite of the following
>> section:
>>
>> http://www.w3.org/International/tutorials/new-bidi-xhtml/Overview-inline.en.php#where
>>
>>
>>
>> RI
>>
>>
>>
>> On 25/10/2011 17:58, Richard Ishida wrote:
>>
>>> Hi Aharon, and thanks for your comments. I was hoping to discuss with
>>> you at the Unicode conf, but that wasn't to be, so here is a quick dash
>>> at my thoughts (since I have to go out soon).
>>>
>>> I actually agree with pretty much everything you say, but the concern I
>>> had was to do with Martin's previous post about the fact that these
>>> things are not yet supported widely, and how to manage expectations in
>>> that regard.
>>>
>>> Even where implementation is there (eg. for dir=auto on Chrome (although
>>> not <bdi> afaict!)) it will be some time before the new constructs can
>>> be relied upon on their own, due to legacy browser usage (esp. IE8).
>>>
>>> My original thought was to 'cordon off' the new stuff into its own
>>> section with a big disclaimer, so that it is clear that this stuff
>>> doesn't work quite yet, and then merge it in to the mainstream gradually
>>> as support increases.
>>>
>>> However, I think you might be right that we should integrate from the
>>> start. The challenge will be to do so in a way that makes it clear to
>>> the reader what currently works and what doesn't.
>>>
>>> That said, I'm still a little worried about the legacy aspect of this.
>>>
>>> I've seen a few places in my own pages where I'm inclined to add
>>> dir=auto or bdi right now, but I know that I will still need to also use
>>> the rlm/lrm for at least a couple of years to cater for the IE8
>>> corporate legacy.
>>>
>>> Using both will be messy, for explanation as well as for content
>>> authoring.
>>>
>>> I'm wondering whether a way around this is to use CSS. For example, in a
>>> LTR page or context, the CSS rule
>>>
>>> bdi:before { content: '\200E '; }
>>>
>>> will cause
>>>
>>> <p>The names of these states in Arabic are <bdi>مصر</bdi>,
>>> <bdi>البحرين</bdi> and <bdi>الكويت</bdi> respectively.</p>
>>>
>>> to display as expected, even if bdi is not supported.
>>>
>>> I suspect we may need to distinguish between cases, such as input
>>> fields, where the rlm/lrm is not appropriate (because it doesn't help),
>>> and situations like the example above, where it can help (either for bdi
>>> or dir=auto).
>>>
>>> Actually, the CSS should probably be genericised to say something like,
>>> if the direction of the parent element is RTL use rlm, and vice versa,
>>> but I think that that capability too is only now being introduced.
>>>
>>> What do you think?
>>>
>>> RI
>>>
>>>
>>>
>>> On 14/10/2011 13:04, Aharon (Vladimir) Lanin wrote:
>>>
>>>> I think that the bdi element and the idea of isolation should appear
>>>> much earlier in the article, long before unknown direction. Basically,
>>>> when you introduce <span dir=...> in "A simple solution" (after "Nesting
>>>> base direction"), you should also mention that HTML5 defines a new
>>>> element, <bdi>, that should be preferred over <span> for this purpose,
>>>> once browsers start to support it, because it also isolates the nested
>>>> phrase from its surroundings, thus preventing it influencing their
>>>> display. You can say that there are examples coming up.
>>>>
>>>> "Adjacent, same-direction directional runs that are incorrectly ordered"
>>>> is an excellent example for the use of <bdi>. I think you should take
>>>> out the sentence "Putting markup around the comma is a bit like cracking
>>>> an egg with a hammer in this case." I think that mark-up generally is
>>>> the preferred solution, when it states something that makes sense. As I
>>>> will explain below, enclosing the comma in a <span dir=ltr> makes no
>>>> sense, and should not even be mentioned, since it will not work. On the
>>>> other hand, enclosing each of the RTL items in the list (but not the
>>>> commas or spaces between them) in a <bdi dir=rtl> makes perfect sense,
>>>> i.e.:
>>>>
>>>> The names of these states in Arabic are <bdi dir="rtl">مصر</bdi>, <bdi
>>>> dir="rtl">البحرين</bdi> and <bdi dir="rtl">الكويت</bdi> respectively.
>>>>
>>>> You can say that in this example, the dir="rtl"s actually don't change
>>>> anything, and in fact that just the first <bdi> is sufficient to fix the
>>>> problem, but there is nothing wrong with marking every embedded
>>>> opposite-direction phrase in a <bdi> - it won't hurt, and will often
>>>> prevent problems.
>>>>
>>>> As I said before, putting a <span dir=ltr> around the comma does not
>>>> make sense, and should not be mentioned at all. Why specifically the
>>>> comma, and not, say the space next to it? Furthermore, a <span dir=...>
>>>> is an /embedding/ - which is not really true for the comma: it's a part
>>>> of the enclosing LTR sentence, not a piece of LTR embedded within - i.e.
>>>> a part of - some RTL. In fact, putting the <span dir=ltr> around the
>>>> comma puts the comma in the wrong place when there is no space between
>>>> it and the RTL text preceding it.
>>>>
>>>> In "More examples", the Hebrew "W3C ... ERCIM" examples should really
>>>> start with "ה-" immediately before the "W3C", i.e. the desired output
>>>> should be:
>>>> ה-W3C‏ (World Wide Web Consortium) מעביר את שירותי הארחה באירופה ל -
>>>> ERCIM.
>>>>
>>>> This too is actually a great place to use <bdi>:
>>>>
>>>> ה-<bdi dir="ltr">W3C</bdi> (<bdi dir="ltr">World Wide Web
>>>> Consortium</bdi>) מעביר את שירותי הארחה באירופה ל-<bdi
>>>> dir="ltr">ERCIM</bdi>.
>>>>
>>>> Once again, you don't actually need the dir="ltr" on any of these, and
>>>> just the first or second <bdi> will be sufficient alone to fix the
>>>> problem, but in principle the safe way to write this sentence is as
>>>> above.
>>>>
>>>> I think that the <bdi> solution - once it is available in browsers - is
>>>> preferable to using &rlm;, because it makes intuitive sense. You simply
>>>> mark the embedded opposite-direction phrases, each one on its own. Until
>>>> someone actually understands the UBA - which very few people do - using
>>>> LRM and RLM seems like voodoo. Few people know when they should use LRM
>>>> and when they should use RLM, and where exactly they should put it.
>>>>
>>>> IMO, the same applies to all the other examples in this section. The
>>>> best way to deal with them, when it becomes available, is <bdi dir=ltr>
>>>> (or just <bdi>, because of dir=auto, but we don't have to mention that
>>>> yet), not an LRM, and not <span dir=ltr>.
>>>>
>>>> In "Handling unknown text", if you are looking for a real RTL book title
>>>> that contains some LTR word(s), but does not begin with them (so that
>>>> dir=auto will work well with it), there is
>>>> :
>>>>
>>>>
>>>> מבוא לתכנות בסביבת אינטרנט - מבוא ו- HTML
>>>>
>>>> Please note that the Google Books page has a bug: the title as displayed
>>>> at the top of the page is always in the direction of the UI. However,
>>>> the title displayed near the bottom of the page, after "Title:" is
>>>> displayed using the word-count direction estimation algorithm. It gets
>>>> this book title right.
>>>>
>>>> to look for Hebrew-language books containing one of the words HTML, CSS,
>>>> and JavaScript, the majority of the book titles I found /began/ with the
>>>> LTR word, so dir=auto's first-strong algorithm does not work well on
>>>> them. I had tried to push through word-count for dir=auto, but failed to
>>>> convince people. Examples:
>>>>
>>>>
>>>> For this reason, I think it is worthwhile to tone down the statement
>>>> that "There are some rare corner cases where this may not give the
>>>> desired outcome, but in the majority of cases it should produce the
>>>> expected result." I would take out the words "some rare", and you could
>>>> also add on "particularly when the embedded text does not mix LTR and
>>>> RTL words and the problem is limited to things like trailing
>>>> punctuation, leading numbers, and phone numbers."
>>>>
>>>> On Thu, Oct 13, 2011 at 8:09 PM, Richard Ishida <ishida@w3.org> wrote:
>>>>
>>>> On 19/09/2011 16:04, [Mati] wrote:
>>>>
>>>> http://www.w3.org/International/tutorials/new-bidi-xhtml/qa-html-dir.php
>>>>
>>>>
>>>>
>>>> 11) In section "Using dir="auto" with the input element", the first
>>>> Hebrew word of the example is not known to me and is probably a
>>>> typo. I can't even guess what the intended word was.
>>>>
>>>>
>>>> On 20/09/2011 09:38, [Mati] wrote:
>>>>
>>>> http://www.w3.org/International/tutorials/new-bidi-xhtml/Overview-inline.en.php
>>>>
>>>>
>>>>
>>>> DON'T show email on public list.
>>>>
>>>> Name: Matitiahu Allouche
>>>> Email: matial@il.ibm.com
>>>>
>>>>
>>>> This is the continuation of comments that I sent in a previous
>>>> submission.
>>>>
>>>> 18) In section "Second use case", the first Hebrew word of the
>>>> book title differs between its mention in the body of the text
>>>> and its mention in the message. The form in the message is the
>>>> correct one.
>>>>
>>>>
>>>>
>>>> I think I was trying to use the title of the article at
>>>> http://www.w3.org/International/questions/qa-css-charset.he.php
>>>> (though why that's different, I'm not sure). But at the time I only
>>>> grabbed that quickly because I was in a hurry.
>>>>
>>>> Would you or Aharon be able to provide me with a real book title
>>>> that has similar properties? (ie. ending with CSS or some such).
>>>> (Maybe one of these?
>>>> >)
>>>>
>>>> Cheers,
>>>>
>>>> RI
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Richard Ishida
>>>> W3C (World Wide Web Consortium)
>>>>
>>>> http://www.w3.org/International/
>>>> http://rishida.net/
>>>>
>>>>
>>>>
>>>
>>
> --
> Richard Ishida