Re: Rework of Bidi inline article

Hi Aharon,

Given that Firefox now also supports bdi, I've been working on the bidi 
articles again.

I began integrating your ideas at

(The old version is still accessible at

However I had some trouble with your summary of how to handle markup in 
HTML5.  I proposed a restructuring of what I thought you were saying, 
one that takes into account opposite-direction phrases that are followed 
by a number or by another, logically separate phrase.

I also suggest a third, alternative approach, which I think is far 
easier for the content author to work with (use bdi and auto pretty much 
all the time).

Below that algorithmic work, you'll see some worked examples for the new 

Please let me know what you think.  Bear in mind that this is still very 


On 22/11/2011 14:17, Aharon (Vladimir) Lanin wrote:
> Sorry that it took me a while to answer. However, most of the time was
> spent on formulating my suggestions below.
> Re
> <>, you
> are correct that people leave a gap and write the number from the
> biggest digit to the smallest, since that is the order in which they
> have the number in their head. You are also correct about the order in
> which math is written in Hebrew and (supposed to be) written in Arabic.
> I do not know Arabic and have never tried writing math right-to-left, so
> I don't have a clue about the hand movements there. I would expect that
> they are indeed a challenge.
> Re the article, I think that the sections looking at solutions for the
> five problematic cases ("Neutral ...", "Weak ...", "Nesting ...",
> "Adjacent ...", and "Handling unknown") taken together are quite long
> and tiring. I believe that this stems from several issues:
>  1. The sections look like an arbitrary collection of cases, where for
>     each you try a bunch of techniques for fixing, and for some one
>     technique works best, and for some, another. The user is left with
>     the impression that in order to figure out how to deal with the case
>     giving him trouble, he will have to figure out which of these cases
>     his most resembles - a difficult mission for most users. And what if
>     his case resembles more than one of these cases? And what if the
>     user is faced with plopping an arbitrary, unpredictable piece of
>     text into his page? I believe that what is necessary is a clear
>     statement that it is the occurrence of /opposite-direction phrases
>     /that causes all problems, with a concise statement of how to handle
>     an opposite-direction phrase (whatever it may be) to make sure that
>     no problems arise. If this were so, the various cases would be just
>     examples of applying the general approach - and could be safely
>     skipped by a reader that understood the general approach. (There
>     are, of course, two general approaches: one for HTML4, and one for
>     HTML5.)
>  2. Applying each of the HTML4 mark-up, HTML5 mark-up, and LRM/RLM
>     techniques to each of the cases is needlessly repetitive. Let's say
>     that a given technique works for one set of cases, and does not work
>     for another. Within each set, it works (or does not work) exactly
>     the same way for all cases, so mentioning it again and again for
>     each case becomes repetitive.
>  3. The case definitions are needlessly fuzzy. For example, "weak
>     directional characters that appear at the wrong side of a
>     directional run" is a conflation of two very different cases: one
>     where an opposite-direction phrase starts with a number (that is
>     part of it), and one where an opposite-direction phrase is followed
>     by a number (that is not part of it). In HTML4, the two have
>     completely different solutions.
>  4. There may be simply too many cases / examples.
> So, here is my attempt at an alternative way of presenting the material,
> starting from the beginning of "Where the algorithm needs help". I am
> reusing your copy in many places, but watch out where I may have made
> changes.
>     The bidi algorithm will handle text perfectly well in many
>     situations, and often no special markup or other device is needed
>     other than to set the overall direction for the document. However,
>     the more a document mixes text of both directions, the higher the
>     chances that some of it will be displayed not as intended. When this
>     happens, extra mark-up or other devices have to be added to the
>     document to untangle the bidirectional text.
>     We will examine specific examples of what can go wrong, why it goes
>     wrong, and what fixes it in the sections below. Nevertheless, it is
>     important to realize that basically, the problems all occur when a
>     text (e.g. a document) in one direction has to include a phrase in
>     the opposite direction. Common examples of such "phrases" include
>     quotations, formatted numbers (e.g. phone numbers and MAC
>     addresses), addresses, and various names, such as brand names,
>     acronyms, part numbers, site names, article titles, place names,
>     etc. Whenever an opposite-direction phrase occurs, things can go
>     wrong. That is, something will go wrong if the text includes,
>     without any special "wrapping", an opposite-direction phrase that:
>       * begins or ends with neutral characters
>       * begins with a number
>       * is followed by a number
>       * is followed by another, logically separate opposite-direction phrase
>       * contains one or more nested phrases whose direction is opposite
>         to /it/
>     Although this list seems daunting, there is no need to determine
>     which, if any, of these cases applies to a particular phrase. There
>     are canonical ways of "wrapping" opposite-direction phrases that
>     will prevent problems in all of the cases above, and do no harm when
>     none of them apply. We now describe how such wrapping is done in the
>     current generation of browsers, and in HTML5.
>     Wrapping opposite-direction phrases in HTML4
>     The dir attribute
>     In principle, the right thing to do for /every/ opposite-direction
>     phrase is to set its base direction by using the dir attribute on an
>     element tightly wrapping the phrase. (By "tightly wrapping", we mean
>     that the element contains the entire opposite-direction phrase, and
>     nothing but the opposite-direction phrase.) When none of the cases
>     above apply, this will not have any visible effect. But when one of
>     them does apply, the dir attribute is the right solution.
>     We can see dir in action in the following example, which tries (in
>     the LTR context of this page) to say "an introduction to C++" in
>     Arabic, which should look like "C++ مدخل إلى":
>     ... C++: مدخل إلى C++
>     <span dir="rtl">... C++</span>: ++C مدخل إلى
>     <span dir="rtl">... <span dir="ltr">C++</span></span>: C++ مدخل إلى
>     The first attempt fails with the last word of the phrase, "C++",
>     appearing in the wrong place. This is because our RTL phrase is of
>     opposite direction to the (LTR) context, and contains a nested
>     phrase of the original LTR direction ("C++") inside it. The bidi
>     algorithm, of course, has no way of knowing that the "C++" is part
>     of the RTL phrase, not of the LTR context, and thus displays it as
>     the latter: to the right of the Arabic words instead of to their
>     left. To fix this, we need to wrap the whole phrase in a <span dir=rtl>.
>     That is our second attempt, and it still fails with the "C++" coming
>     out as "++C" instead. This happens because the "C++" is an LTR
>     phrase ending in neutral characters being displayed in the context
>     of our RTL phrase. The bidi algorithm has no way of knowing that the
>     plus signs are part of the LTR phrase, not of the RTL context, and
>     thus displays them as part of the context: to the left of the "C"
>     instead of to its right.
>     Our third attempt finally succeeds. It wraps the overall RTL phrase
>     in a <span dir="rtl">, and the LTR phrase nested inside it in its
>     own <span dir="ltr">.
>     LRM/RLM
>     In addition to the dir attribute, the visual order in which text is
>     displayed can also be modified by using two invisible Unicode
>     control characters: LRM (LEFT-TO-RIGHT MARK, U+200E, &lrm; as a
>     named entity), and RLM (RIGHT-TO-LEFT MARK, U+200F, &rlm;). Each has
>     the strong type indicated by its name, but is invisible, like an
>     invisible A and an invisible א.
>     One use of LRM and RLM is to /extend/ a directional run through
>     neutral or weak characters at the start or end of an
>     opposite-direction phrase, by putting a mark of the same direction
>     as the phrase on the other side of the neutral or weak characters.
>     For example, in our Arabic "Introduction to C++" example above,
>     instead of wrapping the "C++" in a <span dir="ltr">, we could add an
>     &lrm; after the second plus:
>     <span dir="rtl">... C++&lrm;</span>: C++ مدخل إلى
>     Being strongly LTR, the LRM extended the LTR run through the neutral
>     pluses.
>     Used this way, however, LRM and RLM are a bit like gotos in
>     programming languages: a quick hack that, unlike the dir attribute,
>     says nothing about the structure of the text. And they simply cannot
>     be used to deal with an opposite-direction phrase that happens to
>     contain a nested phrase in the original direction, like our complete
>     "Introduction to C++" example above. That may seem like an esoteric
>     case, but it is surprisingly common when displaying RTL data in an
>     LTR page, because the use of LTR words (like "C++") is not uncommon
>     in RTL text. So, if you don't want to analyze whether LRM and RLM
>     can replace the use of the dir attribute in /your/ case, just use
>     the dir attribute.
>     Nevertheless, it turns out that LRM and RLM do have an essential
>     function dealing with opposite-direction phrases in HTML4:
>     /separating /an opposite-direction phrase from a number or from a
>     separate opposite-direction phrase that happens to follow it, by
>     putting between them a mark of the same direction as the /context/.
>     When used this way, LRM and RLM do not replace the use of the dir
>     attribute, but augment it.
>     <example for number, e.g. the restaurant example>
>     <example for two separate phrases, e.g. your use case 6>
>     Putting it all together in HTML4
>     To summarize, in HTML4, to make sure that an opposite-direction
>     phrase is displayed correctly, up to two steps are necessary:
>      1. Tightly wrap the opposite-direction phrase in an element that
>         uses the dir attribute to set the direction of the phrase. This
>         is not always necessary, but never does any harm.
>      2. If the opposite-direction phrase is followed (possibly after
>         some intervening neutral characters) by a number or a logically
>         separate opposite-direction phrase, separate the two with a
>         directional mark matching the direction of the context. If you
>         do not want to check whether this is actually the case, you can
>         add a directional mark matching the context's direction after
>         every opposite-direction phrase.
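The two HTML4 steps above can be sketched as a small helper. This is only an illustration in Python, not something taken from the article: the name wrap_html4 and its parameters are assumptions, and the caller is expected to know both directions already.

```python
from html import escape

DIRECTIONS = ("ltr", "rtl")


def wrap_html4(phrase, phrase_dir, context_dir):
    """Wrap phrase for display inside HTML4 text whose direction is context_dir.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if phrase_dir not in DIRECTIONS or context_dir not in DIRECTIONS:
        raise ValueError("directions must be 'ltr' or 'rtl'")
    # Step 1: tightly wrap the phrase in an element carrying the dir attribute.
    wrapped = '<span dir="{}">{}</span>'.format(phrase_dir, escape(phrase))
    # Step 2: after an opposite-direction phrase, add a mark matching the
    # context, so a following number or separate phrase is not pulled in.
    if phrase_dir != context_dir:
        wrapped += "&lrm;" if context_dir == "ltr" else "&rlm;"
    return wrapped
```

Calling wrap_html4("مصر", "rtl", "ltr") would give the span followed by &lrm;. For a phrase in the same direction as its context the sketch adds only the span, which per step 1 does no harm.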
>     Wrapping opposite-direction phrases in HTML5
>     note! This section describes features that are being introduced by
>     HTML5. At the time of writing, these features are not yet widely
>     supported in browsers, but the expectation is that they will be
>     supported soon. In the meantime, you should use these with extreme
>     caution.
>     <bdi>
>     HTML5 introduces a new element, <bdi>, expressly for the purpose of
>     wrapping opposite-direction phrases. It is just like a <span>,
>     but directionally /isolates/ its content from the surrounding text;
>     bdi stands for "bi-directional isolate". The effect of the isolation
>     is that you do not need to use LRM and RLM to separate an
>     opposite-direction phrase wrapped in <bdi> from a number or a
>     logically separate opposite-direction phrase that happens to follow
>     it. Since it is actually quite rare /not/ to want to isolate
>     an embedded phrase from its surroundings, <bdi> (when the browser
>     supports it) should be used instead of a <span> for bidi-wrapping,
>     while the use of LRMs and RLMs can be completely avoided.
>     Please note that <bdi> also comes with the dir attribute set to the
>     new "auto" value by default (see below).
>     dir="auto"
>     HTML5 also addresses another need: text dropped into a page, say
>     from a database, when you don't know its base direction. Before
>     HTML5, you could only set the dir attribute to "ltr" or "rtl", and
>     had to somehow determine which of them was appropriate yourself.
>     HTML5 provides a new value for the dir attribute: "auto". The "auto"
>     value tells the browser to look at the first strongly typed
>     character in the element. If it's a right-to-left typed character
>     such as a Hebrew or Arabic letter, the element will get a direction
>     of "rtl". If it's, say, a Latin character, the direction will be "ltr".
>     There are corner cases where this may not give the desired outcome,
>     but it should usually produce the desired result.
>     Note that the browser ignores any neutral or weak characters at the
>     beginning of the text when looking for the first strong character.
>     It also ignores anything inside a bdi element or an element with a
>     dir attribute of its own, including auto.
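As a rough illustration of the first-strong check described above, here is a Python sketch using the standard unicodedata module. The function name and the fallback used when the text has no strong character at all are assumptions. It only sees plain text, so it does not model skipping nested bdi or dir-bearing elements.

```python
import unicodedata

# Unicode bidi classes treated as strong: L is left-to-right,
# R (e.g. Hebrew) and AL (Arabic letters) are right-to-left.
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def first_strong_direction(text, default="ltr"):
    """Estimate direction from the first strongly typed character in text.

    Neutral and weak characters (spaces, punctuation, digits) are skipped.
    The default for text with no strong character is an assumption.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_LTR:
            return "ltr"
        if bidi_class in STRONG_RTL:
            return "rtl"
    return default
```

Leading digits or punctuation do not affect the result, but a phrase that is logically RTL and happens to begin with a Latin word such as "C++" would be estimated as "ltr", which is one of the corner cases mentioned above.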
>     Furthermore, dir=auto on any element also directionally isolates that
>     element from its surroundings as if it were a <bdi>. Thus, if you
>     already have an element like <a> or <cite> wrapping a phrase of
>     unknown direction, all your bidi wrapping needs are accomplished by
>     adding a dir="auto" on the existing element.
>     Not to be outdone, the bdi element behaves as if it has dir=auto by
>     default (i.e. unless an explicit dir="ltr" or dir="rtl" is specified).
>     The choice of whether to attach dir="auto" on an existing element or
>     to wrap the phrase in a <bdi> depends on whether you already have an
>     element tightly wrapping the potentially opposite-direction phrase,
>     and whether you happen to know the phrase's direction (or can guess
>     at it better than the browser's dir="auto" logic).
>     dir="auto" on <textarea> and <pre>
>     When used on the <textarea> and <pre> elements, dir="auto" does its
>     direction estimation for each paragraph of text in the element
>     separately. If one paragraph starts with an RTL character, and
>     another with an LTR character, the first will be displayed RTL, and
>     the second LTR. This follows the Unicode standard for plain text,
>     in the elements usually used to enter and display plain text
>     content. When displaying plain text in a different element, e.g.
>     a <div> with the "white-space" style set to "pre" or "pre-wrap", the
>     same effect can be achieved by setting its "unicode-bidi" style to
>     "plaintext". [I am not sure if the last sentence is appropriate in
>     an article that mostly ignores CSS]
>     Putting it all together for HTML5
>     To summarize, in HTML5, to make sure that a phrase that may have the
>     opposite direction is displayed correctly, do the following:
>      1. If you know the phrase's direction (or have a better way of
>         determining it than the method used by dir=auto), wrap the
>         phrase in <bdi dir="ltr"> or <bdi dir="rtl">, as appropriate. Do
>         this even if the phrase has the same direction as the context,
>         just in case it happens to end in strongly typed characters of
>         the opposite direction, and happens to be followed by a number
>         or a separate opposite-direction phrase.
>      2. Otherwise, if the phrase is already tightly wrapped by an
>         element, add dir="auto" to the element.
>      3. Otherwise, wrap the phrase in <bdi>. Without an explicit dir
>         value, dir="auto" is implied.
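The three HTML5 steps above could be expressed as one decision function. This is a sketch in Python under stated assumptions: the name wrap_html5 and the parameters known_dir and existing_tag are hypothetical, and nesting the <bdi> inside an existing element when the direction is known is just one reading of step 1.

```python
from html import escape


def wrap_html5(phrase, known_dir=None, existing_tag=None):
    """Return HTML5 markup for a phrase that may be of opposite direction.

    Hypothetical helper: known_dir is "ltr", "rtl" or None (unknown);
    existing_tag names an element that already tightly wraps the phrase.
    """
    if known_dir not in (None, "ltr", "rtl"):
        raise ValueError("known_dir must be None, 'ltr' or 'rtl'")
    text = escape(phrase)
    if known_dir is not None:
        # Step 1: the direction is known, so state it explicitly on a <bdi>.
        inner = '<bdi dir="{}">{}</bdi>'.format(known_dir, text)
        if existing_tag is None:
            return inner
        return "<{0}>{1}</{0}>".format(existing_tag, inner)
    if existing_tag is not None:
        # Step 2: reuse the element that already tightly wraps the phrase.
        return '<{0} dir="auto">{1}</{0}>'.format(existing_tag, text)
    # Step 3: a bare <bdi> behaves as if it had dir="auto".
    return "<bdi>{}</bdi>".format(text)
```

Unlike the HTML4 approach, no LRM or RLM is appended in any branch, since the isolation is expected to make the separating mark unnecessary.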
> What would follow is an "Additional examples" section, for whatever
> cases you think are most illustrative. Each would be entitled by a
> simple name describing the example, e.g. "The MAC address", not by
> complicated typology. Each would give the recommended solution in HTML4
> and HTML5. If you feel it is necessary, point out the examples that can
> be fixed by LRM/RLM alone, when discussing the HTML4 solution. Do not do
> that for HTML5, and I am not sure it is worth doing at all.
> Aharon
> On Tue, Nov 15, 2011 at 5:01 PM, Richard Ishida wrote:
>     Hi Aharon, Mati,
>     I have just done a first draft pass over the document
>     <>
>     (in particular from here down:
>     <>).
>     Bearing in mind that this is a first pass, would you mind scanning
>     it and letting me know whether you think i'm on the right track?
>       Hopefully it responds to the structure related comments below.
>     (I'm still planning to revisit some of the more detailed comments.)
>     btw, I haven't yet decided what to do with the section entitled
>     "More examples".
>     Thanks!
>     RI
>     PS: Any thoughts on this:
>     <> ?
>     On 28/10/2011 21:09, Richard Ishida wrote:
>         I have begun a substantial reorganization and rewrite of the
>         following
>         section:
>         <>
>         RI
>         On 25/10/2011 17:58, Richard Ishida wrote:
>             Hi Aharon, and thanks for your comments. I was hoping to
>             discuss with
>             you at the Unicode conf, but that wasn't to be, so here is a
>             quick dash
>             at my thoughts (since I have to go out soon).
>             I actually agree with pretty much everything you say, but
>             the concern I
>             had was to do with Martin's previous post about the fact
>             that these
>             things are not yet supported widely, and how to manage
>             expectations in
>             that regard.
>             Even where implementation is there (eg. for dir=auto on
>             Chrome (although
>             not <bdi> afaict!)) it will be some time before the new
>             constructs can
>             be relied upon on their own, due to legacy browser usage
>             (esp. IE8).
>             My original thought was to 'cordon off' the new stuff into
>             its own
>             section with a big disclaimer, so that it is clear that this
>             stuff
>             doesn't work quite yet, and then merge it in to the
>             mainstream gradually
>             as support increases.
>             However, I think you might be right that we should integrate
>             from the
>             start. The challenge will be to do so in a way that makes it
>             clear to
>             the reader what currently works and what doesn't.
>             That said, I'm still a little worried about the legacy
>             aspect of this.
>             I've seen a few places in my own pages where I'm inclined to add
>             dir=auto or bdi right now, but I know that i will still need
>             to also use
>             the rlm/lrm for at least a couple of years to cater for the IE8
>             corporate legacy.
>             Using both will be messy, for explanation as well as for content
>             authoring.
>             I'm wondering whether a way around this is to use CSS. For
>             example, in a
>             LTR page or context, the CSS rule
>             bdi:before { content: '\200E '; }
>             will cause
>             <p>The names of these states in Arabic are <bdi>مصر</bdi>,
>             <bdi>البحرين</bdi> and <bdi>الكويت</bdi> respectively.</p>
>             to display as expected, even if bdi is not supported.
>             I suspect we may need to distinguish between cases, such as
>             input
>             fields, where the rlm/lrm is not appropriate (because it
>             doesn't help),
>             and situations like the example above, where it can help
>             (either for bdi
>             or dir=auto).
>             Actually, the CSS should probably be genericised to say
>             something like,
>             if the direction of the parent element is RTL use rlm, and
>             vice versa,
>             but I think that that capability too is only now being
>             introduced.
>             What do you think?
>             RI
>             On 14/10/2011 13:04, Aharon (Vladimir) Lanin wrote:
>                 I think that the bdi element and the idea of isolation
>                 should appear
>                 much earlier in the article, long before unknown
>                 direction. Basically,
>                 when you introduce <span dir=...> in "A simple solution"
>                 (after "Nesting
>                 base direction"), you should also mention that HTML5
>                 defines a new
>                 element, <bdi>, that should be preferred over <span> for
>                 this purpose,
>                 once browsers start to support it, because it also
>                 isolates the nested
>                 phrase from its surroundings, thus preventing it
>                 influencing their
>                 display. You can say that there are examples coming up.
>                 "Adjacent, same-direction directional runs that are
>                 incorrectly ordered"
>                 is an excellent example for the use of <bdi>. I think
>                 you should take
>                 out the sentence "Putting markup around the comma is a
>                 bit like cracking
>                 an egg with a hammer in this case." I think that mark-up
>                 generally is
>                 the preferred solution, when it states something that
>                 makes sense. As I
>                 will explain below, enclosing the comma in a <span
>                 dir=ltr> makes no
>                 sense, and should not even be mentioned, since it will
>                 not work. On the
>                 other hand, enclosing each of the RTL items in the list
>                 (but not the
>                 commas or spaces between them) in a <bdi dir=rtl> makes
>                 perfect sense,
>                 i.e.:
>                 The names of these states in Arabic are <bdi
>                 dir="rtl">مصر</bdi>, <bdi
>                 dir="rtl">البحرين</bdi> and <bdi dir="rtl">الكويت</bdi>
>                 respectively.
>                 You can say that in this example, the dir="rtl"s
>                 actually don't change
>                 anything, and in fact that just the first <bdi> is
>                 sufficient to fix the
>                 problem, but there is nothing wrong with marking every
>                 embedded
>                 opposite-direction phrase in a <bdi> - it won't hurt,
>                 and will often
>                 prevent problems.
>                 As I said before, putting a <span dir=ltr> around the
>                 comma does not
>                 make sense, and should not be mentioned at all. Why
>                 specifically the
>                 comma, and not, say the space next to it? Furthermore, a
>                 <span dir=...>
>                 is an /embedding/ - which is not really true for the
>                 comma: it's a part
>                 of the enclosing LTR sentence, not a piece of LTR
>                 embedded within - i.e.
>                 a part of - some RTL. In fact, putting the <span
>                 dir=ltr> around the
>                 comma puts the comma in the wrong place when there is no
>                 space between
>                 it and the RTL text preceding it.
>                 In "More examples", the Hebrew "W3C ... ERCIM" examples
>                 should really
>                 start with "ה-" immediately before the "W3C", i.e. the
>                 desired output
>                 should be:
>                 ה-W3C‏ (World Wide Web Consortium) מעביר את שירותי הארחה
>                 באירופה ל -
>                 ERCIM.
>                 This too is actually a great place to use <bdi>:
>                 ה-<bdi dir="ltr">W3C</bdi> (<bdi dir="ltr">World Wide Web
>                 Consortium</bdi>) מעביר את שירותי הארחה באירופה ל-<bdi
>                 dir="ltr">ERCIM</bdi>.
>                 Once again, you don't actually need the dir="ltr" on any
>                 of these, and
>                 just the first or second <bdi> will be sufficient alone
>                 to fix the
>                 problem, but in principle the safe way to write this
>                 sentence is as
>                 above.
>                 I think that the <bdi> solution - once it is available
>                 in browsers - is
>                 preferable to using &rlm;, because it makes intuitive
>                 sense. You simply
>                 mark the embedded opposite-direction phrases, each one
>                 on its own. Until
>                 someone actually understands the UBA - which very few
>                 people do - using
>                 LRM and RLM seems like voodoo. Few people know when they
>                 should use LRM
>                 and when they should use RLM, and where exactly they
>                 should put it.
>                 IMO, the same applies to all the other examples in this
>                 section. The
>                 best way to deal with them, when it becomes available,
>                 is <bdi dir=ltr>
>                 (or just <bdi>, because of dir=auto, but we don't have
>                 to mention that
>                 yet), not an LRM, and not <span dir=ltr>.
>                 In "Handling unknown text", if you are looking for a
>                 real RTL book title
>                 that contains some LTR word(s), but does not begin with
>                 them (so that
>                 dir=auto will work well with it), there is
>                 <>:
>                 מבוא לתכנות בסביבת אינטרנט - מבוא ו- HTML
>                 Please note that the Google Books page has a bug: the
>                 title as displayed
>                 at the top of the page is always in the direction of the
>                 UI. However,
>                 the title displayed near the bottom of the page, after
>                 "Title:" is
>                 displayed using the word-count direction estimation
>                 algorithm. It gets
>                 this book title right.
>                 Furthermore, please note that when I used Google Books'
>                 Advanced Search
>                 to look for Hebrew-language books containing one of the
>                 words HTML, CSS,
>                 and JavaScript, the majority of the book titles I found
>                 /began /with the
>                 LTR word, so dir=auto's first-strong algorithm does not
>                 work well on
>                 them. I had tried to push through word-count for
>                 dir=auto, but failed to
>                 convince people. Examples:
>                 <>
>                 <>
>                 <>
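To make the contrast concrete, here is a rough Python sketch comparing a first-strong estimate with a word-count estimate. The word-count rule shown (classify each whitespace-separated word by its first strong character, then take the majority) is only a guess at what such an algorithm might look like, not a description of what Google Books does. The function names and the tie-break default are also assumptions.

```python
import unicodedata


def first_strong(text):
    """Return "ltr", "rtl", or None if text has no strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def word_count_direction(text, default="ltr"):
    """Estimate direction by majority vote over whitespace-separated words.

    Assumed rule, for illustration only: each word votes with its own
    first strong character; words with no strong character do not vote.
    """
    counts = {"ltr": 0, "rtl": 0}
    for word in text.split():
        direction = first_strong(word)
        if direction is not None:
            counts[direction] += 1
    if counts["rtl"] > counts["ltr"]:
        return "rtl"
    if counts["ltr"] > counts["rtl"]:
        return "ltr"
    return default  # tie or no strong words: fall back to the assumed default
```

On a string that begins with a single Latin word followed by several Hebrew words, first_strong gives "ltr" while word_count_direction gives "rtl".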
>                 For this reason, I think it is worthwhile to tone down
>                 the statement
>                 that "There are some rare corner cases where this may
>                 not give the
>                 desired outcome, but in the majority of cases it should
>                 produce the
>                 expected result." I would take out the words "some
>                 rare", and you could
>                 also add on "particularly when the embedded text does
>                 not mix LTR and
>                 RTL words and the problem is limited to things like trailing
>                 punctuation, leading numbers, and phone numbers."
>                 On Thu, Oct 13, 2011 at 8:09 PM, Richard Ishida wrote:
>                 On 19/09/2011 16:04, [Mati] wrote:
>                 <>
>                 11) In section "Using dir="auto" with the input
>                 element", the first
>                  > Hebrew word of the example is not known to me and is
>                 probably a
>                 typo. I don't even guess what was the intended word.
>                 On 20/09/2011 09:38, [Mati] wrote:
>                 <>
>                 DON'T show email on public list.
>                 Name: Matitiahu Allouche
>                 <>
>                 Comments:
>                 This is the continuation of comments that I sent in a
>                 previous
>                 submission.
>                 18) In section "Second use case", the first Hebrew word
>                 of the
>                 book title differs between its mention in the body of
>                 the text
>                 and its mention in the message. The form in the message
>                 is the
>                 correct one.
>                 I think I was trying to use the title of the article at
>                 <>
>                 (though why that's different, I'm not sure). But at the
>                 time I only
>                 grabbed that quickly because i was in a hurry.
>                 Would you or Aharon be able to provide me with a real
>                 book title
>                 that has similar properties? (ie. ending with CSS or
>                 some such).
>                 (Maybe one of these?
>                 <>)
>                 Cheers,
>                 RI
>                 --
>                 Richard Ishida
>                 Internationalization Activity Lead
>                 W3C (World Wide Web Consortium)
>                 <>
>     --
>     Richard Ishida
>     Internationalization Activity Lead
>     W3C (World Wide Web Consortium)
> <>

Richard Ishida
Internationalization Activity Lead
W3C (World Wide Web Consortium)

Received on Thursday, 2 February 2012 16:05:13 UTC