Bidi Markup vs Unicode control characters

This has been bothering me for a while, and I would like to see if anyone
has a better answer as to why we recommend markup over bidi controls.

The recommendation in HTML 4.0, and also in the joint recommendation of
the W3C and Unicode Consortium on Unicode in XML and Markup Languages, is
that bidirectional markup is to be preferred over the Unicode control
characters.

The argument is made in
http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.2
and referenced by http://www.unicode.org/reports/tr20/ (section 2+ and
3.1).

In particular, HTML 4.0 says:
=============
Although Unicode specifies special characters that deal with text
direction, HTML offers higher-level markup constructs that do the same
thing: the dir attribute (do not confuse with the DIR element) and the
BDO element. Thus, to express a Hebrew quotation, it is more intuitive
to write

<Q lang="he" dir="rtl">...a Hebrew quotation...</Q>

than the equivalent with Unicode references:

&#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;
=================
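
To make the two forms in the quotation concrete, here is a minimal
Python sketch. Python and the variable names are chosen only for
illustration and are not part of either specification. It decodes the
numeric character references from the second form and names the
characters at each end.

```python
import html
import unicodedata

# The NCR form quoted from HTML 4.0 above.
ncr_form = "&#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;"

# html.unescape turns numeric character references into the characters.
decoded = html.unescape(ncr_form)

# Name the first two and the last two characters of the decoded string.
ends = (decoded[0], decoded[1], decoded[-2], decoded[-1])
names = [unicodedata.name(ch) for ch in ends]
for name in names:
    print(name)
```

The string opens with RIGHT-TO-LEFT EMBEDDING (RLE) and closes with POP
DIRECTIONAL FORMATTING (PDF), which is the pair that dir="rtl" stands in
for. The two GERSHAYIM characters are the quotation marks that the Q
element would otherwise supply.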

Several years ago, I agreed with this. At that time, however, most
editors had not implemented the Unicode bidi algorithm and did not
display bidirectional plain text properly. So we were (at least I was)
doing a lot of hand editing, and it was not WYSIWYG.

Today the situation is very different. Many editors implement and
support the Unicode bidirectional algorithm and the associated control
codes.

So today, we are not faced with markup vs. NCRs. In fact, as I edit
Hebrew or Arabic text, I now prefer to use the Unicode control codes,
because then the plain text is WYSIWYG and I can see how the result will
appear.
If I instead use markup controls, then when I look at the source of my
HTML or XML, it is not WYSIWYG, and it is very difficult to make
appropriate bidirectional edits.

Using markup instead of the control characters expands the size of the
file.
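
As a rough illustration of the size point, the following Python sketch
compares the UTF-8 overhead of an RLE/PDF pair with that of a wrapping
element. The sample word and the choice of a span element are
assumptions made for the example. A real page may only need a dir
attribute on an element that is already there, which costs less than a
new span.

```python
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Sample Hebrew word, for illustration only.
text = "\u05E9\u05DC\u05D5\u05DD"

with_controls = RLE + text + PDF
with_markup = '<span dir="rtl">' + text + "</span>"

def utf8_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

control_overhead = utf8_len(with_controls) - utf8_len(text)
markup_overhead = utf8_len(with_markup) - utf8_len(text)

print("control characters add", control_overhead, "bytes")
print("span markup adds", markup_overhead, "bytes")
```

The exact numbers depend on the encoding and on which element carries
the direction, so this is only an order-of-magnitude comparison.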

It also now seems to run against the grain of our other I18n
recommendations. For example, we recommend using character encodings
that support all of the characters used in Web documents or
applications, so that NCRs are not needed and readability is enhanced.

Many Web pages are not static and are composed from dynamic elements,
including databases, localization systems, templates, etc. The
components of these systems are often used in multiple ways, sometimes
in markup and sometimes in plain text or other environments. Satisfying
the recommendation therefore requires careful policies and extra
conversions between the control-character form and the markup form.
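
A minimal sketch of the kind of conversion step this implies, in Python.
The function names are hypothetical, and the code assumes the simplest
case only: balanced, non-nested span elements whose sole attribute is
dir, and no override characters (LRO/RLO). A real system would need an
actual parser.

```python
import re

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Matches only a simple <span dir="..."> with no other attributes.
_SPAN = re.compile(r'<span dir="(rtl|ltr)">(.*?)</span>', re.DOTALL)

def spans_to_controls(fragment: str) -> str:
    """Replace simple dir spans with embedding control characters."""
    def repl(match):
        opener = RLE if match.group(1) == "rtl" else LRE
        return opener + match.group(2) + PDF
    return _SPAN.sub(repl, fragment)

def controls_to_spans(text: str) -> str:
    """Inverse mapping; assumes every RLE/LRE has a matching PDF."""
    return (text.replace(RLE, '<span dir="rtl">')
                .replace(LRE, '<span dir="ltr">')
                .replace(PDF, "</span>"))

markup = 'Title: <span dir="rtl">\u05E9\u05DC\u05D5\u05DD</span>!'
plain = spans_to_controls(markup)
round_trip = controls_to_spans(plain)
print(round_trip == markup)
```

Every component that moves text between a markup context and a
plain-text context would need some step of this kind, in one direction
or the other.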

In all of these situations, it makes more sense to me to use the Unicode
bidi control codes, and not use markup.
For other kinds of controls, where markup offers additional
capabilities, readability, etc., the recommendations of TR20 make sense.
But for bidi, there is one-to-one equivalency and no advantage, and
there are even some disadvantages (size, loss of WYSIWYG source, etc.).

There are also, of course, many places in HTML where you would like to
use bidirectional text but cannot use markup, and so are forced to use
control codes (e.g. attributes).
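
For the attribute case, a minimal Python sketch: an attribute value
cannot contain elements, so the embedding has to be expressed with the
control characters themselves. The helper name, the img element, and the
file name are hypothetical and chosen only for the example.

```python
import html

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def rtl_attr_value(text: str) -> str:
    """Wrap text in RLE...PDF and escape it for a double-quoted attribute."""
    return html.escape(RLE + text + PDF, quote=True)

# Sample Hebrew word, for illustration only.
hebrew_alt = "\u05E9\u05DC\u05D5\u05DD"
tag = '<img src="logo.png" alt="' + rtl_attr_value(hebrew_alt) + '">'

# Print with escapes so the invisible control characters can be seen.
print(tag.encode("unicode_escape").decode("ascii"))
```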

I therefore cannot support the recommendation to favor markup over the
bidi control characters. Frankly, it is more expedient to use the
control codes: this gives a single approach for text whether it is in
markup, attributes, plaintext, etc., as well as smaller files and less
processing (conversions to markup and parsing of markup).

I would like to hear arguments to the contrary. Actually, I would like
to suggest we consider changing the recommendations so that the bidi
control codes are favored. It seems to me they are so inherent to text
processing that they probably belong as control characters and not in
markup at all. (Now that text processors support them.)

How do other folks working with bidi see this?

tex

Received on Tuesday, 2 August 2005 11:54:29 UTC