Re: Bidi Markup vs Unicode control characters

From: Frank Yung-Fong Tang <ytang0648@aol.com>
Date: Tue, 2 Aug 2005 10:39:03 -0400
To: "Tex Texin" <tex@xencraft.com>
cc: "WWW International" <www-international@w3.org>
Message-ID: <42EF8587.1010307@aol.com>

Thanks to Tex for bringing this up. I would like to point out another 
change from several years ago:

In Unicode 2.0 (1996), the use of ISS (U+206A) and ASS (U+206B) affects 
the display and interpretation of '<' and '>'.

ISS and ASS are deprecated in Unicode 3.0 (2000) and Unicode 4.0 (2003).

I still remember the fun days when I had to play with ISS/ASS in 
Apple's TEC code. It is not trivial. And they certainly do not coexist 
well with '<' and '>' in markup languages.

Another thing we should consider is how to ensure the integrity of the 
document. For example, if the document uses control codes instead of 
markup, what should the browser do if the text value of an attribute 
contains an RLE but never a PDF? Say

text &#x202B;&#x05F4;...a Hebrew quotation...&#x05F4; text

instead of

text &#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C; text

By using
<Q lang="he" dir="rtl">...a Hebrew quotation...</Q>

this kind of 'missing PDF' will never happen.
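
To make the integrity problem concrete, here is a minimal sketch of a 
check for unterminated embeddings. It is only an illustration, not 
something any browser is known to do: the function names 
(unclosed_embeddings, close_embeddings) are hypothetical, Python is an 
arbitrary choice, and it only counts LRE/RLE/LRO/RLO against PDF without 
implementing the bidi algorithm itself.

```python
import html

# Explicit embedding/override controls and the PDF that terminates them.
LRE, RLE, LRO, RLO, PDF = "\u202A", "\u202B", "\u202D", "\u202E", "\u202C"
OPENERS = {LRE, RLE, LRO, RLO}


def unclosed_embeddings(text: str) -> int:
    """Return how many embedding/override controls have no matching PDF."""
    depth = 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF and depth > 0:
            depth -= 1  # a stray PDF with nothing open is simply ignored
    return depth


def close_embeddings(text: str) -> str:
    """Append the PDFs needed to terminate every open embedding."""
    return text + PDF * unclosed_embeddings(text)


# The two samples from this message, with the NCRs decoded to characters.
broken = html.unescape(
    "text &#x202B;&#x05F4;...a Hebrew quotation...&#x05F4; text")
intact = html.unescape(
    "text &#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C; text")

print(unclosed_embeddings(broken))  # RLE without PDF
print(unclosed_embeddings(intact))  # RLE closed by PDF
```

Note that close_embeddings can only append the PDF at the end of the 
string; it cannot know where the author meant the quotation to end, 
which is exactly the information the closing </Q> tag carries.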

But in the meantime, we still need to use NCRs inside attribute 
values anyway, right?

Tex Texin wrote on 8/2/2005, 7:53 AM:

 > This has been bothering me for a while and I would like to see if anyone
 > has a better answer as to why we recommend markup over bidi controls.
 > The recommendation in HTML 4.0, and also in the joint recommendation of
 > the W3C and Unicode Consortium on Unicode in XML and Markup Languages is
 > that the bidirectional markup is to be preferred over the Unicode
 > control characters.
 > The argument is made in
 > http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.2
 > and referenced by http://www.unicode.org/reports/tr20/ (section 2+ and
 > 3.1).
 > In particular, HTML 4.0 says:
 > =============
 > Although Unicode specifies special characters that deal with text
 > direction, HTML offers higher-level markup constructs that do the same
 > thing: the dir attribute (do not confuse with the DIR element) and the
 > BDO element. Thus, to express a Hebrew quotation, it is more intuitive
 > to write
 > <Q lang="he" dir="rtl">...a Hebrew quotation...</Q>
 > than the equivalent with Unicode references:
 > &#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;
 > =================
 > Now several years ago, I agreed with this. However, several years ago,
 > most editors had not implemented the Unicode bidi algorithm and did not
 > display bidirectional plain text properly. So we were (at least I was)
 > doing a lot of hand editing and it was not WYSIWYG.
 > Today the situation is very different. Many editors implement and
 > support the Unicode bidirectional algorithm and the associated control
 > codes.
 > So today, we are not faced with markup vs. NCRs. In fact, as I edit
 > Hebrew or Arabic text, I now prefer to use the Unicode control codes,
 > because then the plain text is WYSIWYG and I can see how the result will
 > appear.
 > If I instead use markup controls, when I look at the source of my HTML
 > or XML, it is not WYSIWYG and very difficult to make appropriate
 > bidirectional edits.
 > Using markup instead of the control characters expands the size of the
 > file.
 > It also now seems to run against the grain of our other I18n
 > recommendations, for example to use character encodings that support all
 > of the characters used in Web documents or applications, so that NCRs
 > are not needed and to enhance readability.
 > Many Web pages are not static and are composed from dynamic elements
 > including databases, localization systems and templates, etc. The
 > components of these systems are often used in multiple ways, sometimes
 > with markup and sometimes with plaintext and other environments. This
 > necessitates careful policies and extra conversions between character
 > and markup choices to satisfy the recommendation.
 > In all of these situations, it makes more sense to me to use the Unicode
 > bidi control codes, and not use markup.
 > For other kinds of controls, where markup offers additional
 > capabilities, readability, etc.,
 > the recommendations of TR20 make sense.
 > But for bidi, there is one to one equivalency, and no advantage, and
 > even some disadvantages (size, loss of WYSIWYG source, etc.)
 > There are also of course many places in HTML where you would like to use
 > bidirectional text, but cannot use markup and are forced to use control
 > codes. (e.g. attributes)
 > I therefore cannot support the recommendation to favor markup over the
 > bidi control characters. It is frankly more expedient to use the control
 > codes, have a single approach for text whether it is in markup,
 > attributes, plaintext, etc. and smaller files, and less processing
 > (conversions to markup and parsing of markup).
 > I would like to hear arguments to the contrary. Actually, I would like
 > to suggest we consider changing the recommendations so that the bidi
 > control codes are favored. It seems to me they are so inherent to text
 > processing that they probably belong as control characters and not in
 > markup at all. (Now that text processors support them.)
 > How do other folks working with bidi see this?
 > tex
Received on Tuesday, 2 August 2005 14:39:44 GMT
