- From: Tex Texin <tex@xencraft.com>
- Date: Tue, 02 Aug 2005 04:53:14 -0700
- To: WWW International <www-international@w3.org>
This has been bothering me for a while, and I would like to see if anyone has a better answer as to why we recommend markup over bidi controls.

The recommendation in HTML 4.0, and also in the joint recommendation of the W3C and the Unicode Consortium on Unicode in XML and Other Markup Languages, is that bidirectional markup is to be preferred over the Unicode control characters. The argument is made in http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.2 and referenced by http://www.unicode.org/reports/tr20/ (section 2+ and 3.1).

In particular, HTML 4.0 says:

=============
Although Unicode specifies special characters that deal with text direction, HTML offers higher-level markup constructs that do the same thing: the dir attribute (do not confuse with the DIR element) and the BDO element. Thus, to express a Hebrew quotation, it is more intuitive to write

<Q lang="he" dir="rtl">...a Hebrew quotation...</Q>

than the equivalent with Unicode references:

&#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;
=================

Several years ago, I agreed with this. However, several years ago most editors had not implemented the Unicode bidi algorithm and did not display bidirectional plain text properly. So we were (at least I was) doing a lot of hand editing, and it was not WYSIWYG.

Today the situation is very different. Many editors implement and support the Unicode bidirectional algorithm and the associated control codes. So today we are not faced with markup vs. NCRs (numeric character references): the control characters can be entered and stored directly.

In fact, as I edit Hebrew or Arabic text, I now prefer to use the Unicode control codes, because then the plain text is WYSIWYG and I can see how the result will appear. If I instead use markup controls, when I look at the source of my HTML or XML it is not WYSIWYG, and it is very difficult to make appropriate bidirectional edits.

Using markup instead of the control characters also expands the size of the file.
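
As a rough illustration of the size point, here is a minimal Python sketch. It assumes UTF-8 encoding and takes a SPAN element with dir="rtl" as the markup counterpart of an RLE ... PDF pair. The helper names and the sample word are made up for this example, and the exact numbers will differ with other encodings or elements.

```python
# Unicode bidi embedding controls (character names per the Unicode Standard).
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def with_controls(text: str) -> str:
    """Wrap text in an RLE ... PDF pair (hypothetical helper)."""
    return f"{RLE}{text}{PDF}"


def with_markup(text: str) -> str:
    """Wrap text in inline markup carrying dir="rtl" (hypothetical helper)."""
    return f'<span dir="rtl">{text}</span>'


# Sample word chosen for illustration: four Hebrew letters (shin, lamed, vav, final mem).
sample = "\u05E9\u05DC\u05D5\u05DD"

controls = with_controls(sample)
markup = with_markup(sample)

# Compare the encoded size of the two forms, assuming UTF-8 on the wire.
controls_bytes = len(controls.encode("utf-8"))
markup_bytes = len(markup.encode("utf-8"))

print(f"control codes: {controls_bytes} bytes")
print(f"markup:        {markup_bytes} bytes")
```

In UTF-8 each control character costs three bytes, so the pair adds six bytes per embedded run, while the SPAN wrapper adds 23. The same `with_controls` output could also be placed in an attribute value or a plain-text field, where the markup form is not available.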

The recommendation also now seems to run against the grain of our other I18n recommendations, for example to use character encodings that support all of the characters used in Web documents or applications, so that NCRs are not needed and readability is enhanced.

Many Web pages are not static and are composed from dynamic elements, including databases, localization systems, templates, etc. The components of these systems are often used in multiple ways, sometimes with markup and sometimes in plain text and other environments. Satisfying the recommendation therefore requires careful policies and extra conversions between the character and markup forms.

In all of these situations, it makes more sense to me to use the Unicode bidi control codes and not use markup.

For other kinds of controls, where markup offers additional capabilities, readability, etc., the recommendations of TR20 make sense. But for bidi there is one-to-one equivalency and no advantage, and even some disadvantages (size, loss of WYSIWYG source, etc.).

There are also of course many places in HTML where you would like to use bidirectional text but cannot use markup, and so are forced to use control codes (e.g. attributes).

I therefore cannot support the recommendation to favor markup over the bidi control characters. It is frankly more expedient to use the control codes: you get a single approach for text whether it is in markup, attributes, plain text, etc., along with smaller files and less processing (conversions to markup and parsing of markup).

I would like to hear arguments to the contrary. Actually, I would like to suggest we consider changing the recommendations so that the bidi control codes are favored. It seems to me they are so inherent to text processing that they probably belong as control characters and not in markup at all (now that text processors support them).

How do other folks working with bidi see this?

tex
Received on Tuesday, 2 August 2005 11:54:29 UTC