RE: Bidi Markup vs Unicode control characters from Jony Rosenne on 2005-08-02 (www-international@w3.org from July to September 2005)

From: Jony Rosenne <rosennej@qsm.co.il>
Date: Tue, 2 Aug 2005 20:42:47 +0200
To: "'WWW International'" <www-international@w3.org>
Message-ID: <003a01c59791$fb679990$0b00000a@QSM7>

1) For block-level elements, Unicode control characters do not provide an
answer. The Unicode bidi algorithm specifies a workaround for when a
higher-level protocol does not provide the base directionality, and this
workaround is quite often unsatisfactory.

2) For inline elements, such as your example, markup and control characters
are equivalent, and the choice is a matter of taste. See the note at the end
of section 8.2.3. I prefer markup because it is visible.

Jony

> -----Original Message-----
> From: www-international-request@w3.org 
> [mailto:www-international-request@w3.org] On Behalf Of Tex Texin
> Sent: Tuesday, August 02, 2005 1:53 PM
> To: WWW International
> Subject: Bidi Markup vs Unicode control characters
> 
> 
> 
> This has been bothering me for a while and I would like to see if anyone
> has a better answer as to why we recommend markup over bidi controls.
> 
> The recommendation in HTML 4.0, and also in the joint recommendation of
> the W3C and Unicode Consortium on Unicode in XML and Markup Languages, is
> that the bidirectional markup is to be preferred over the Unicode
> control characters.
> 
> The argument is made in
> http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.2
> and referenced by http://www.unicode.org/reports/tr20/ (section 2+ and
> 3.1).
> 
> In particular, HTML 4.0 says:
> =============
> Although Unicode specifies special characters that deal with text
> direction, HTML offers higher-level markup constructs that do the same
> thing: the dir attribute (do not confuse with the DIR element) and the
> BDO element. Thus, to express a Hebrew quotation, it is more intuitive
> to write
> 
> <Q lang="he" dir="rtl">...a Hebrew quotation...</Q>
> 
> than the equivalent with Unicode references:
> 
> &#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;
> =================
> 
> Now several years ago, I agreed with this. However, several years ago,
> most editors had not implemented the Unicode bidi algorithm and did not
> display bidirectional plain text properly. So we were (at least I was)
> doing a lot of hand editing and it was not WYSIWYG.
> 
> Today the situation is very different. Many editors implement and
> support the Unicode bidirectional algorithm and the associated control
> codes.
> 
> So today, we are not faced with markup vs. NCRs. In fact, as I edit
> Hebrew or Arabic text, I now prefer to use the Unicode control codes,
> because then the plain text is WYSIWYG and I can see how the result will
> appear.
> If I instead use markup controls, when I look at the source of my HTML
> or XML, it is not WYSIWYG and very difficult to make appropriate
> bidirectional edits.
> 
> Using markup instead of the control characters expands the size of the
> file.
> 
> It also now seems to run against the grain of our other I18n
> recommendations, for example to use character encodings that support all
> of the characters used in Web documents or applications, so that NCRs
> are not needed and readability is enhanced.
> 
> Many Web pages are not static and are composed from dynamic elements
> including databases, localization systems and templates, etc. The
> components of these systems are often used in multiple ways, sometimes
> with markup and sometimes with plaintext and other environments. This
> necessitates careful policies and extra conversions between character
> and markup choices to satisfy the recommendation.
> 
> In all of these situations, it makes more sense to me to use the Unicode
> bidi control codes, and not use markup.
> For other kinds of controls, where markup offers additional
> capabilities, readability, etc., the recommendations of TR20 make sense.
> But for bidi, there is one-to-one equivalency, and no advantage, and
> even some disadvantages (size, loss of WYSIWYG source, etc.).
> 
> There are also of course many places in HTML where you would like to use
> bidirectional text, but cannot use markup and are forced to use control
> codes (e.g. attributes).
> 
> I therefore cannot support the recommendation to favor markup over the
> bidi control characters. It is frankly more expedient to use the control
> codes, have a single approach for text whether it is in markup,
> attributes, plaintext, etc., and get smaller files and less processing
> (conversions to markup and parsing of markup).
> 
> I would like to hear arguments to the contrary. Actually, I would like
> to suggest we consider changing the recommendations so that the bidi
> control codes are favored. It seems to me they are so inherent to text
> processing that they probably belong as control characters and not in
> markup at all. (Now that text processors support them.)
> 
> How do other folks working with bidi see this?
> 
> tex
> 
> 
> 
>
Received on Tuesday, 2 August 2005 17:43:59 UTC