RE: Bidi Markup vs Unicode control characters from Richard Ishida on 2005-08-02 (www-international@w3.org from July to September 2005)

From: Richard Ishida <ishida@w3.org>
Date: Tue, 2 Aug 2005 16:23:52 +0100
To: "'Tex Texin'" <tex@xencraft.com>, "'WWW International'" <www-international@w3.org>
Message-Id: <20050802152350.87B274F04C@homer.w3.org>

A couple of potential points that spring to mind for now until I have time
to think this through in more detail:

[1] We should clarify that the issue here is for the constructs that set
embedding levels - not the RLM and LRM, where the use of Unicode rather than
markup is recommended (although these may sometimes be escaped, for
visibility).

[2] Will people be able to manage the scope and presence/absence of
such invisible characters easily?

[3] How do you manage inheritance elegantly using this approach?  There are
two issues here: a) whether you can replace, say, <html dir="rtl"...> or
<table dir="rtl"...> effectively with Unicode control characters, and b)
whether people can tell where these things are easily enough to manage them
effectively.
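
To make points [1] to [3] a little more concrete, here is a rough Python
sketch. The helper names (reveal, check_balance, wrap_blocks) are made up
for illustration and are not from any library. It keeps the embedding and
override controls apart from LRM and RLM, shows the invisible characters as
visible labels, counts unclosed or stray controls, and wraps each block
separately to approximate an inherited dir="rtl". Two assumptions to note:
the balance check treats the whole string as one paragraph, and wrapping
text in RLE...PDF does not set the base paragraph direction the way the dir
attribute does, so the last helper is only an approximation.

```python
# Hypothetical helpers for illustration only; not from any library.

# [1] Controls that set embedding levels, kept apart from LRM/RLM.
EMBEDDING_CONTROLS = {
    "\u202A": "LRE",  # LEFT-TO-RIGHT EMBEDDING
    "\u202B": "RLE",  # RIGHT-TO-LEFT EMBEDDING
    "\u202C": "PDF",  # POP DIRECTIONAL FORMATTING
    "\u202D": "LRO",  # LEFT-TO-RIGHT OVERRIDE
    "\u202E": "RLO",  # RIGHT-TO-LEFT OVERRIDE
}
DIRECTIONAL_MARKS = {
    "\u200E": "LRM",  # LEFT-TO-RIGHT MARK
    "\u200F": "RLM",  # RIGHT-TO-LEFT MARK
}
OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}
PDF = "\u202C"
RLE = "\u202B"
LRE = "\u202A"


def reveal(text):
    """[2] Replace each invisible bidi character with a visible label."""
    labels = {**EMBEDDING_CONTROLS, **DIRECTIONAL_MARKS}
    return "".join("[%s]" % labels[ch] if ch in labels else ch for ch in text)


def check_balance(text):
    """[2] Return (unclosed_openers, stray_pdfs) for one paragraph of text."""
    depth = 0
    stray = 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            if depth > 0:
                depth -= 1
            else:
                stray += 1
    return depth, stray


def wrap_blocks(blocks, direction="rtl"):
    """[3] Wrap every block on its own, because an embedding ends at the
    paragraph boundary and so cannot be inherited from an outer container."""
    opener = RLE if direction == "rtl" else LRE
    return [opener + block + PDF for block in blocks]


if __name__ == "__main__":
    sample = "\u202Babc\u202C\u200F"
    print(reveal(sample))
    print(check_balance("\u202Babc"))
    print([reveal(b) for b in wrap_blocks(["first", "second"])])
```

Even in this small form, each block needs its own pair of controls, whereas
a single dir attribute on a parent element covers all of its children.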

RI


============
Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 

Publication blog:
http://people.w3.org/rishida/blog/
 
 

> -----Original Message-----
> From: www-international-request@w3.org 
> [mailto:www-international-request@w3.org] On Behalf Of Tex Texin
> Sent: 02 August 2005 12:53
> To: WWW International
> Subject: Bidi Markup vs Unicode control characters
> 
> 
> This has been bothering me for a while and I would like to 
> see if anyone has a better answer as to why we recommend 
> markup over bidi controls.
> 
> The recommendation in HTML 4.0, and also in the joint 
> recommendation of the W3C and Unicode Consortium on Unicode 
> in XML and Markup Languages is that the bidirectional markup 
> is to be preferred over the Unicode control characters.
> 
> The argument is made in
> http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.2
> and referenced by http://www.unicode.org/reports/tr20/ 
> (section 2+ and 3.1).
> 
> In particular, HTML 4.0 says:
> =============
> Although Unicode specifies special characters that deal with 
> text direction, HTML offers higher-level markup constructs 
> that do the same
> thing: the dir attribute (do not confuse with the DIR 
> element) and the BDO element. Thus, to express a Hebrew 
> quotation, it is more intuitive to write
> 
> <Q lang="he" dir="rtl">...a Hebrew quotation...</Q>
> 
> than the equivalent with Unicode references:
> 
> &#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C; 
> =================
> 
> Now several years ago, I agreed with this. However, several 
> years ago, most editors had not implemented the Unicode bidi 
> algorithm and did not display bidirectional plain text 
> properly. So we were (at least I was) doing a lot of hand 
> editing and it was not WYSIWYG.
> 
> Today the situation is very different. Many editors implement 
> and support the Unicode bidirectional algorithm and the 
> associated control codes.
> 
> So today, we are not faced with markup vs. NCRs. In fact, as 
> I edit Hebrew or Arabic text, I now prefer to use the Unicode 
> control codes, because then the plain text is WYSIWYG and I 
> can see how the result will appear.
> If I instead use markup controls, when I look at the source 
> of my HTML or XML, it is not WYSIWYG and it is very difficult 
> to make appropriate bidirectional edits.
> 
> Using markup instead of the control characters expands the 
> size of the file.
> 
> It also now seems to run against the grain of our other I18n 
> recommendations, for example to use character encodings that 
> support all of the characters used in Web documents or 
> applications, so that NCRs are not needed and readability is enhanced.
> 
> Many Web pages are not static and are composed from dynamic 
> elements including databases, localization systems and 
> templates, etc. The components of these systems are often 
> used in multiple ways, sometimes with markup and sometimes 
> with plaintext and other environments. This necessitates 
> careful policies and extra conversions between character and 
> markup choices to satisfy the recommendation.
> 
> In all of these situations, it makes more sense to me to use 
> the Unicode bidi control codes, and not use markup.
> For other kinds of controls, where markup offers additional 
> capabilities, readability, etc., 
> the recommendations of TR20 make sense.
> But for bidi, there is one-to-one equivalence, and no 
> advantage, and even some disadvantages (size, loss of WYSIWYG 
> source, etc.)
> 
> There are also of course many places in HTML where you would 
> like to use bidirectional text, but cannot use markup and are 
> forced to use control codes. (e.g. attributes)
> 
> I therefore cannot support the recommendation to favor markup 
> over the bidi control characters. It is frankly more 
> expedient to use the control codes: this gives a single approach 
> for text whether it is in markup, attributes, plaintext, etc., 
> smaller files, and less processing (conversions to markup 
> and parsing of markup).
> 
> I would like to hear arguments to the contrary. Actually, I 
> would like to suggest we consider changing the 
> recommendations so that the bidi control codes are favored. 
> It seems to me they are so inherent to text processing that 
> they probably belong as control characters and not in markup 
> at all. (Now that text processors support them.)
> 
> How do other folks working with bidi see this?
> 
> tex
>
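
On the file-size point in the quoted message, the following Python sketch
compares the UTF-8 size of the three spellings of the HTML 4.0 example:
markup, raw control characters, and NCRs. The Hebrew string is a placeholder
chosen for illustration (it is not from the thread), and the helper name
utf8_size is made up. The markup form also carries lang information and
leaves the quotation marks to the Q element, so the three forms are not
strictly equivalent; treat this as an indication rather than a measurement.

```python
RLE = "\u202B"        # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"        # POP DIRECTIONAL FORMATTING
GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM, used as a quotation mark

# Placeholder quotation (assumption for illustration): four Hebrew letters.
quotation = "\u05E9\u05DC\u05D5\u05DD"

as_markup = '<Q lang="he" dir="rtl">' + quotation + "</Q>"
as_controls = RLE + GERSHAYIM + quotation + GERSHAYIM + PDF
as_ncrs = "&#x202B;&#x05F4;" + quotation + "&#x05F4;&#x202C;"


def utf8_size(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    for label, text in [("markup", as_markup),
                        ("controls", as_controls),
                        ("NCRs", as_ncrs)]:
        print(label, utf8_size(text))
```

The fixed overhead per quotation is what differs; the relative saving
shrinks as the quoted text gets longer.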
Received on Tuesday, 2 August 2005 15:23:59 UTC