
Re: Bidi Markup vs Unicode control characters

From: Tex Texin <tex@xencraft.com>
Date: Tue, 02 Aug 2005 14:31:03 -0700
Message-ID: <42EFE617.208B6E3@xencraft.com>
To: Richard Ishida <ishida@w3.org>
CC: 'WWW International' <www-international@w3.org>, Jony Rosenne <rosennej@qsm.co.il>

Hi,

You and Jony both mentioned visibility of the controls. I would expect a
text editor to be able to turn indicators for them on and off, the same
way indicators for white space (tabs, paragraph marks, etc.) can be made
visible.
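
As a rough illustration of such an indicator, here is a minimal Python
sketch. The function name show_bidi_controls and the bracketed labels are
hypothetical, chosen only for this example, and the table covers just the
seven classic bidi controls (LRM, RLM, LRE, RLE, PDF, LRO, RLO). A real
editor would presumably do this at display time rather than by rewriting
the text.

```python
# Hypothetical sketch: make invisible bidi controls visible as bracketed labels.
# The labels and the function name are assumptions made for illustration.
BIDI_CONTROLS = {
    "\u200E": "[LRM]",  # LEFT-TO-RIGHT MARK
    "\u200F": "[RLM]",  # RIGHT-TO-LEFT MARK
    "\u202A": "[LRE]",  # LEFT-TO-RIGHT EMBEDDING
    "\u202B": "[RLE]",  # RIGHT-TO-LEFT EMBEDDING
    "\u202C": "[PDF]",  # POP DIRECTIONAL FORMATTING
    "\u202D": "[LRO]",  # LEFT-TO-RIGHT OVERRIDE
    "\u202E": "[RLO]",  # RIGHT-TO-LEFT OVERRIDE
}


def show_bidi_controls(text: str) -> str:
    """Return text with each bidi control replaced by a visible label."""
    return "".join(BIDI_CONTROLS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # An RLE ... PDF pair wrapped around a placeholder run of text.
    print(show_bidi_controls("\u202Bquotation\u202C"))
```

Toggling the indicator off is then simply a matter of showing the
original string instead of the labelled one.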

I would also like to comment that for the HTML, TABLE and other elements
of HTML, I do see the need for the DIR attribute. I am not trying to have
the bidi markup deprecated. I am more concerned with straight runs of
text embedded in markup, and I don't see why I should give up on WYSIWYG
editing of that to satisfy the recommendation.

Perhaps the recommendation should be more balanced and indicate a place
for both approaches, rather than stipulating one or the other.

The points about inheritance and Frank's remarks on unbalanced controls
should also be taken into account.
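
To make the concern about unbalanced controls concrete, here is a minimal
Python sketch of a check that a tool could run on one paragraph of text.
The name check_bidi_balance and its return shape are hypothetical. It
counts embedding/override openers (LRE, RLE, LRO, RLO) that are never
closed by a PDF, and PDFs that have no matching opener. It does not
implement the bidi algorithm itself and ignores depth limits.

```python
# Hypothetical sketch: detect unbalanced bidi embedding/override controls.
OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def check_bidi_balance(paragraph: str) -> tuple[int, int]:
    """Return (unclosed_openers, stray_pdfs) for a single paragraph."""
    depth = 0  # openers currently awaiting a PDF
    stray = 0  # PDFs seen with nothing open
    for ch in paragraph:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            if depth > 0:
                depth -= 1
            else:
                stray += 1
    return depth, stray


if __name__ == "__main__":
    # RLE without a closing PDF: one unclosed opener, no stray PDFs.
    print(check_bidi_balance("\u202Bunterminated run"))
```

A result of (0, 0) means the controls in that paragraph are paired; any
other result points at text worth inspecting, for example after strings
from different sources have been concatenated.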

I'll have to look up the details of the RLM/LRM recommendation.

We should also rope into this discussion the use of the CSS bidi
facilities. Last I looked, the bidi CSS properties were out of favor, but
they work very well for me, and I think it's fine to tie markup to
language (when I am not using control codes! ;-) )

Many of the criteria depend on the tools you have available and their
support for bidi, display of the controls, etc.
We might try to identify relevant bidi editors.

tex

Richard Ishida wrote:
> 
> A couple of potential points that spring to mind for now until I have time
> to think this through in more detail:
> 
> [1] We should clarify that the issue here is for the constructs that set
> embedding levels - not the RLM and LRM, where the use of Unicode rather than
> markup is recommended (although these may sometimes be escaped, for
> visibility).
> 
> [2] Will people be able to easily manage the scope and presence/absence of
> such invisible characters?
> 
> [3] How do you manage inheritance elegantly using this approach?  There are
> two issues here: a) whether you can replace, say, <html dir="rtl"...> or
> <table dir="rlo"...> effectively with Unicode control characters, and b)
> whether people can tell where these things are easily enough to manage them
> effectively.
> 
> RI
> 
> ============
> Richard Ishida
> W3C
> 
> contact info:
> http://www.w3.org/People/Ishida/
> 
> W3C Internationalization:
> http://www.w3.org/International/
> 
> Publication blog:
> http://people.w3.org/rishida/blog/
> 
> 
> 
> > -----Original Message-----
> > From: www-international-request@w3.org
> > [mailto:www-international-request@w3.org] On Behalf Of Tex Texin
> > Sent: 02 August 2005 12:53
> > To: WWW International
> > Subject: Bidi Markup vs Unicode control characters
> >
> >
> > This has been bothering me for a while and I would like to
> > see if anyone has a better answer as to why we recommend
> > markup over bidi controls.
> >
> > The recommendation in HTML 4.0, and also in the joint
> > recommendation of the W3C and Unicode Consortium on Unicode
> > in XML and Markup Languages is that the bidirectional markup
> > is to be preferred over the Unicode control characters.
> >
> > The argument is made in
> > http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.2
> > and referenced by http://www.unicode.org/reports/tr20/
> > (section 2+ and 3.1).
> >
> > In particular, HTML 4.0 says:
> > =============
> > Although Unicode specifies special characters that deal with
> > text direction, HTML offers higher-level markup constructs
> > that do the same
> > thing: the dir attribute (do not confuse with the DIR
> > element) and the BDO element. Thus, to express a Hebrew
> > quotation, it is more intuitive to write
> >
> > <Q lang="he" dir="rtl">...a Hebrew quotation...</Q>
> >
> > than the equivalent with Unicode references:
> >
> > &#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;
> > =================
> >
> > Now several years ago, I agreed with this. However, several
> > years ago, most editors had not implemented the Unicode bidi
> > algorithm and did not display bidirectional plain text
> > properly. So we were (at least I was) doing a lot of hand
> > editing and it was not WYSIWYG.
> >
> > Today the situation is very different. Many editors implement
> > and support the Unicode bidirectional algorithm and the
> > associated control codes.
> >
> > So today, we are not faced with markup vs. NCRs. In fact, as
> > I edit Hebrew or Arabic text, I now prefer to use the Unicode
> > control codes, because then the plain text is WYSIWYG and I
> > can see how the result will appear.
> > If I instead use markup controls, when I look at the source
> > of my HTML or XML, it is not WYSIWYG and very difficult to
> > make appropriate bidirectional edits.
> >
> > Using markup instead of the control characters expands the
> > size of the file.
> >
> > It also now seems to run against the grain of our other I18n
> > recommendations, for example to use character encodings that
> > support all of the characters used in Web documents or
> > applications, so that NCRs are not needed and readability is enhanced.
> >
> > Many Web pages are not static and are composed from dynamic
> > elements including databases, localization systems and
> > templates, etc. The components of these systems are often
> > used in multiple ways, sometimes with markup and sometimes
> > with plaintext and other environments. This necessitates
> > careful policies and extra conversions between character and
> > markup choices to satisfy the recommendation.
> >
> > In all of these situations, it makes more sense to me to use
> > the Unicode bidi control codes, and not use markup.
> > For other kinds of controls, where markup offers additional
> > capabilities, readability, etc., the recommendations of TR20
> > make sense.
> > But for bidi, there is one to one equivalency, and no
> > advantage, and even some disadvantages (size, loss of WYSIWYG
> > source, etc.)
> >
> > There are also of course many places in HTML where you would
> > like to use bidirectional text, but cannot use markup and are
> > forced to use control codes. (e.g. attributes)
> >
> > I therefore cannot support the recommendation to favor markup
> > over the bidi control characters. It is frankly more
> > expedient to use the control codes, have a single approach
> > for text whether it is in markup, attributes, plaintext, etc.
> > and smaller files, and less processing (conversions to markup
> > and parsing of markup).
> >
> > I would like to hear arguments to the contrary. Actually, I
> > would like to suggest we consider changing the
> > recommendations so that the bidi control codes are favored.
> > It seems to me they are so inherent to text processing that
> > they probably belong as control characters and not in markup
> > at all. (Now that text processors support them.)
> >
> > How do other folks working with bidi see this?
> >
> > tex
> >

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Tuesday, 2 August 2005 21:40:20 GMT
