RE: An issue with the Unicode BiDi Algorithm

From: Richard Ishida <ishida@w3.org>
Date: Wed, 16 Feb 2005 11:15:16 -0000
To: <public-i18n-core@w3.org>
Message-Id: <20050216111516.36BDF4F1AA@homer.w3.org>

I saw that Addison had not sent anything out on this, so I thought I'd give
it a try. Here is my proposed text.  Any objections to me sending this to
Eyal, bidi@unicore and pointing bugzilla to it in our archive?

RI
===============================



Eyal,

<br> is intended for use as presentation-oriented markup, not structural
markup. The HTML 4.01 spec clearly describes it as equivalent to a line
separator (i.e. white space), as opposed to a paragraph delimiter[1]. Lines
are not semantically important in HTML. (You can think of the effect as
similar to what you'd expect from reducing the width of a window or box
containing text, but applied to a single line at a time.)

The way Internet Explorer handles it looks clever in the context of the
example being used, but actually produces incorrect results in other
situations.

Try the following code in a browser.

First, let's look at the original example from the bugzilla inclusion [for
the code snippets I replaced body with p, changed the Hebrew characters to
xxxx, and show the text between the markup in the order it appears *in
memory*]:
<p dir="rtl">
1. xxxxxx xxxx English.<br>
2. xxx.
</p>

This produces this in Mozilla:
        English. xxxx xxxxxx .1
                        .xxx .2

and this in IE:
        .English xxxx xxxxxx .1
                        .xxx .2

Now compare that to:

<p dir="rtl">
1. xxxxxx xxxx English,<br>
and more xxx.
</p>

In Mozilla you'll see:
        English, xxxx xxxxxx .1
                  .xxx and more
(which is correct)

and in IE you'll see:
        ,English xxxx xxxxxx .1
                  .xxx and more

(which is incorrect)


So Mozilla is actually doing the right thing.  

The way to think about this is that the <br> should actually be irrelevant
for things like voice browsers.  It is only there to force the line to
visually wrap at a given point.  In this way it is exactly the same as a
'soft carriage return' or 'forced line break' in other types of software.
Authors typically use these things to even out line lengths without
setting the width of the text box (often a questionable practice).  In
translation, for example, it is routine to remove all such forced line
breaks before translation. This is because the line breaks would fall at
different points in the translated string, since text on a line expands
or contracts at different rates as translated words are substituted. After
translation, if necessary, forced line breaks are put back in at the
appropriate places.

In terms of the bidi algorithm, the code above is equivalent to
<p dir="rtl">1. xxxxxx xxxx English. 2. xxx.</p>

and
<p dir="rtl">1. xxxxxx xxxx English, and more xxx.</p>

So how do we get the original example to look how we want?

It's like the old joke: "What's the best way to get to <your city>?" "Well, I
wouldn't start from here." The code is a bad implementation.  If this is a
numbered list, one ought to use list markup.  If this is output from a text
box, and creating list elements is really too complicated (although things
like wikis manage to figure it out), then enforce the use of carriage
returns as paragraph separators, and put each line in a separate p element.
Text input boxes should enforce appropriate behaviour by wrapping lines
automatically.
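
As an untested sketch of the two alternatives just described (reusing the
xxxx placeholders from the examples above, shown in memory order), either
of the following gives each item its own block. The bidi algorithm should
then treat each item as a separate paragraph, so the period after "English"
is expected to follow the right-to-left base direction. The margin:0 style
is the suggestion from Mark Davis quoted later in this message, not
something the markup requires.

```html
<!-- Alternative 1: list markup. The list supplies the numbering, so the
     "1." and "2." are dropped from the text. -->
<ol dir="rtl">
  <li>xxxxxx xxxx English.</li>
  <li>xxx.</li>
</ol>

<!-- Alternative 2: one p element per line of input. margin:0 removes the
     paragraph spacing so the result looks like a forced line break. -->
<p dir="rtl" style="margin:0">1. xxxxxx xxxx English.</p>
<p dir="rtl" style="margin:0">2. xxx.</p>
```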

It is possible for the author of the text to produce results that look
better using &rlm; or its equivalent, but this rides roughshod over the true
structural problems of the text.
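
For illustration only, that workaround would look something like the
following, assuming the right-to-left mark (U+200F, written &rlm; in HTML
4) is placed directly after the Latin word. The period then sits between a
right-to-left character and a number, so it should resolve right-to-left
and appear to the left of "English". The markup is still a list disguised
as a paragraph.

```html
<p dir="rtl">
1. xxxxxx xxxx English&rlm;.<br>
2. xxx.
</p>
```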


So in summary, the problem is not with the bidi algorithm, nor with <br>; it
is with the way the text has been marked up.

Hope that helps,
RI

[1] http://www.w3.org/TR/html401/struct/text.html#edef-BR


============
Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 

Publication blog:
http://people.w3.org/rishida/blog/
 
 

> -----Original Message-----
> From: public-i18n-core-request@w3.org 
> [mailto:public-i18n-core-request@w3.org] On Behalf Of Eyal 
> Rozenberg (by way of Martin Duerst <duerst@w3.org>)
> Sent: 16 February 2005 04:45
> To: public-i18n-core@w3.org
> Subject: Re: An issue with the Unicode BiDi Algorithm
> 
> 
> 
> 
> 
> Mark Davis wrote:
> >...  it appears to be that the bug filers want to treat <br> as if it
> >really does start another paragraph, but one without paragraph spacing.
> > ...
> >So it seems like what the people really want would be to use a <p
> >style="margin:0"> instead of a <br>.
> 
> But then, why should <br> ever be used? What I mean is, <p>'s 
> and <br>'s have semantic significance, they're not just 
> vehicles for visual style which you override with something 
> like "margin:0". e.g. I may want to break a line without 
> breaking the paragraph, and it is reasonable for me to want 
> to write an RTL sentence which ends with an LTR word before 
> the period on the first line, followed on the next line by an 
> RTL sentence which happens to begin with, say, a number. They 
> may be two sentences forming a single paragraph semantically, 
> which should not have to be split up just so as to display 
> like one would expect them to.
> 
> So, is there some compelling reason why neutrals at ends of 
> lines should not have the same direction as that of the 
> paragraph (with no control characters present of course)?
> 
> Eyal
> 
> PS - I'm assuming it is appropriate for me to also CC the two 
> mailing lists; if that is not the case, please let me know.
> 
> 
> 
Received on Wednesday, 16 February 2005 11:15:18 GMT
