Re: An issue with the Unicode BiDi Algorithm

Thank you for your reply, Richard,

I still find that your reply is more of a presentation of the current views 
of W3C / the Unicode people on the matter.

If this had been an a priori argument about what <br>'s should mean and how
they should be used, I would find your view a perfectly reasonable 
alternative (although it would mean there would have to be a way to 'break 
a line syntactically' without switching to a new paragraph, which in the 
current scheme of things <br> is not intended to do).
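
To make the line-versus-paragraph distinction concrete: the Unicode character data already separates the two kinds of break, and it gives the punctuation in these examples a weak class whose direction is resolved from context. The following is only a sketch, assuming Python's standard unicodedata module (my choice for illustration, not something from the bug report); the `samples` table and its labels exist only for this example.

```python
import unicodedata

# Bidi classes relevant to the <br> discussion:
# 'WS' = whitespace, 'B' = paragraph separator, 'CS' = common separator,
# 'L' / 'R' = strong left-to-right / strong right-to-left.
samples = {
    "LINE SEPARATOR (U+2028)": "\u2028",
    "PARAGRAPH SEPARATOR (U+2029)": "\u2029",
    "COMMA": ",",
    "FULL STOP": ".",
    "LATIN SMALL LETTER H": "h",
    "HEBREW LETTER ALEF": "\u05d0",
}

# Look up the bidi class that the Unicode Character Database assigns to each.
bidi_classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}

for name, cls in bidi_classes.items():
    print(f"{name}: {cls}")
```

The line separator reports WS and the paragraph separator reports B, which is exactly the distinction at issue for <br>; the comma and full stop both report CS.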

However, there is the huge corpus of existing HTML documents with RTL text 
which all assume an alternative interpretation of <br> - being not just 
white space, but having some semantic significance as a break. True, this
is in some part due to MSIE's conventions, but as I argued in my previous
e-mails, the assumption has merit in itself. I know the "1. xxx 2. xxx" case
is not the best of examples when considered as HTML, but it is a very common
pattern when plain text is displayed as HTML. That is, if you were to take
the following block of text:

-----
Lorem ipsum dolor sit amet, consectetuer
adipiscing elit, sed diam nonummy nibh
euismod tincidunt ut laoreet dolore magna
aliquam erat volutpat.
Ut wisi enim ad minim veniam, quis nostrud
exerci tation ullamcorper suscipit lobortis
nisl ut aliquip ex ea commodo consequat.
Duis autem vel eum iriure etc etc
-----

and 'HTML'ify it (this is very common practice, not just in Mozilla and
other web browsers which feed text to their HTML renderer for display
and/or edit text messages as HTML, but also on news sites which store
stories as plain text and use it to prepare their displayed pages), you
would not make every line of text a paragraph, nor would you usually go
to the trouble of running some intelligent classifier to decide where
paragraphs end (e.g. at volutpat) and where they do not (e.g. at consequat).
What you would normally do is append a <br> to each line. Thus you see text
(or HTML) like

most people in the room don't know the Hebrew word ;DROW<br>
3 people do know it.

or

most people in the room don't know the Hebrew word ;DROW<br>
REHTONA is a better known word.

These are the more representative examples - at least in my experience -
of the use of <br>'s, in comparison to the example you provided of

1. xxxxxx xxxx English,<br>
and more xxx.

After all, that example immediately makes one ask: why put a forced line
break in the middle of a phrase? If it's a single phrase, authors prefer to
keep it on the same line; if it is broken by a <br>, the author most likely
does not consider it an indivisible whole. And if the <br> comes from
HTMLizing text, it is usually less confusing to err by giving the comma
the 'general' direction of the text than to err by switching it to the
other side of 'English'. The reason is that the first type of error still
corresponds somewhat to the order of reading (you only need to
reinterpret an 'end-of-the-line' punctuation mark as a
'coming-after-the-last-word-in-this-line' punctuation mark), while the
second, when it is an error, makes you pause on first seeing the
comma or period, thinking that a new clause is beginning, then move to
the next line only to find that this is not the case and that 'English'
belongs to the previous clause or sentence after all, and finally
re-read the text while mentally switching the comma or period back to the
end of the line.
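
The 'HTML'ifying step I described above can be sketched in a few lines. This is only an illustration, not the code Mozilla or any news site actually uses; the function name `htmlify` and the sample string are my own, and I assume the simplest possible rule: escape the markup characters and end every line except the last with <br>.

```python
import html


def htmlify(text: str) -> str:
    """Escape plain text and end every line but the last with <br>.

    Hypothetical helper: no paragraph detection is attempted, which is
    the point being made about how such HTML usually comes about.
    """
    lines = text.splitlines()
    # quote=False leaves apostrophes and quotes alone; only &, < and > are escaped.
    return "<br>\n".join(html.escape(line, quote=False) for line in lines)


sample = (
    "most people in the room don't know the Hebrew word ;DROW\n"
    "3 people do know it."
)
print(htmlify(sample))
```

Every line break of the plain text becomes a <br>, whether or not it ended a sentence or a paragraph in the author's mind, so the renderer has nothing else to go on.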

I hope this clears up 'where I'm coming from' in requesting this change. 
This situation seems to me not entirely unlike the problem we had with 
minus as a number separator,

https://bugzilla.mozilla.org/show_bug.cgi?id=73251

in which the behavior of the minus was modified to accommodate one of its
common uses in RTL text. There too, although MSIE was 'wrong' w.r.t. the
Unicode standard's original scheme, it turned out that the 'wrong' way of
laying out the minus was in general better than the 'right' way. I think we
would be hard-pressed to find more than a handful of people, if even that,
breaking up their "English, and more" with <br>s, especially within
RTL text.

Eyal

PS - As for audio-only browsers, the point about <br>s being discarded is
sort of moot: when the HTML or text is read out, the punctuation mark
will always come after the last word if it is written after the last word -
the 'switching' is done only for visual layout.


Richard Ishida wrote:
>Eyal,
><br> is intended for use as presentation oriented markup, not structural.
>The HTML 4.01 spec clearly describes it as equivalent to a line separator
>(ie. white space), as opposed to a paragraph delimiter[1]. Lines are not
>semantically important in HTML. (You can think of the effect as similar to
>what you'd expect from reducing the width of a window or box containing
>text, but applying that to a single line at a time.)
>The way Internet Explorer handles it looks clever in the context of the
>example being used, but actually produces incorrect results in other
>situations.
>Try this code in a browser:
>First, let's look at the original example from the bugzilla inclusion [for
>the code snippets I replaced body with p, and changed hebrew characters to
>xxxx, and display the text between markup in the order it appears *in
>memory*]:
><p dir="rtl">
>1. xxxxxx xxxx English.<br>
>2. xxx.
></p>
>This produces this in Mozilla:
>         English. xxxx xxxxxx .1
>                         .xxx .2
>and this in IE:
>         .English xxxx xxxxxx .1
>                         .xxx .2
>Now compare that to:
><p dir="rtl">
>1. xxxxxx xxxx English,<br>
>and more xxx.
></p>
>In Mozilla you'll see:
>         English, xxxx xxxxxx .1
>                   .xxx and more
>(which is correct)
>and in IE you'll see:
>         ,English xxxx xxxxxx .1
>                   .xxx and more
>(which is incorrect)
>
>So Mozilla is actually doing the right thing.
>The way to think about this is that the <br> should actually be irrelevant
>for things like voice browsers.  It is only there to force the line to
>visually wrap at a given point.  In this way it is exactly the same as a
>'soft carriage return' or 'forced line break' in other types of software.
>Authors typically use these things for making line lengths the same without
>setting the text box (often a questionable practise).  In translation, for
>example, it is routine to remove all such forced line breaks before
>translation. This is because all the line breaks will occur at different
>vertical alignments in the translated string because text on a line expands
>or contracts at different rates as translated words are substituted. After
>translation, if necessary, forced line breaks are put back in at the
>appropriate places.
>In terms of the bidi algorithm, the code snippets above are equivalent to 
><p dir="rtl">1. xxxxxx xxxx English. 2. xxxx.</p>
>and
><p dir="rtl">1. xxxxxx xxxx English, and more xxxx.</p>
>So how do we get the original example to look how we want?
>Like the old joke "What's the best way to get to <your city>?" "Well, I
>wouldn't start from here." The code is a bad implementation.  If this is a
>numbered list, one ought to use list markup.  If this is output from a text
>box, and creating list elements is really too complicated (although things
>like wiki's manage to figure it out), then enforce the use of carriage
>returns as paragraph separators, and put each line in a separate p element.
>Text input boxes should enforce appropriate behaviour by wrapping lines
>automatically when they are too wide for the box.
>It is possible for the author of the text to produce results that look
>better using &rlm; or its equivalent, but this rides roughshod over the true
>structural problems of the text.
>
>So in summary, the problem is not with the bidi algorithm, nor with <br>, it
>is with the way the text has been marked up.
>Hope that helps,
>RI
>[1] http://www.w3.org/TR/html401/struct/text.html#edef-BR

Received on Saturday, 19 February 2005 03:03:15 UTC