- From: by way of Martin Duerst <eyalroz@technion.ac.il>
- Date: Sat, 19 Feb 2005 12:02:40 +0900
- To: public-i18n-core@w3.org
Thank you for your reply, Richard.

I still find that your reply is more of a presentation of the current views of W3C / the Unicode people on the matter. If this had been an a priori argument about what <br>'s should mean and how they should be used, I would find your view a perfectly reasonable alternative (although it would mean there would have to be a way to 'break a line syntactically' without switching to a new paragraph, which in the current scheme of things <br> is not intended to do). However, there is the huge corpus of existing HTML documents with RTL text which all assume an alternative interpretation of <br>: being not just white space, but having some semantic significance of a break. True, this is in some part due to MSIE's conventions, but as I argued in my previous e-mails, this assumption has merit in itself.

I know, the
1. xxx
2. xxx
is not the best of examples when considered as HTML, but it is a very common example when plain text is displayed as HTML. I.e., if you were to take the following block of text:

-----
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat.
Duis autem vel eum iriure etc etc
-----

and 'HTML'ify it (this is very common practice, not just in Mozilla and other web browsers which feed text to their HTML renderer for display and/or edit text messages as HTML, but also in news sites which store stories as plain text and use it to prepare their displayed pages) - you would not make every line of text a paragraph, on one hand, and you would usually not go to the trouble of running some intelligent classifier to decide where paragraphs end (e.g. at volutpat) and where they do not (e.g. at consequat). What you would normally do is append <br>'s to each line.
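The HTMLization step described above can be sketched roughly as follows. This is only an illustration in Python, and the function name `htmlize_lines` is hypothetical; it is not taken from Mozilla or from any news site's code. It escapes each line and appends a <br>, making no attempt to work out where paragraphs end:

```python
import html


def htmlize_lines(text: str) -> str:
    """Escape each plain-text line and append <br> to it.

    No attempt is made to detect paragraph boundaries: every line
    break in the input simply becomes a <br> in the output.
    """
    return "\n".join(html.escape(line) + "<br>" for line in text.splitlines())


# Two lines of plain text: the first ends a paragraph, the second does not,
# but the converter treats both line ends identically.
sample = "magna aliquam erat volutpat.\nUt wisi enim ad minim veniam,"
print(htmlize_lines(sample))
```

Because the converter cannot tell a paragraph end from an ordinary line end, every <br> it emits stands for 'some kind of break' in the source text.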
Thus you see text (or HTML) like

most people in the room don't know the Hebrew word ;DROW<br>
3 people do know it.

or

most people in the room don't know the Hebrew word ;DROW<br>
REHTONA is a better known word.

These are the more representative examples - at least from my experience - of the use of <br>'s, in comparison to your provided example of

1. xxxxxx xxxx English,<br>
and more xxx.

Because one immediately finds oneself asking: why put a forced line break in the middle of a phrase? If it's a single phrase, authors prefer to keep it on the same line; if it is broken by a <br>, the author most likely considers it not to be an indivisible whole. And if this <br> comes from HTMLizing text, it is usually less confusing to err by endowing the comma with the 'general' direction of the text than to err by switching it to the other side of 'English'. The reason is that the first type of error does nonetheless correspond somewhat to the order of reading (you only need to reinterpret an 'end-of-the-line' punctuation mark as a 'coming-after-the-last-word-in-this-line' punctuation mark); while the second option, in case of an error, makes you pause when first viewing the comma or period, thinking that a new clause is beginning, then move to the next line only to find that this is not the case and that 'English' belongs to the previous clause or sentence after all, and finally re-read the text while switching the comma or period back to the end of the line in your head.

I hope this clears up 'where I'm coming from' in requesting this change. This situation seems to me not entirely unlike the problem we had with minus as a number separator, https://bugzilla.mozilla.org/show_bug.cgi?id=73251 , in which the behavior of the minus was modified to accommodate one of its common uses in RTL text. There too, although MSIE was 'wrong' w.r.t.
the Unicode standard's original scheme, it turned out that the 'wrong' way of laying out the minus was in general better than the 'right' way.

I think we would be hard-pressed to find more than a handful, if even that, of people breaking up their "English, and more" with <br>s, especially within RTL text.

Eyal

PS - As for audio-only browsers, the point about <br>s being discarded is sort of moot, since when reading the HTML or text, the punctuation mark will always come after the last word if it is written after the last word - the 'switching' is done only for visual layout.

Richard Ishida wrote:
>Eyal,
><br> is intended for use as presentation oriented markup, not structural.
>The HTML 4.01 spec clearly describes it as equivalent to a line separator
>(ie. white space), as opposed to a paragraph delimiter[1]. Lines are not
>semantically important in HTML. (You can think of the effect as similar to
>what you'd expect from reducing the width of a window or box containing
>text, but applying that to a single line at a time.)
>The way Internet Explorer handles it looks clever in the context of the
>example being used, but actually produces incorrect results in other
>situations.
>Try this code in a browser:
>First, let's look at the original example from the bugzilla inclusion [for
>the code snippets I replaced body with p, and changed hebrew characters to
>xxxx, and display the text between markup in the order it appears *in
>memory*]:
><p dir="rtl">
>1. xxxxxx xxxx English.<br>
>2. xxx.
></p>
>This produces this in Mozilla:
> English. xxxx xxxxxx .1
> .xxx .2
>and this in IE:
> .English xxxx xxxxxx .1
> .xxx .2
>Now compare that to:
><p dir="rtl">
>1. xxxxxx xxxx English,<br>
>and more xxx.
></p>
>In Mozilla you'll see:
> English, xxxx xxxxxx .1
> .xxx and more
>(which is correct)
>and in IE you'll see:
> ,English xxxx xxxxxx .1
> .xxx and more
>(which is incorrect)
>
>So Mozilla is actually doing the right thing.
>The way to think about this is that the <br> should actually be irrelevant
>for things like voice browsers. It is only there to force the line to
>visually wrap at a given point. In this way it is exactly the same as a
>'soft carriage return' or 'forced line break' in other types of software.
>Authors typically use these things for making line lengths the same without
>setting the text box (often a questionable practise). In translation, for
>example, it is routine to remove all such forced line breaks before
>translation. This is because all the line breaks will occur at different
>vertical alignments in the translated string because text on a line expands
>or contracts at different rates as translated words are substituted. After
>translation, if necessary, forced line breaks are put back in at the
>appropriate places.
>In terms of the bidi algorithm, the code snippets above are equivalent to
><p dir="rtl">1. xxxxxx xxxx English. 2. xxxx.</p>
>and
><p dir="rtl">1. xxxxxx xxxx English, and more xxxx.</p>
>So how do we get the original example to look how we want?
>Like the old joke "What's the best way to get to <your city>?" "Well, I
>wouldn't start from here." The code is a bad implementation. If this is a
>numbered list, one ought to use list markup. If this is output from a text
>box, and creating list elements is really too complicated (although things
>like wiki's manage to figure it out), then enforce the use of carriage
>returns as paragraph separators, and put each line in a separate p element.
>Text input boxes should enforce appropriate behaviour by wrapping lines
>automatically when they are too wide for the box.
>It is possible for the author of the text to produce results that look
>better using &rlm; or its equivalent, but this rides roughshod over the true
>structural problems of the text.
>
>So in summary, the problem is not with the bidi algorithm, nor with <br>, it
>is with the way the text has been marked up.
>Hope that helps,
>RI
>[1] http://www.w3.org/TR/html401/struct/text.html#edef-BR
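
For reference, the distinction drawn in the quoted message between a line separator and a paragraph delimiter is visible in the Unicode bidi classes of the characters involved. The sketch below uses Python's standard `unicodedata` module; the choice of sample characters and the `samples` / `bidi_classes` names are mine. It only looks up the classes and does not run the bidi algorithm itself:

```python
import unicodedata

# Characters relevant to the discussion; the labels are only for display.
samples = {
    "LINE SEPARATOR (U+2028)": "\u2028",
    "PARAGRAPH SEPARATOR (U+2029)": "\u2029",
    "COMMA": ",",
    "FULL STOP": ".",
    "LATIN CAPITAL LETTER E": "E",
    "HEBREW LETTER ALEF": "\u05d0",
    "RIGHT-TO-LEFT MARK (U+200F)": "\u200f",
}

# Look up the bidi class of each character in the Unicode database.
bidi_classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}

for name, cls in bidi_classes.items():
    print(f"{name}: {cls}")
```

The line separator is in class WS (white space) while the paragraph separator is in class B. Neither the comma nor the full stop has a strong direction of its own, so where they end up depends on the surrounding characters, and a break that is treated as white space does not cut them off from the text on the next line.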
Received on Saturday, 19 February 2005 03:03:15 UTC