- From: Aharon (Vladimir) Lanin <aharon@google.com>
- Date: Mon, 14 May 2012 22:06:21 +0200
- To: W3C style mailing list <www-style@w3.org>, public-i18n-bidi@w3.org
- Message-ID: <CA+FsOYYKGsKOrAoJ55jcYvYHVcWRojX4_KaCBfpQZSn8w=d3Ug@mail.gmail.com>
Last week, I wrote up and Mark Davis submitted to the UTC a proposal (http://goo.gl/K6qtV) for adding bidi isolation to Unicode. Here is the basic proposal:

--- start quote ---

Define three new Unicode formatting code points:

LRI: marks the beginning of a left-to-right isolate.
RLI: marks the beginning of a right-to-left isolate.
FSI: marks the beginning of a first-strong isolate.

Each would be matched with a PDF. Obviously, isolates would be allowed to nest, just like embeddings.

The visual ordering within an isolate is the same as if it used LRE or RLE, as appropriate, instead of the characters above. However, the visual ordering of the content outside the isolate and the placement of the isolate within it is the same as if the isolate were an ON-class character (e.g. U+FFFC).

Rules P2 and P3 for determining base direction (the first-strong algorithm) have to be modified to skip over the content in an isolate (i.e. everything between one of the new characters above and its matching PDF or end of paragraph, whichever comes first). This is part of making an isolate behave as a neutral character for the purposes of the visual ordering of the content surrounding it. This change to P2 and P3 would apply both when determining the base direction of a paragraph and when determining the embedding direction of an FSI isolate.

--- end quote ---

This proposal is obviously inspired by unicode-bidi:isolate and unicode-bidi:plaintext. The purpose of adding similar features to Unicode is two-fold:

- These features are just as necessary in plain text as in HTML.
- It would simplify things for browsers implementing unicode-bidi:isolate and unicode-bidi:plaintext, which currently need to use complicated and error-prone work-arounds.
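To illustrate the modified P2/P3 rule from the quote, here is a minimal Python sketch. It is not part of the proposal. The proposal does not give code points for LRI, RLI and FSI, so the sketch uses private-use characters as hypothetical stand-ins, and the function name first_strong_direction is likewise made up for illustration. It assumes the input is a single paragraph and that a PDF closes the most recently opened embedding, override or isolate.

```python
import unicodedata

# Hypothetical stand-ins: the proposal assigns no code points to LRI, RLI
# and FSI, so private-use characters are used purely for illustration.
LRI, RLI, FSI = "\uE000", "\uE001", "\uE002"
# Existing Unicode formatting characters.
LRE, RLE, PDF, LRO, RLO = "\u202A", "\u202B", "\u202C", "\u202D", "\u202E"

ISOLATE_STARTS = {LRI, RLI, FSI}
EMBEDDING_STARTS = {LRE, RLE, LRO, RLO}


def first_strong_direction(paragraph, default="ltr"):
    """First-strong heuristic (P2/P3) that skips the content of isolates.

    `paragraph` is assumed to contain no paragraph separators. Returns
    "ltr" or "rtl", or `default` if no strong character is found outside
    an isolate.
    """
    depth = 0  # > 0 while we are inside an isolate that is being skipped
    for ch in paragraph:
        if depth > 0:
            # Inside an isolate: only track nesting so that the matching
            # PDF can be found. An unmatched isolate runs to the end.
            if ch in ISOLATE_STARTS or ch in EMBEDDING_STARTS:
                depth += 1
            elif ch == PDF:
                depth -= 1
            continue
        if ch in ISOLATE_STARTS:
            depth = 1
            continue
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


# Hebrew letters inside an RLI isolate no longer decide the base direction.
print(first_strong_direction(RLI + "\u05D0\u05D1" + PDF + " abc"))  # ltr
# Without the isolate, the same Hebrew letters come first and win.
print(first_strong_direction("\u05D0\u05D1 abc"))  # rtl
```

Under the same assumptions, the embedding direction of an FSI isolate could be found by calling the function on the text between the FSI and its matching PDF.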
Nevertheless, the proposal differs from the definitions of unicode-bidi:isolate and unicode-bidi:plaintext in an important way: unicode-bidi:isolate and unicode-bidi:plaintext allow paragraph breaks inside the isolate, and define the content inside the isolate as forming a sequence of paragraphs in this case. Thus:

1. A paragraph break does not end the isolate.
2. A paragraph break within the isolate does not constitute a paragraph break in the surrounding content, which still behaves as if the whole isolate were just a neutral character.
3. An extra PDF within the isolate does not end it prematurely and the effects of a missing one do not extend beyond the isolate, or even beyond the paragraph missing the PDF.

This is all very different from the new characters as proposed, which are strictly inline (just like LRE and RLE). The differences mean that the CSS features as currently defined could not be implemented in terms of the proposed characters, which is obviously a major problem.

So, why did I propose it this way? Why not change the Unicode proposal to allow the effects of the new isolating characters to span a paragraph break, as was defined in CSS?

Well, one reason is that this would be a radical departure from existing Unicode bidi algorithm rules and would have no chance of passing the Unicode committees. Nor would it resolve the difference regarding missing and extra PDFs. But, most importantly, I am not sure that having the paragraph outside the isolate continue uninterrupted around an isolate that contains a paragraph break really makes much sense. The continuation is too non-local to make much sense to the reader.
I think that it would be better instead to try to change the definition of unicode-bidi:isolate and unicode-bidi:plaintext so that it could be implemented in terms of the new Unicode characters, when and if they become available - but could still be implemented until then using existing techniques.

The change would be that a paragraph break within an inline isolate would also end the paragraph outside the isolate. However, the existing rule about reopening inline directionality embeddings after a paragraph break inside an inline element would then come into play, and both the direction outside the isolate and the isolate itself would then be reopened after the paragraph break. Thus, the paragraph break would end one isolate - but then open another one that will be closed by the isolate's end tag. For example:

<div dir=ltr>...<span dir=ltr>...<bdi dir=rtl>...<br>...</bdi>...</span>...</div>

would be equivalent to:

<div dir=ltr>...<span dir=ltr>...<bdi dir=rtl>...</bdi></span><br><span dir=ltr><bdi dir=rtl>...</bdi>...</span>...</div>

Isolates would no longer deal with missing/extra PDFs in CSS, but I believe that those can and should be handled in a different manner unrelated to isolates: having CSS remove extra PDFs and add missing ones automatically. But that is another story.

The rest of this message is a first stab at the changes in the Writing Modes and Text specs that would be necessary to do this. The good part is that I believe that the change makes both of these specs simpler.

In "Inline Direction and Bidirectionality", the paragraph starting with the words "Except when the ‘plaintext’ value of ‘unicode-bidi’ is in effect" would now go like this:

------------
In most cases, the paragraph embedding level is set according to the value of the ‘direction’ property of the paragraph's containing block.
However, when the containing block has ‘unicode-bidi: plaintext’, the paragraph embedding level is set according to the heuristic given in steps P2 and P3 of the Unicode algorithm.
------------

Note that the last sentence of that paragraph (starting with 'The paragraph's element is usually the containing block') is no longer necessary.

The isolate definition would go like this:

------------
isolate
If the element is inline, this value opens an additional level of embedding with respect to the bidirectional algorithm. The direction of this embedding level is given by the ‘direction’ property. However, for the purpose of bidi resolution of the content around it, this embedding must be treated as if it were an Object Replacement Character (U+FFFC). Inside the element, reordering is done implicitly.
------------

The "Inside the element, reordering is done implicitly" sentence comes from the current definition of "embed". I don't actually understand what it means in either context, but copied it for your consideration.

When and if LRI and RLI make it into the Unicode spec, the following sentence should be added after "as if it were an Object Replacement Character (U+FFFC)":

------------
This corresponds to adding an LRI (U+????) for ‘direction: ltr’, or RLI (U+????) for ‘direction: rtl’, at the start of the element and a PDF (U+202C) at the end of the element.
------------

The plaintext definition would go like this:

------------
plaintext
This value behaves as ‘isolate’ except that the direction of the embedding started when the element is inline, or the base direction of the paragraph started when the element isn't inline, is determined not by the element's computed ‘direction’ as usual, but by following the heuristic in rules P2 and P3 of the Unicode bidirectional algorithm on the content within the embedding or paragraph.
------------

When and if FSI makes it into the Unicode spec, the following sentence should be added on:

------------
This corresponds to adding an FSI (U+????) at the start of an inline element and a PDF (U+202C) at its end.
------------

The paragraph starting with "If an inline element is broken around a bidi paragraph boundary" should now go something like this:

------------
When an inline element contains a bidi paragraph boundary (e.g. a block or forced paragraph break), any embedding levels or overrides started by the element are closed at the paragraph boundary and reopened on the other side. When reopened, the embedding of a unicode-bidi:isolate or unicode-bidi:plaintext element is once again treated as if it were an Object Replacement Character (U+FFFC) for the purpose of bidi resolution of the content around the embedding. The base direction of the embedding reopened for a unicode-bidi:plaintext element is determined by following the heuristic in rules P2 and P3 of the Unicode bidirectional algorithm on the content within the reopened embedding (which may differ from that of the embedding closed before the paragraph boundary).
------------

In Text Level 3, the "Bidirectionality and Line Boxes" section will need changing because the "immediately contained" definition is no longer necessary:

------------
The start and end edges of a line box are determined by the inline base direction of the line box. In most cases, this is given by its containing block's computed ‘direction’. However, if its containing block has ‘unicode-bidi: plaintext’ [CSS3-WRITING-MODES], the inline base direction of the line box must be determined by the base direction of the bidi paragraph for which the line box holds content. An empty line box [and so on, with no further changes]
------------

Aharon
(Busily donning an anti-flame suit)
Received on Monday, 14 May 2012 20:07:12 UTC