Re: Proposal for isolation characters in Unicode and the unicode-bidi:isolate and unicode-bidi:plaintext definitions

I applaud the goal here.

I think the isolates constitute a major departure for the Unicode Bidi 
algorithm anyway, so that it would be better for Unicode to own up to it 
and either explicitly version the algorithm (UBA-2) or otherwise mark 
the departure.

If that were done, Unicode would have greater latitude to implement 
"less forward compatible" features.

Why is that important? Implementing a new set of characters that creates 
layout features that can be mirrored in CSS, but not the other way 
around, creates yet another area that will need workarounds. So this 
half-solution doesn't really seem to buy anything, yet any 
implementation that supports it would be just as incompatible with 
existing implementations as one that supports the more radical solution.

Unicode views conformance in ways that gracefully adjust to adding 
characters to the standard. For bidi, this includes the stipulation that 
a conformant implementation only needs to support the behavior of 
characters it recognizes (as it can obviously not handle characters 
added in later versions of the standard).

For bidi, this allows implementations to be conformant that do not 
support embeddings etc. by claiming that they don't "interpret" said 
characters.

What would be more useful is a UBA-2 that *required* full support 
of ALL special bidi characters (overrides, embeddings and the new 
isolates), not giving an implementation any "out" there, while still 
allowing implementations to not support specific right-to-left scripts, 
because, unlike formatting characters, their number and repertoire will 
likely increase.

A./


On 5/14/2012 1:06 PM, Aharon (Vladimir) Lanin wrote:
> Last week, I wrote up and Mark Davis submitted to the UTC a proposal 
> (http://goo.gl/K6qtV) for adding bidi isolation to Unicode. Here is 
> the basic proposal:
>
> --- start quote ---
> Define three new Unicode formatting code points:
> LRI: marks the beginning of a left-to-right isolate.
> RLI: marks the beginning of a right-to-left isolate.
> FSI: marks the beginning of a first-strong isolate.
>
> Each would be matched with a PDF. Obviously, isolates would be allowed 
> to nest, just like embeddings. The visual ordering within an isolate 
> is the same as if it used LRE or RLE, as appropriate, instead of the 
> characters above. However, the visual ordering of the content outside 
> the isolate and the placement of the isolate within it is the same as 
> if the isolate were an ON-class character (e.g. U+FFFC).
>
> Rules P2 and P3 for determining base direction (the first-strong 
> algorithm) have to be modified to skip over the content in an isolate 
> (i.e. everything between one of the new characters above and its 
> matching PDF or end of paragraph, whichever comes first). This is part 
> of making an isolate behave as a neutral character for the purposes of 
> the visual ordering of the content surrounding it. This change to P2 
> and P3 would apply both when determining the base direction of a 
> paragraph and when determining the embedding direction of an FSI isolate.
> --- end quote ---
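
A minimal sketch of the modified first-strong rule quoted above, written in Python purely as an illustration. The proposal does not assign code points to LRI, RLI and FSI, so the private-use characters below are hypothetical stand-ins, and the function name is an assumption. It pairs each opener with a PDF as the proposal describes, and treats the end of the paragraph as closing any isolate that is still open.

```python
import unicodedata

# Hypothetical stand-ins: the proposal had no assigned code points,
# so private-use characters are used here for illustration only.
LRI, RLI, FSI = "\uE000", "\uE001", "\uE002"
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING, the terminator named in the proposal
ISOLATE_OPENERS = {LRI, RLI, FSI}
EMBEDDING_OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO


def first_strong_direction(text, default="ltr"):
    """Return 'ltr' or 'rtl' from the first strong character outside any isolate."""
    stack = []  # one entry per open embedding/isolate; True marks an isolate
    for ch in text:
        if ch in ISOLATE_OPENERS:
            stack.append(True)
        elif ch in EMBEDDING_OPENERS:
            stack.append(False)
        elif ch == PDF:
            if stack:
                stack.pop()  # an unmatched PDF is simply ignored
        else:
            bidi = unicodedata.bidirectional(ch)
            if bidi == "B":
                break  # end of paragraph, which also ends any open isolate
            if any(stack):
                continue  # inside an isolate: skipped by the modified P2/P3
            if bidi == "L":
                return "ltr"
            if bidi in ("R", "AL"):
                return "rtl"
    return default
```

Under the same assumptions, the embedding direction of an FSI isolate would come from calling the function on the isolate's own content.
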
>
> This proposal is obviously inspired 
> by unicode-bidi:isolate and unicode-bidi:plaintext. The purpose of 
> adding similar features to Unicode is two-fold:
>
>   * These features are just as necessary in plain text as in HTML.
>   * Simplify things for browsers implementing unicode-bidi:isolate and
>     unicode-bidi:plaintext, which currently need to use complicated
>     and error-prone work-arounds.
>
> Nevertheless, the proposal differs from the definitions of 
> unicode-bidi:isolate and unicode-bidi:plaintext in an important 
> way: unicode-bidi:isolate and unicode-bidi:plaintext allow paragraph 
> breaks inside the isolate, and define the content inside the 
> isolate as forming a sequence of paragraphs in this case. Thus:
>
> 1. A paragraph break does not end the isolate.
> 2. A paragraph break within the isolate does not constitute a 
> paragraph break in the surrounding content, which still behaves as if 
> the whole isolate were just a neutral character.
> 3. An extra PDF within the isolate does not end it prematurely and the 
> effects of a missing one do not extend beyond the isolate, or even 
> beyond the paragraph missing the PDF.
>
> This is all very different from the new characters as proposed, which 
> are strictly inline (just like LRE and RLE). The differences mean that 
> the CSS features as currently defined could not be implemented in 
> terms of the proposed characters.
>
> This is obviously a major problem. For one thing, it would preclude 
> implementing the CSS features as currently defined in terms of the new 
> Unicode features.
>
> So, why did I propose it this way? Why not change the Unicode proposal 
> to allow the effects of the new isolating characters to span a 
> paragraph break, as was defined in CSS?
>
> Well, one reason is that this would be a radical departure from 
> existing Unicode bidi algorithm rules and would have no chance of 
> passing the Unicode committees. Nor would it resolve the difference 
> regarding missing and extra PDFs. But, most importantly, I am not sure 
> that having the paragraph outside the isolate continue uninterrupted 
> around an isolate that contains a paragraph break really makes much 
> sense. The continuation is too non-local to make much sense to the reader.
>
> I think that it would be better instead to try to change the 
> definition of unicode-bidi:isolate and unicode-bidi:plaintext so that 
> it could be implemented in terms of the new Unicode characters, when 
> and if they become available - but could still be implemented until 
> then using existing techniques.
>
> The change would be that a paragraph break within an inline 
> isolate would also end the paragraph outside the isolate. However, the 
> existing rule about reopening inline directionality embeddings after a 
> paragraph break inside an inline element would then come into play, 
> and both the direction outside the isolate and the isolate itself 
> would then be reopened after the paragraph break. Thus, the paragraph 
> break would end one isolate - but then open another one that will be 
> closed by the isolate's end tag. For example:
>
> <div dir=ltr>...<span dir=ltr>...<bdi 
> dir=rtl>...<br>...</bdi>...</span>...</div>
> would be equivalent to:
> <div dir=ltr>...<span dir=ltr>...<bdi 
> dir=rtl>...</bdi></span><br><span dir=ltr><bdi 
> dir=rtl>...</bdi>...</span>...</div>
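
The close-and-reopen behaviour in this example can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the input is a pre-tokenized list of inline start tags, end tags, `<br>` and text (the block-level `<div>` is left out because it is not closed at the break), and the function names are hypothetical.

```python
def tag_name(start_tag):
    """Extract 'span' from a start tag such as '<span dir=ltr>'."""
    return start_tag[1:-1].split()[0]


def split_inlines_at_breaks(tokens):
    """Close all open inline elements before each <br> and reopen them after it."""
    open_tags = []  # start tags of currently open inline elements, outermost first
    out = []
    for tok in tokens:
        if tok == "<br>":
            # Close every open inline element, innermost first...
            out.extend("</" + tag_name(t) + ">" for t in reversed(open_tags))
            out.append(tok)
            # ...then reopen them in their original order after the break.
            out.extend(open_tags)
        elif tok.startswith("</"):
            open_tags.pop()
            out.append(tok)
        elif tok.startswith("<"):
            open_tags.append(tok)
            out.append(tok)
        else:
            out.append(tok)  # plain text
    return "".join(out)
```

Because the start tags are reopened verbatim, a reopened `<bdi dir=rtl>` is again an isolate, which is the effect the proposed change relies on.
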
>
> Isolates would no longer deal with missing/extra PDFs in CSS, but I 
> believe that those can and should be handled in a different manner 
> unrelated to isolates: having CSS remove extra PDFs and add missing 
> ones automatically. But that is another story.
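
One way the "remove extra PDFs and add missing ones" idea could look, as a rough Python sketch. The message does not specify the mechanism, so the per-paragraph scope, the choice to append missing PDFs at the end of the paragraph, and the function name are all assumptions.

```python
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO


def balance_pdfs(paragraph):
    """Drop PDFs that close nothing and append PDFs for openers left unclosed."""
    out = []
    depth = 0  # number of embeddings/overrides currently open
    for ch in paragraph:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            if depth == 0:
                continue  # extra PDF: nothing to close, so remove it
            depth -= 1
        out.append(ch)
    out.append(PDF * depth)  # add one PDF per opener still open
    return "".join(out)
```
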
>
> The rest of this message is a first stab at the changes in the Writing 
> Modes and Text specs that would be necessary to do this. The good part 
> is that I believe that the change makes both of these specs simpler.
>
> In "Inline Direction and Bidirectionality", the paragraph starting 
> with the words "Except when the ‘plaintext’ value of ‘unicode-bidi’ is 
> in effect" would now go like this:
>
> ------------
> In most cases, the paragraph embedding level is set according to the 
> value of the ‘direction’ property of the paragraph's containing block. 
> However, when the containing block has ‘unicode-bidi: plaintext’, the 
> paragraph embedding level is set according to the heuristic given 
> in steps P2 and P3 of the Unicode algorithm.
> ------------
>
> Note that the last sentence of that paragraph (starting with 'The 
> paragraph's element is usually the containing block') is no longer 
> necessary.
>
> The isolate definition would go like this:
>
> ------------
> isolate
> If the element is inline, this value opens an additional level of 
> embedding with respect to the bidirectional algorithm. The direction 
> of this embedding level is given by the ‘direction’ property. 
> However, for the purpose of bidi resolution of the content around it, 
> this embedding must be treated as if it were an Object Replacement 
> Character (U+FFFC). Inside the element, reordering is done implicitly.
> ------------
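
To make the "treated as if it were an Object Replacement Character" wording concrete, here is a small Python sketch. The `Inline` class and `outer_text` function are hypothetical illustrations, not anything defined by the CSS drafts. It builds the text that bidi resolution of the surrounding content would see, with each isolate collapsed to a single U+FFFC.

```python
from dataclasses import dataclass

OBJECT_REPLACEMENT = "\uFFFC"  # the neutral stand-in named in the definition


@dataclass
class Inline:
    """A toy inline element: children are strings or nested Inline nodes."""
    children: list
    isolate: bool = False  # True for unicode-bidi: isolate (or plaintext)


def outer_text(node):
    """Text seen by bidi resolution of the content around any isolates."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.isolate:
            parts.append(OBJECT_REPLACEMENT)  # the isolate's content is hidden
        else:
            parts.append(outer_text(child))  # ordinary inlines are transparent
    return "".join(parts)
```

The content of each isolate would then be resolved separately, in the direction given by the element's 'direction' property.
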
>
> The "Inside the element, reordering is done implicitly" sentence comes 
> from the current definition of "embed". I don't actually understand 
> what it means in either context, but copied it for your consideration.
>
> When and if LRI and RLI make it into the Unicode spec, the following 
> sentence should be added after "as if it were an Object Replacement 
> Character (U+FFFC)":
>
> ------------
> This corresponds to adding an LRI (U+????) for ‘direction: ltr’, or 
> RLI (U+????) for ‘direction: rtl’, at the start of the element and a 
> PDF (U+202C) at the end of the element.
> ------------
>
> The plaintext definition would go like this:
>
> ------------
> plaintext
> This value behaves as ‘isolate’ except that the direction of the 
> embedding started when the element is inline, or the base direction of 
> the paragraph started when the element isn't inline, is determined not 
> by the element's computed ‘direction’ as usual, but by following the 
> heuristic in rules P2 and P3 of the Unicode bidirectional algorithm on 
> the content within the embedding or paragraph.
> ------------
>
> When and if FSI makes it into the Unicode spec, the following sentence 
> should be added on:
>
> ------------
> This corresponds to adding an FSI (U+????) at the start of an inline 
> element and a PDF (U+202C) at its end.
> ------------
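
The correspondence between the two CSS values and the proposed characters could be summarized by a sketch like the one below. The code points for LRI, RLI and FSI are still unassigned in the proposal, so the private-use characters here are hypothetical placeholders, and `wrap_inline` is an illustrative name rather than part of any spec.

```python
# Hypothetical placeholders: the proposal leaves these code points as U+????.
LRI, RLI, FSI = "\uE000", "\uE001", "\uE002"
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def wrap_inline(content, unicode_bidi, direction="ltr"):
    """Wrap an inline element's content in the characters its CSS values map to."""
    if unicode_bidi == "plaintext":
        opener = FSI  # direction comes from the content, not from 'direction'
    elif unicode_bidi == "isolate":
        opener = LRI if direction == "ltr" else RLI
    else:
        return content  # other values are outside this sketch
    return opener + content + PDF
```
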
>
> The paragraph starting with "If an inline element is broken around a 
> bidi paragraph boundary" should now go something like this:
>
> ------------
> When an inline element contains a bidi paragraph boundary (e.g. a 
> block or forced paragraph break), any embedding levels or overrides 
> started by the element are closed at the paragraph boundary and 
> reopened on the other side. When reopened, the embedding of a 
> unicode-bidi:isolate or unicode-bidi:plaintext element is once again 
> treated as if it were an Object Replacement Character (U+FFFC) for 
> the purpose of bidi resolution of the content around the embedding. 
> The base direction of the embedding reopened for a 
> unicode-bidi:plaintext element is determined by following the 
> heuristic in rules P2 and P3 of the Unicode bidirectional algorithm on 
> the content within the reopened embedding (which may differ from that 
> of the embedding closed before the paragraph boundary).
> ------------
>
> In Text Level 3, the "Bidirectionality and Line Boxes" will need 
> changing because the "immediately contained" definition is no longer 
> necessary:
>
> ------------
> The start and end edges of a line box are determined by the inline 
> base direction of the line box. In most cases, this is given by its 
> containing block's computed ‘direction’. However if its containing 
> block has ‘unicode-bidi: plaintext’ [CSS3-WRITING-MODES], the inline 
> base direction of the line box must be determined by the base 
> direction of the bidi paragraph for which the line box holds 
> content.  An empty line box [and so on, with no further changes]
> ------------
>
> Aharon
> (Busily donning an anti-flame suit)

Received on Tuesday, 15 May 2012 00:04:35 UTC