Proposal for isolation characters in Unicode and the unicode-bidi:isolate and unicode-bidi:plaintext definitions from Aharon (Vladimir) Lanin on 2012-05-14 (www-style@w3.org from May 2012)

From: Aharon (Vladimir) Lanin <aharon@google.com>
Date: Mon, 14 May 2012 22:06:21 +0200
To: W3C style mailing list <www-style@w3.org>, public-i18n-bidi@w3.org
Message-ID: <CA+FsOYYKGsKOrAoJ55jcYvYHVcWRojX4_KaCBfpQZSn8w=d3Ug@mail.gmail.com>
Last week, I wrote up and Mark Davis submitted to the UTC a proposal (
http://goo.gl/K6qtV) for adding bidi isolation to Unicode. Here is the
basic proposal:

--- start quote ---
Define three new Unicode formatting code points:
LRI: marks the beginning of a left-to-right isolate.
RLI: marks the beginning of a right-to-left isolate.
FSI: marks the beginning of a first-strong isolate.

Each would be matched with a PDF. Obviously, isolates would be allowed to
nest, just like embeddings. The visual ordering within an isolate is the
same as if it used LRE or RLE, as appropriate, instead of the characters
above. However, the visual ordering of the content outside the isolate and
the placement of the isolate within it is the same as if the isolate were
an ON-class character (e.g. U+FFFC).

Rules P2 and P3 for determining base direction (the first-strong algorithm)
have to be modified to skip over the content in an isolate (i.e. everything
between one of the new characters above and its matching PDF or end of
paragraph, whichever comes first). This is part of making an isolate behave
as a neutral character for the purposes of the visual ordering of the
content surrounding it. This change to P2 and P3 would apply both when
determining the base direction of a paragraph and when determining the
embedding direction of an FSI isolate.
--- end quote ---
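
To make the modified P2/P3 scan concrete, here is a rough Python sketch of
first-strong detection that skips isolate content. It is only an
illustration of the rule as quoted above, not text from the proposal. The
code points for LRI, RLI and FSI are not assigned yet, so the private-use
characters below are hypothetical stand-ins, and the function name is my own.

```python
import unicodedata

# Hypothetical stand-ins: the proposal does not assign code points yet, so
# private-use characters are used here purely for illustration.
LRI, RLI, FSI = "\uE000", "\uE001", "\uE002"
# Existing explicit formatting characters.
LRE, RLE, PDF, LRO, RLO = "\u202A", "\u202B", "\u202C", "\u202D", "\u202E"

ISOLATE_OPENERS = {LRI, RLI, FSI}
EMBEDDING_OPENERS = {LRE, RLE, LRO, RLO}


def first_strong_direction(text):
    """Return 'ltr', 'rtl', or None, ignoring strong characters in isolates."""
    stack = []          # True for an open isolate, False for an embedding/override
    isolate_depth = 0   # number of isolates currently open
    for ch in text:
        if ch in ISOLATE_OPENERS:
            stack.append(True)
            isolate_depth += 1
        elif ch in EMBEDDING_OPENERS:
            stack.append(False)
        elif ch == PDF:
            # A PDF closes the most recent opener; an extra PDF is ignored.
            if stack and stack.pop():
                isolate_depth -= 1
        elif unicodedata.bidirectional(ch) == "B":
            # End of paragraph: every open isolate ends here too.
            break
        elif isolate_depth == 0:
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return None  # no strong character found outside an isolate
```

The same function could be applied to the content of an FSI isolate to pick
its embedding direction, since nested isolates are skipped there as well.
Content inside LRE/RLE embeddings still counts, as it does in P2 today.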

This proposal is obviously inspired by unicode-bidi:isolate and
unicode-bidi:plaintext. The purpose of adding similar features to Unicode is
two-fold:

   - These features are just as necessary in plain text as in HTML.
   - They would simplify things for browsers implementing
   unicode-bidi:isolate and unicode-bidi:plaintext, which currently need to
   use complicated and error-prone work-arounds.

Nevertheless, the proposal differs from the definitions of
unicode-bidi:isolate and unicode-bidi:plaintext in an important way:
unicode-bidi:isolate and unicode-bidi:plaintext allow paragraph breaks
inside the isolate, and define the content inside the isolate as forming a
sequence of paragraphs in this case. Thus:

1. A paragraph break does not end the isolate.
2. A paragraph break within the isolate does not constitute a paragraph
break in the surrounding content, which still behaves as if the whole
isolate were just a neutral character.
3. An extra PDF within the isolate does not end it prematurely and the
effects of a missing one do not extend beyond the isolate, or even beyond
the paragraph missing the PDF.

This is all very different from the new characters as proposed, which are
strictly inline (just like LRE and RLE).

This is obviously a major problem: it would preclude implementing the CSS
features as currently defined in terms of the new Unicode features.

So, why did I propose it this way? Why not change the Unicode proposal to
allow the effects of the new isolating characters to span a paragraph
break, as was defined in CSS?

Well, one reason is that this would be a radical departure from existing
Unicode bidi algorithm rules and would have no chance of passing the
Unicode committees. Nor would it resolve the difference regarding missing
and extra PDFs. But, most importantly, I am not sure that having the
paragraph outside the isolate continue uninterrupted around an isolate that
contains a paragraph break really makes much sense. The continuation is too
non-local for the reader to follow.

I think that it would be better instead to try to change the definition of
unicode-bidi:isolate and unicode-bidi:plaintext so that it could be
implemented in terms of the new Unicode characters, when and if they become
available - but could still be implemented until then using existing
techniques.

The change would be that a paragraph break within an inline isolate would
also end the paragraph outside the isolate. However, the existing rule
about reopening inline directionality embeddings after a paragraph break
inside an inline element would then come into play, and both the direction
outside the isolate and the isolate itself would then be reopened after the
paragraph break. Thus, the paragraph break would end one isolate - but then
open another one that will be closed by the isolate's end tag. For example:

<div dir=ltr>...<span dir=ltr>...<bdi dir=rtl>...<br>...</bdi>...</span>...</div>
would be equivalent to:
<div dir=ltr>...<span dir=ltr>...<bdi dir=rtl>...</bdi></span><br><span dir=ltr><bdi dir=rtl>...</bdi>...</span>...</div>
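
The close-and-reopen rewriting in the example above can be sketched roughly
as follows. This is only an illustration over a made-up token list (the
token shapes and the function name are my own assumptions, and attributes
such as dir are left out), not how a browser would actually do it.

```python
def split_inline_at_breaks(tokens):
    """Close every open inline element before a break and reopen it after.

    Tokens are tuples: ("text", s), ("open", name), ("close", name), ("br",).
    Only the inline elements inside the block are modelled.
    """
    open_stack = []  # names of inline elements currently open, outermost first
    out = []
    for tok in tokens:
        kind = tok[0]
        if kind == "open":
            open_stack.append(tok[1])
            out.append(tok)
        elif kind == "close":
            open_stack.pop()
            out.append(tok)
        elif kind == "br":
            # Close innermost first, emit the break, then reopen outermost first.
            out.extend(("close", name) for name in reversed(open_stack))
            out.append(tok)
            out.extend(("open", name) for name in open_stack)
        else:
            out.append(tok)
    return out


# The <span>...<bdi>...<br>...</bdi>...</span> part of the example above.
example = [
    ("text", "a"), ("open", "span"), ("text", "b"), ("open", "bdi"),
    ("text", "c"), ("br",), ("text", "d"), ("close", "bdi"),
    ("text", "e"), ("close", "span"), ("text", "f"),
]
rewritten = split_inline_at_breaks(example)
```

Here the rewritten stream closes bdi and span before the break and reopens
span and bdi after it, which matches the second form of the markup above.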

Isolates would no longer deal with missing/extra PDFs in CSS, but I
believe that those can and should be handled in a different manner
unrelated to isolates: having CSS remove extra PDFs and add missing ones
automatically. But that is another story.

The rest of this message is a first stab at the changes in the Writing
Modes and Text specs that would be necessary to do this. The good part is
that I believe that the change makes both of these specs simpler.

In "Inline Direction and Bidirectionality", the paragraph starting with the
words "Except when the ‘plaintext’ value of ‘unicode-bidi’ is in effect"
would now go like this:

------------
In most cases, the paragraph embedding level is set according to the value
of the ‘direction’ property of the paragraph's containing block. However,
when the containing block has ‘unicode-bidi: plaintext’, the paragraph
embedding level is set according to the heuristic given in steps P2 and
P3 of the Unicode algorithm.
------------

Note that the last sentence of that paragraph (starting with 'The
paragraph's element is usually the containing block') is no longer
necessary.

The isolate definition would go like this:

------------
isolate
If the element is inline, this value opens an additional level of embedding
with respect to the bidirectional algorithm. The direction of this
embedding level is given by the ‘direction’ property. However, for the
purpose of bidi resolution of the content around it, this embedding must be
treated as if it were an Object Replacement Character (U+FFFC). Inside the
element, reordering is done implicitly.
------------

The "Inside the element, reordering is done implicitly" sentence comes from
the current definition of "embed". I don't actually understand what it
means in either context, but copied it for your consideration.

When and if LRI and RLI make it into the Unicode spec, the following
sentence should be added after "as if it were an Object Replacement
Character (U+FFFC)":

------------
This corresponds to adding an LRI (U+????) for ‘direction: ltr’, or RLI
(U+????) for ‘direction: rtl’, at the start of the element and a PDF
(U+202C) at the end of the element.
------------

The plaintext definition would go like this:

------------
plaintext
This value behaves as ‘isolate’ except that the direction of the embedding
started when the element is inline, or the base direction of the paragraph
started when the element isn't inline, is determined not by the element's
computed ‘direction’ as usual, but by following the heuristic in rules P2
and P3 of the Unicode bidirectional algorithm on the content within the
embedding or paragraph.
------------

When and if FSI makes it into the Unicode spec, the following sentence
should be added on:

------------
This corresponds to adding an FSI (U+????) at the start of an inline
element and a PDF (U+202C) at its end.
------------
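
As a rough illustration of the two "corresponds to" sentences (for isolate
and for plaintext), the mapping from the CSS values to the proposed
characters might look like the sketch below. The code points for LRI, RLI
and FSI are still unassigned, so the private-use characters here are
hypothetical stand-ins, and wrap_inline is a name I made up.

```python
# Hypothetical stand-ins for the not-yet-assigned code points.
LRI, RLI, FSI = "\uE000", "\uE001", "\uE002"
PDF = "\u202C"  # existing POP DIRECTIONAL FORMATTING character


def wrap_inline(content, unicode_bidi, direction="ltr"):
    """Wrap the content of an inline element in the corresponding characters."""
    if unicode_bidi == "plaintext":
        opener = FSI  # direction comes from the content, not from 'direction'
    elif unicode_bidi == "isolate":
        opener = LRI if direction == "ltr" else RLI
    else:
        raise ValueError("only 'isolate' and 'plaintext' are sketched here")
    return opener + content + PDF
```

For example, wrap_inline("abc", "isolate", "rtl") would yield the RLI
stand-in, then "abc", then a PDF. Content containing a paragraph break would
have to be wrapped once per segment, per the close-and-reopen rule.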

The paragraph starting with "If an inline element is broken around a bidi
paragraph boundary" should now go something like this:

------------
When an inline element contains a bidi paragraph boundary (e.g. a block or
forced paragraph break), any embedding levels or overrides started by the
element are closed at the paragraph boundary and reopened on the other
side. When reopened, the embedding of a unicode-bidi:isolate or
unicode-bidi:plaintext element is once again treated as if it were an
Object Replacement Character (U+FFFC) for the purpose of bidi resolution of
the content around the embedding. The
base direction of the embedding reopened for a unicode-bidi:plaintext
element is determined by following the heuristic in rules P2 and P3 of the
Unicode bidirectional algorithm on the content within the reopened
embedding (which may differ from that of the embedding closed before the
paragraph boundary).
------------
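
To show why the reopened embedding of a plaintext element may get a
different direction, here is a small sketch that runs a plain first-strong
scan separately on each side of a paragraph boundary. It assumes, for
simplicity, that the boundary is a newline and that the content holds no
explicit formatting characters; both function names are my own, and
returning None when no strong character is found is also just my choice.

```python
import unicodedata


def first_strong(text):
    """Plain P2-style scan: direction of the first strong character, or None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def reopened_directions(content, boundary="\n"):
    """Direction of each embedding opened for a plaintext inline element."""
    return [first_strong(segment) for segment in content.split(boundary)]
```

With content that starts in Latin letters before the boundary and in Hebrew
letters after it, this returns "ltr" for the first embedding and "rtl" for
the reopened one.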

In Text Level 3, the "Bidirectionality and Line Boxes" section will need
changing because the "immediately contained" definition is no longer
necessary:

------------
The start and end edges of a line box are determined by the inline base
direction of the line box. In most cases, this is given by its containing
block's computed ‘direction’. However, if its containing block has
‘unicode-bidi: plaintext’ [CSS3-WRITING-MODES], the inline base direction of
the line box must be determined by the base direction of the bidi paragraph
for which the line box holds content. An empty line box [and so on, with no
further changes]
------------

Aharon
(Busily donning an anti-flame suit)
Received on Monday, 14 May 2012 20:07:12 UTC