Re: bidi and the initial current text position

Cameron,

I agree that the correct ordering should be "BA cd" if bidi processing is
being performed by the SVG UA.

The resolved bidi levels are as follows (assuming that the direction
property on the <text> element determines the default level):

Input Text      : ABcd
Resolved Levels : 1100
After Reorder   : BAcd
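
To make the reordering step concrete, here is a minimal sketch of UAX#9
rule L2 (reverse runs from the highest level down to the lowest odd
level). It is not the FOP code; the function name reorder_by_levels is
made up for illustration, and it assumes the levels have already been
resolved, one per character.

```python
def reorder_by_levels(text, levels):
    """Reorder one line per UAX#9 rule L2, given one resolved level per character.

    Illustrative helper only (hypothetical name, not taken from FOP).
    """
    if len(text) != len(levels):
        raise ValueError("need exactly one level per character")
    chars = list(text)
    lv = list(levels)
    odd_levels = [l for l in lv if l % 2 == 1]
    if not odd_levels:
        return text  # purely LTR line: nothing to reverse
    highest = max(lv)
    lowest_odd = min(odd_levels)
    # From the highest level down to the lowest odd level, reverse every
    # maximal run of characters whose level is at or above that level.
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(chars):
            if lv[i] >= level:
                j = i
                while j < len(chars) and lv[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]
                lv[i:j] = lv[i:j][::-1]  # levels travel with their characters
                i = j
            else:
                i += 1
    return "".join(chars)


print(reorder_by_levels("ABcd", [1, 1, 0, 0]))
```

With levels 1100 this yields "BAcd". For comparison, if the paragraph
level were 1 rather than 0, I would expect "AB cd" to resolve to levels
3,3,1,2,2 (worked out by hand, so treat that as an assumption), and the
same procedure then gives "cd BA".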

Attached are an input XSL-FO file and the output PDF file produced from the
XSL-FO equivalent of your input; as you can see, it does generate
"BAcd". Following is some logging output produced during the bidi
processing. You can see that the override (on "AB") causes wrapping
RLO (U+202E) and PDF (U+202C) Unicode bidi control characters to be
synthesized and fed into the bidi processor; they are removed again later.
For more details, see XSL-FO 5.8 Unicode Bidi Processing [1]. You can also
find the code for this processing at [2].

BD: RESOLVE: org.apache.fop.fo.pagination.PageSequence@6f18278a[@id=]
BR[  0,  4] : 1: SOR(R), EOR(R)
BR[  4,  6] : 0: SOR(R), EOR(L)
RL: CC(6)
BD: default level(0)
&#x202E;    : RLO  1
A           : L    1
B           : L    1
&#x202C;    : PDF  1
c           : L    0
d           : L    0
AL(1): B[0,1][0](1)
AL(1): T[1,3][1](1)
AL(1): B[3,4][3](1)
AL(0): T[4,6][4](0)
DR: block{ <&#x202E;AB&#x202C;cd>, intervals
<B[0,1][0](1),T[1,3][1](1),B[3,4][3](1),T[4,6][4](0)>}
BD: REORDER: INPUT:
RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 0, content =
<AB> }
RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content =
<cd> }
BD: REORDER: SPLIT INLINES:
RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 0, content =
<AB> }
RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content =
<cd> }
BD: REORDER: { min = 0, max = 1}
BD: REORDER: REORDERED RUNS:
RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 1, content =
<AB> }
RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content =
<cd> }
BD: REORDER: REORDERED WORDS:
RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 1, content =
<BA> }
RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content =
<cd> }
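
The wrapping of the override text in RLO/PDF described above, and the
later removal of those controls, can be sketched roughly as follows.
This is an illustration rather than the FOP implementation; the helper
names wrap_override and strip_bidi_controls are made up, and only
Python's standard unicodedata module is used.

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def wrap_override(text, direction):
    """Wrap text in the override control for `direction`, closed by PDF."""
    opener = RLO if direction == "rtl" else LRO
    return opener + text + PDF


def strip_bidi_controls(text):
    """Drop explicit embedding/override controls once levels are resolved."""
    controls = {"LRE", "RLE", "LRO", "RLO", "PDF"}
    return "".join(
        ch for ch in text if unicodedata.bidirectional(ch) not in controls
    )


# The tspan content "AB" carries the rtl override; "cd" follows it.
marked = wrap_override("AB", "rtl") + "cd"
classes = [unicodedata.bidirectional(ch) for ch in marked]
print(classes)
print(strip_bidi_controls(marked))
```

The printed bidi classes line up with the second column of the log
above (RLO, L, L, PDF, L, L), and stripping the controls gives back the
four characters "ABcd" that go on to be reordered.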

G.

[1] http://www.w3.org/TR/2006/REC-xsl11-20061205/#d0e4879
[2] http://github.com/skynavga/fop/blob/i18n.arabic/src/java/org/apache/fop/layoutmgr/BidiUtil.java

On Tue, May 17, 2011 at 8:35 PM, Cameron McCormack <cam@mcc.id.au> wrote:

> Hi Alex.
>
> Thanks for the reply.  Text is hard. ;)
>
> Cameron McCormack:
> > > > <style>
> > > >   text { direction: ltr }
> > > >   tspan { direction: rtl; unicode-bidi: bidi-override }
> > > > </style>
> > > > <text x="100"><tspan>AB</tspan> cd</text>
> > > >
> > > > The visual order of this is “BA cd”. The <text> has
> > > > text-anchor:start. Where is the text positioned?
> …
> > >  Gecko:        |BA cd  (where “|” is the vertical line at x = 100)
> > >  IE:           |BA cd
> > >  WebKit:  BA cd|
> > >  Opera:   AB cd|
> > >  Batik:        |cd BA
>
> Alex Danilo:
> > OK, so our result (just to complicate things:-) is:
> >
> > Abbra: cd|BA
> >
> > The visual order of Batik is close to correct IMO.
>
> That’s surprising to me, although I still don’t understand all the
> intricacies of bidi layout so it could well be correct.  Why isn’t
> “BAcd” the correct visual order?  Does the direction:ltr on the <text>
> not make this an overall LTR chunk of text with an RTL run at the start
> of it?  If I changed the example to
>
>  <text x="100">xy <tspan>AB</tspan> cd</text>
>
> I would expect the visual order to be “xy BA cd”, so I am confused as to
> why removing the “xy” should result in the “BA” going to the right side
> of the “cd”.
>
> > Existence of the <tspan> doesn't create a new text chunk, it's just
> > defining the directionality isn't it? If so, you are ordering the
> > string:
> >
> > "AB cd" where "AB" is considered to be RTL, i.e. a UAX#9 embedding
> > level of 1, whilst the " cd" has an embed level of 0. Running UAX#9
> > will swap the "AB" as "BA" across to the right _to be read_ as the
> > first string in the RTL line. The visual order can't start with BA,
> > that's just plain broken.
>
> Ah, so why is it an RTL line and not an LTR line?  Is there a heuristic
> there based on the first logical character being RTL meaning that the
> line as a whole is considered RTL?  If so, does the direction:ltr not
> override that?
>
> > I don't see any prose in the spec. that says the existence of the
> > <tspan> or the unicode-bidi:bidi-override etc. create a new text
> > chunk. So I think the re-order should happen on the entire text chunk
> > since the <tspan> does not introduce a new 'X' position or anything
> > else that could be considered a chunk maker. They are 2 'runs' of
> > text, but still one chunk I would have thought.
>
> Yes I agree with that.  (A “chunk maker” sounds like a particularly
> nasty combination of alcoholic beverages. ;))
>
> > Now as for the space - it's in the LTR content " cd" and since the
> > "AB" gets swapped across to the right side, the space leads the "cd"
> > and so there should be no space after the "cd". Are you sure Batik
> > stuck a space in there?
>
> Yeah: http://mcc.id.au/temp/bps-batik.png
>
> If I construct the equivalent HTML example (without the positioning, and
> with a background colour on the RTL span):
>
>  http://people.mozilla.org/~cmccormack/tests/bidi-simple.html
>
> then I find that browsers uniformly render it as “BA cd”.
>
> > As for "current text position", I think what we're doing is wrong
> > here. From the text in the spec. I'd expect to see:
> >
> > " cdBA|"
> >
> > namely, that the first logical character (the start character) is
> > placed to the left of the starting position.
>
> OK.  I’ll wait to see your reasoning on the “cdBA” layout as opposed to
> “BAcd”, but if you are right then that does make sense.  If “BAcd” is
> the right layout (which is what I was assuming) then it’s trickier, and
> my questions from my original mail about what that x="100" actually
> means stand.
>
> > Anyway - more data for you, but the interoperability is a mess,
> > and the BIDI handling of the implementations more so. If that was
> > real Arabic, it would be totally unreadable in 4 out of the 6
> > implementations...
>
> Thanks for looking into it.  I agree this is a bit of a mess, but on the
> bright side, it gives us the opportunity to make changes for the better.
>
> --
> Cameron McCormack ≝ http://mcc.id.au/
>
>

Received on Wednesday, 18 May 2011 03:30:33 UTC