Re: bidi and the initial current text position

Alex,

I'm afraid I don't agree with your statement that Batik is correct if it
produces "cd BA". [But read further in this message.] I have verified the
implementation I referred to in my previous email with the 216,357 test
sequences contained in the UAX#9 test suite [1], so I know that it is a
correct implementation. By the way, the presence of the space before "cd"
has no effect on the ordering.

In this case, the resolved levels are:

Input Text      : AB cd
Resolved Levels : 11000
After Reorder   : BA cd
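
The reordering step for this case is small enough to check by hand. The
following is a minimal Python sketch of the reversal rule (rule L2 in UAX#9)
applied to levels that have already been resolved. The function name
reorder_visual is illustrative and is not taken from FOP or Batik, and the
sketch ignores mirroring, line breaking and trailing-whitespace handling.

```python
def reorder_visual(text, levels):
    """Reverse runs from the highest level down to the lowest odd level."""
    if len(text) != len(levels):
        raise ValueError("text and levels must have the same length")
    # Keep each character paired with its level so they move together.
    items = list(zip(levels, text))
    odd_levels = [lv for lv, _ in items if lv % 2 == 1]
    if not odd_levels:
        return text  # nothing is right-to-left, logical order is visual order
    highest = max(lv for lv, _ in items)
    lowest_odd = min(odd_levels)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(items):
            if items[i][0] >= level:
                # Find the end of the contiguous run at this level or higher.
                j = i
                while j < len(items) and items[j][0] >= level:
                    j += 1
                items[i:j] = items[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(ch for _, ch in items)


print(reorder_visual("AB cd", [1, 1, 0, 0, 0]))  # BA cd
```

Only the run at level 1 is reversed, so the space and "cd" stay where they
are in logical order.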

See the updated attachments and also the following updated log (with a
single SPACE added before "cd"):

BD: RESOLVE: org.apache.fop.fo.pagination.PageSequence@1acc0e01[@id=]
BR[  0,  4] : 1: SOR(R), EOR(R)
BR[  4,  7] : 0: SOR(R), EOR(L)
RL: CC(7)
BD: default level(0)
‮    : RLO  1
A           : L    1
B           : L    1
‬    : PDF  1
     : WS   0
c           : L    0
d           : L    0
AL(1): B[0,1][0](1)
AL(1): T[1,3][1](1)
AL(1): B[3,4][3](1)
AL(0): T[4,7][4](0)
DR: block{ <&#x202E;AB&#x202C; cd>, intervals <B[0,1][0](1),T[1,3][1](1),B[3,4][3](1),T[4,7][4](0)>}
BD: REORDER: INPUT:
RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 0, content = <AB> }
RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
BD: REORDER: SPLIT INLINES:
RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 0, content = <AB> }
RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
BD: REORDER: { min = 0, max = 1}
BD: REORDER: REORDERED RUNS:
RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 1, content = <AB> }
RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
BD: REORDER: REORDERED WORDS:
RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 1, content = <BA> }
RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }

*Note well*, however, that if the paragraph embedding level is changed to 1,
i.e., RTL, then you do in fact get "cd BA" as you suggest. In this case, one
would have the following:

Input Text      : AB cd
Resolved Levels : 33122
After Reorder   : cd BA
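
The change in levels between the two cases follows from two UAX#9 rules: an
RLO pushes the embedding level to the least odd level above the current one,
and strong L text in an odd (RTL) context resolves one level higher. Here is
a simplified Python sketch for this one input only. It assumes the space
ends up at the paragraph level, which holds for both cases here but is not
the general rule for neutrals; the function names are illustrative.

```python
def next_odd_level(level):
    """Least odd level strictly greater than `level` (what an RLO pushes)."""
    return level + 1 if level % 2 == 0 else level + 2


def resolve_levels(paragraph_level):
    """Levels for <RLO>AB<PDF> cd, with the format characters omitted."""
    override = next_odd_level(paragraph_level)
    # L text keeps an even embedding level, and is raised by one above an odd one.
    if paragraph_level % 2 == 0:
        ltr_text = paragraph_level
    else:
        ltr_text = paragraph_level + 1
    # Assumption: the space resolves to the paragraph level in both cases.
    return [override, override, paragraph_level, ltr_text, ltr_text]


print(resolve_levels(0))  # [1, 1, 0, 0, 0]
print(resolve_levels(1))  # [3, 3, 1, 2, 2]
```

With paragraph level 1 the lowest odd level is 1, so the final reversal pass
covers the whole line, which is what moves "BA" to the right of "cd".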

Perhaps Batik is not correctly assigning the paragraph embedding level.
Since SVG doesn't have a paragraph construct per se, I would expect the
direction property on the <text/> element to determine the paragraph
embedding level. In Cameron's example, that was "ltr", which would make the
paragraph embedding level 0, not 1.
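
To make that expectation concrete, here is a Python sketch of how an SVG
<text> element might be turned into input for the bidi algorithm. The
mapping from direction to paragraph level is the assumption stated above,
not something the SVG specification is known to mandate. The RLO/LRO ... PDF
wrapping follows the CSS3 Writing Modes description of bidi-override quoted
further down. All function names are hypothetical.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def paragraph_level(text_direction):
    """Assumed mapping: 'direction' on <text> gives the paragraph level."""
    if text_direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    return 1 if text_direction == "rtl" else 0


def bidi_input(runs):
    """Join (content, override) runs; overridden runs get LRO/RLO ... PDF."""
    parts = []
    for content, override in runs:
        if override == "rtl":
            parts.append(RLO + content + PDF)
        elif override == "ltr":
            parts.append(LRO + content + PDF)
        else:
            parts.append(content)  # no bidi-override on this run
    return "".join(parts)


# Cameron's example: text { direction: ltr }, tspan with rtl bidi-override.
level = paragraph_level("ltr")
text = bidi_input([("AB", "rtl"), (" cd", None)])
print(level, len(text), [f"U+{ord(c):04X}" for c in text])
```

The resulting string has seven code points, consistent with the 0..7 index
range that appears in the log above.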

I have also added test XSL-FO and output PDF files showing the results if
the paragraph embedding level is changed to 1 (here, by adding a
writing-mode='rl' attribute on fo:block-container).

Regards,
Glenn

[1] http://www.unicode.org/Public/UNIDATA/BidiTest.txt

On Tue, May 17, 2011 at 9:36 PM, Alex Danilo <alex@abbra.com> wrote:

> Hi Cam,
>
> --Original Message--:
> >Hi Alex.
> >
> >Thanks for the reply.  Text is hard. ;)
>
> Sure is!
>
> >Cameron McCormack:
> >> > > <style>
> >> > >   text { direction: ltr }
> >> > >   tspan { direction: rtl; unicode-bidi: bidi-override }
> >> > > </style>
> >> > > <text x="100"><tspan>AB</tspan> cd</text>
> >> > >
> >> > > The visual order of this is “BA cd”. The <text> has
> >> > > text-anchor:start. Where is the text positioned?
> >…
> >> >  Gecko:        |BA cd  (where “|” is the vertical line at x = 100)
> >> >  IE:           |BA cd
> >> >  WebKit:  BA cd|
> >> >  Opera:   AB cd|
> >> >  Batik:        |cd BA
> >
> >Alex Danilo:
> >> OK, so our result (just to complicate things:-) is:
> >>
> >> Abbra: cd|BA
> >>
> >> The visual order of Batik is close to correct IMO.
>
> I just did a more detailed check with one of my BIDI implementations
> that passes a bunch of tests and is independent of the SVG engine.
>
> Batik is correct with the space placement, so "cd BA" is the expected
> result. I can't remember the detail why - UAX#9 has complex rules for
> numbers, white-space etc. and that just falls out.
>
> >That’s surprising to me, although I still don’t understand all the
> >intricacies of bidi layout so it could well be correct.  Why isn’t
> >“BAcd” the correct visual order?  Does the direction:ltr on the <text>
> >not make this an overall LTR chunk of text with an RTL run at the start
> >of it?  If I changed the example to
>
> This is where I'm not sure what we expect to happen.
>
> If you treat the "AB cd" as the thing you do BIDI processing on, then
> the embed levels are 11100 - seems the white-space is white space and so
> gets the RTL level (I think, without jumping in the debugger I'm not sure).
>
> >  <text x="100">xy <tspan>AB</tspan> cd</text>
> >
> >I would expect the visual order to be “xy BA cd”, so I am confused as to
> >why removing the “xy” should result in the “BA” going to the right side
> >of the “cd”.
>
> Yes, the order is correct, since you have "xy AB cd" which gives a BIDI
> level ordering of 00011000 or something like that, so because it's
> embedded within 2 LTR runs it doesn't swap. The BIDI swapping happens
> from the ends with the same embedding level and so, the cases are
> different.
>
> It is possible to force the start embedding level - i.e. tell the BIDI
> code that the starting level is '0' from the <text> and then the visual
> order would be more like you expected but I'm not sure that will be
> possible for things like Batik, etc. that rely on external libraries to
> handle the reordering.
>
> The question is: does the absence of any logical characters at the start
> of a text chunk with given directionality affect BIDI processing?
>
> Have you tried setting RTL on the <text> and LTR on the <tspan> to see
> what the implementations do? That might make another interesting
> data point.
>
> Also - while writing this reply I forced the starting BIDI classification
> level in my code to 0 (LTR) and it still spat out the same layout. I don't
> want to have to read UAX#9 again, not enough time right now, but I expect
> the RTL as the first actual content is dictating it get placed at the
> right side of the line since it is the starting text.
>
> >> Existence of the <tspan> doesn't create a new text chunk, it's just
> >> defining the directionality isn't it? If so, you are ordering the
> >> string:
> >>
> >> "AB cd" where "AB" is considered to be RTL, i.e. a UAX#9 embedding
> >> level of 1, whilst the " cd" has an embed level of 0. Running UAX#9
> >> will swap the "AB" as "BA" across to the right _to be read_ as the
> >> first string in the RTL line. The visual order can't start with BA,
> >> that's just plain broken.
> >
> >Ah, so why is it an RTL line and not an LTR line?  Is there a heuristic
> >there based on the first logical character being RTL meaning that the
> >line as a whole is considered RTL?  If so, does the direction:ltr not
> >override that?
>
> Maybe it should, but reading UAX#9 is better than guessing this one.
> If you analyze the lines purely by the characteristics of the logical
> characters you'll get what I was describing and Batik does (and by
> implication Java Unicode handling I guess). If you split into chunks at
> the <tspan> boundaries you'll get the alternate. I don't know if the spec.
> mandates any sort of implicit embedding level from the 'LTR' on the <text>
> but then again perhaps it should...
>
> In CSS3 Writing modes http://dev.w3.org/csswg/css3-writing-modes/
> I see:
>
> "The ‘direction’ property has no effect on bidi reordering when specified
> on inline elements whose ‘unicode-bidi’ property's value is ‘normal’"
>
> and
>
> "bidi-override For inline elements this creates an override. For
> block-container elements this creates an override for inline-level
> descendants not within another block container element. This means that
> inside the element, reordering is strictly in sequence according to the
> ‘direction’ property; the implicit part of the bidirectional algorithm is
> ignored. This corresponds to adding a LRO (U+202D), for ‘direction: ltr’,
> or RLO (U+202E), for ‘direction: rtl’, at the start of the element and a
> PDF (U+202C) at the end of the element"
>
> So maybe we need to construct a test with RLO and PDF instead of the markup
> you have and see what happens.
>
> The first comment about the direction property doesn't seem to indicate
> we should assume LTR for the content, since there is no bidi-override for
> the <text> element itself.
>
> >> I don't see any prose in the spec. that says the existence of the
> >> <tspan> or the unicode-bidi:bidi-override etc. create a new text
> >> chunk. So I think the re-order should happen on the entire text chunk
> >> since the <tspan> does not introduce a new 'X' position or anything
> >> else that could be considered a chunk maker. They are 2 'runs' of
> >> text, but still one chunk I would have thought.
> >
> >Yes I agree with that.  (A “chunk maker” sounds like a particularly
> >nasty combination of alcoholic beverages. ;))
>
> Indeed!
>
> >> Now as for the space - it's in the LTR content " cd" and since the
> >> "AB" gets swapped across to the right side, the space leads the "cd"
> >> and so there should be no space after the "cd". Are you sure Batik
> >> stuck a space in there?
> >
> >Yeah: http://mcc.id.au/temp/bps-batik.png
>
> As I said above Batik is correct here, if analyzing just the concatenated
> string.
>
> >If I construct the equivalent HTML example (without the positioning, and
> >with a background colour on the RTL span):
> >
> >  http://people.mozilla.org/~cmccormack/tests/bidi-simple.html
> >
> >then I find that browsers uniformly render it as “BA cd”.
>
> That's good to know. But are the browsers treating each <span>
> as what we term a chunk? It would be good to stick some Arabic
> characters and latin in a single string to see what they do without
> the explicit settings, then separate the Arabic with a <span> to
> see if the BIDI analysis is being done entire line, or piece by piece.
>
> >> As for "current text position", I think what we're doing is wrong
> >> here. From the text in the spec. I'd expect to see:
> >>
> >> " cdBA|"
> >>
> >> namely, that the first logical character (the start character) is
> >> placed to the left of the starting position.
> >
> >OK.  I’ll wait to see your reasoning on the “cdBA” layout as opposed to
> >“BAcd”, but if you are right then that does make sense.  If “BAcd” is
> >the right layout (which is what I was assuming) then it’s trickier, and
> >my questions from my original mail about what that x="100" actually
> >means stands.
> >
> >> Anyway - more data for you, but the interoperability is a mess,
> >> and the BIDI handling of the implementations more so. If that was
> >> real Arabic, it would be totally unreadable in 4 out of the 6
> >> implementations...
> >
> >Thanks for looking into it.  I agree this is a bit of a mess, but on the
> >bright side, it gives us the opportunity to make changes for the better.
>
> "Why can't they just speak English":-)
>
> Alex
>
> >--
> >Cameron McCormack ≝ http://mcc.id.au/
> >
> >
> >
>
>
>
>

Received on Wednesday, 18 May 2011 04:05:56 UTC