Re: bidi and the initial current text position

Alex,

Glad we are converging. By the way, attached is a more complex example of
vowelized Arabic, exhibiting a variety of advanced typographic features
driven by the OpenType GSUB/GPOS processing, which I perform directly in
FOP, including:

   - obligatory ligatures
   - conditional (font selected) ligatures
   - combining mark ligatures (mark on mark)
   - mark on base positioning
   - mark on ligature positioning
   - mark on mark positioning
   - base on base positioning (end of ayah)

As you can see, there is some variation across the four fonts used in the
example. FOP merely applies the available GSUB/GPOS tables during the
character-to-glyph mapping (substitution) and positioning process, so the
font designer determines the result to a significant extent.

Cheers, Glenn


On Tue, May 17, 2011 at 10:21 PM, Alex Danilo <alex@abbra.com> wrote:

> Hi Glenn,
>
>        Doing more experiments I agree with what you're saying.
>
>        At issue is the unicode-bidi property handling.  We don't
> process it at all, and the lack of RLO and PDF behaviour results
> in what Batik (and we) do. So, given the text in CSS3 and another
> little test I wrote here we get the "BA cd" ordering as well.
>
>        Thanks for running the tests.
>
> Alex
>
> --Original Message--:
> >
> >Alex,
> >
> >I'm afraid I don't agree with your statement that Batik is correct if it
> >produces "cd BA". [But read further in this message.] I have verified the
> >implementation I referred to in my previous email with the 216,357 test
> >sequences contained in the UAX#9 test suite [1], so I know that it is a
> >correct implementation. By the way, the presence of the space before "cd"
> >has no effect on the ordering.
> >
> >In this case, the resolved levels are:
> >
> >Input Text      : AB cd
> >Resolved Levels : 11000
> >After Reorder   : BA cd
> >
> >See the updated attachments and also the following updated log (with a
> >single SPACE added before "cd"):
> >
> >BD: RESOLVE: org.apache.fop.fo.pagination.PageSequence@1acc0e01[@id=]
> >BR[  0,  4] : 1: SOR(R), EOR(R)
> >BR[  4,  7] : 0: SOR(R), EOR(L)
> >RL: CC(7)
> >BD: default level(0)
> >&#x202E;    : RLO  1
> >A           : L    1
> >B           : L    1
> >&#x202C;    : PDF  1
> >&#x0020;    : WS   0
> >c           : L    0
> >d           : L    0
> >AL(1): B[0,1][0](1)
> >AL(1): T[1,3][1](1)
> >AL(1): B[3,4][3](1)
> >AL(0): T[4,7][4](0)
> >DR: block{ <&#x202E;AB&#x202C; cd>, intervals <B[0,1][0](1),T[1,3][1](1),B[3,4][3](1),T[4,7][4](0)>}
> >BD: REORDER: INPUT:
> >RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 0, content = <AB> }
> >RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
> >RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
> >BD: REORDER: SPLIT INLINES:
> >RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 0, content = <AB> }
> >RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
> >RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
> >BD: REORDER: { min = 0, max = 1}
> >BD: REORDER: REORDERED RUNS:
> >RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 1, content = <AB> }
> >RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
> >RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
> >BD: REORDER: REORDERED WORDS:
> >RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 1, content = <BA> }
> >RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
> >RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
> >
> >
> >Note well, however, that if the paragraph embedding level is changed to 1,
> >i.e., RTL, then you do in fact get "cd BA" as you suggest. In this case, one
> >would have the following:
> >
> >Input Text      : AB cd
> >Resolved Levels : 33122
> >After Reorder   : cd BA
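
Both "After Reorder" results above follow from UAX#9 rule L2 once the levels
are resolved. The Python below is only an illustrative sketch of that rule,
assuming the resolved levels are already known; the name reorder_visual is
made up for this note and it is not the FOP implementation.

```python
def reorder_visual(text, levels):
    """Sketch of UAX#9 rule L2 for a single line.

    From the highest level down to the lowest odd level, reverse every
    contiguous run of characters whose level is at or above that level.
    `levels` holds one resolved embedding level per character of `text`.
    """
    if len(text) != len(levels):
        raise ValueError("need exactly one level per character")
    odd_levels = [lvl for lvl in levels if lvl % 2 == 1]
    if not odd_levels:
        return text  # purely LTR line: nothing to reverse
    chars = list(text)
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                # Run boundaries depend only on position, so the levels
                # list does not need to be permuted along with the text.
                chars[i:j] = chars[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)


# Paragraph embedding level 0 (LTR), levels 11000
print(reorder_visual("AB cd", [1, 1, 0, 0, 0]))  # BA cd
# Paragraph embedding level 1 (RTL), levels 33122
print(reorder_visual("AB cd", [3, 3, 1, 2, 2]))  # cd BA
```

The same text gives a different visual order purely because the paragraph
embedding level changes the resolved levels that feed this step.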
> >
> >
> >Perhaps Batik is not correctly assigning the paragraph embedding level.
> >Since SVG doesn't have a paragraph construct per se, I would expect the
> >direction property on the <text/> element to determine the paragraph
> >embedding level. In Cameron's example, that was "ltr", which would make the
> >paragraph embedding level 0, not 1.
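
To make the paragraph-level point concrete, here is a rough Python sketch of
the two ways the level can be chosen: an explicit direction supplied by a
higher-level protocol (as I would expect from the direction property on
<text/>), or the UAX#9 first-strong-character rules P2/P3. The function name
and the direction parameter are hypothetical, and the sketch ignores the
isolate handling found in later versions of UAX#9.

```python
import unicodedata


def paragraph_level(text, direction=None):
    """Return 0 (LTR) or 1 (RTL) as the paragraph embedding level.

    `direction` is a hypothetical stand-in for a higher-level protocol
    such as a 'direction' property; when it is None, fall back to the
    first strong character (bidi class L, R or AL).
    """
    if direction == "ltr":
        return 0
    if direction == "rtl":
        return 1
    if direction is not None:
        raise ValueError("direction must be 'ltr', 'rtl' or None")
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return 0  # no strong character found: default to LTR


print(paragraph_level("\u202EAB\u202C cd"))       # 0: RLO is not a strong type
print(paragraph_level("\u05D0\u05D1 cd"))         # 1: first strong is Hebrew
print(paragraph_level("\u05D0\u05D1 cd", "ltr"))  # 0: explicit direction wins
```

Note that in the test string "A" and "B" are ordinary class L letters, so the
first-strong rule alone would also give level 0 for this example.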
> >
> >I have also added test XSL-FO and output PDF files showing the results if
> >the paragraph embedding level is changed to 1 (here, by adding a
> >writing-mode='rl' attribute on fo:block-container).
> >
> >Regards,
> >Glenn
> >
> >[1] http://www.unicode.org/Public/UNIDATA/BidiTest.txt
> >
> >On Tue, May 17, 2011 at 9:36 PM, Alex Danilo <alex@abbra.com> wrote:
> >
> >Hi Cam,
> >
> >--Original Message--:
> >>Hi Alex.
> >>
> >>Thanks for the reply.  Text is hard. ;)
> >
> >Sure is!
> >
> >>Cameron McCormack:
> >>>>><style>
> >>>>>  text { direction: ltr }
> >>>>>  tspan { direction: rtl; unicode-bidi: bidi-override }
> >>>>></style>
> >>>>><text x="100"><tspan>AB</tspan> cd</text>
> >>>>>
> >>>>>The visual order of this is “BA cd”. The <text> has
> >>>>>text-anchor:start. Where is the text positioned?
> >>…
> >>>> Gecko:        |BA cd  (where “|” is the vertical line at x = 100)
> >>>> IE:           |BA cd
> >>>> WebKit:  BA cd|
> >>>> Opera:   AB cd|
> >>>> Batik:        |cd BA
> >>
> >>Alex Danilo:
> >>>OK, so our result (just to complicate things:-) is:
> >>>
> >>>Abbra: cd|BA
> >>>
> >>>The visual order of Batik is close to correct IMO.
> >
> >I just did a more detailed check with one of my BIDI implementations
> >that passes a bunch of tests and is independent of the SVG engine.
> >
> >Batik is correct with the space placement, so "cd BA" is the expected
> >result. I can't remember the detail why - UAX#9 has complex rules for
> >numbers, white-space etc. and that just falls out.
> >
> >>That’s surprising to me, although I still don’t understand all the
> >>intricacies of bidi layout so it could well be correct.  Why isn’t
> >>“BAcd” the correct visual order?  Does the direction:ltr on the <text>
> >>not make this an overall LTR chunk of text with an RTL run at the start
> >>of it?  If I changed the example to
> >
> >This is where I'm not sure what we expect to happen.
> >
> >If you treat the "AB cd" as the thing you do BIDI processing on, then
> >the embed levels are 11100 - seems the white-space is white space and so
> >gets the RTL level (I think, without jumping in the debugger I'm not
> >sure).
> >
> >> <text x="100">xy <tspan>AB</tspan> cd</text>
> >>
> >>I would expect the visual order to be “xy BA cd”, so I am confused as to
> >>why removing the “xy” should result in the “BA” going to the right side
> >>of the “cd”.
> >
> >Yes, the order is correct, since you have "xy AB cd" which gives a BIDI
> >level ordering of 00011000 or something like that, so because it's
> >embedded within 2 LTR runs it doesn't swap. The BIDI swapping happens
> >from the ends with the same embedding level and so, the cases are
> >different.
> >
> >It is possible to force the start embedding level - i.e. tell the BIDI
> >code that the starting level is '0' from the <text> and then the visual
> >order would be more like you expected but I'm not sure that will be
> >possible for things like Batik, etc. that rely on external libraries to
> >handle the reordering.
> >
> >The question is: does the absence of any logical characters at the start
> >of a text chunk with given directionality affect BIDI processing?
> >
> >Have you tried setting RTL on the <text> and LTR on the <tspan> to see
> >what the implementations do? That might make another interesting
> >data point.
> >
> >Also - while writing this reply I forced the starting BIDI classification
> >level in my code to 0 (LTR) and it still spat out the same layout. I don't
> >want to have to read UAX#9 again, not enough time right now but I expect
> >the RTL as the first actual content is dictating it get placed at the
> >right side of the line since it is the starting text.
> >
> >>>Existence of the <tspan> doesn't create a new text chunk, it's just
> >>>defining the directionality isn't it? If so, you are ordering the
> >>>string:
> >>>
> >>>"AB cd" where "AB" is considered to be RTL, i.e. a UAX#9 embedding
> >>>level of 1, whilst the " cd" has an embed level of 0. Running UAX#9
> >>>will swap the "AB" as "BA" across to the right _to be read_ as the
> >>>first string in the RTL line. The visual order can't start with BA,
> >>>that's just plain broken.
> >>
> >>Ah, so why is it an RTL line and not an LTR line?  Is there a heuristic
> >>there based on the first logical character being RTL meaning that the
> >>line as a whole is considered RTL?  If so, does the direction:ltr not
> >>override that?
> >
> >Maybe it should, but reading UAX#9 is better than guessing this one.
> >If you analyze the lines purely by the characteristics
> >of the logical characters you'll get what I was describing and Batik does
> >(and by implication Java Unicode handling I guess). If you split into
> >chunks at the <tspan> boundaries you'll get the alternate. I don't know if
> >the spec. mandates any sort of implicit embedding level from the 'LTR' on
> >the <text> but then again perhaps it should...
> >
> >In CSS3 Writing modes http://dev.w3.org/csswg/css3-writing-modes/
> >I see:
> >
> >"The ‘direction’ property has no effect on bidi reordering when
> >specified on inline elements whose ‘unicode-bidi’ property's value is
> >‘normal’"
> >
> >and
> >
> >"bidi-override For inline elements this creates an override. For
> >block-container elements this creates an override for inline-level
> >descendants not within another block container element. This means that
> >inside the element, reordering is strictly in sequence according to the
> >‘direction’ property; the implicit part of the bidirectional algorithm is
> >ignored. This corresponds to adding a LRO (U+202D), for ‘direction: ltr’,
> >or RLO (U+202E), for ‘direction: rtl’, at the start of the element
> >and a PDF (U+202C) at the end of the element"
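
The equivalence quoted above can be sketched in a few lines of Python. This is
only an illustration under my own naming (apply_bidi_override is
hypothetical): it wraps a span in the override controls the CSS text
describes and lists the Unicode bidi classes of the resulting seven
characters, which match the classes shown in the FOP log earlier in this
thread.

```python
import unicodedata

LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def apply_bidi_override(text, direction):
    """Wrap `text` in the control characters that correspond to
    unicode-bidi: bidi-override with the given direction."""
    if direction == "rtl":
        opener = RLO
    elif direction == "ltr":
        opener = LRO
    else:
        raise ValueError("direction must be 'ltr' or 'rtl'")
    return opener + text + PDF


# <tspan>AB</tspan> cd with direction: rtl; unicode-bidi: bidi-override
marked = apply_bidi_override("AB", "rtl") + " cd"
for ch in marked:
    print(f"U+{ord(ch):04X} : {unicodedata.bidirectional(ch)}")
```

Feeding a string like this to a UAX#9 implementation is one way to build the
RLO/PDF test suggested just below without relying on markup.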
> >
> >So maybe we need to construct a test with RLO and PDF instead of the
> >markup you have and see what happens.
> >
> >The first comment about the direction property doesn't seem to indicate
> >we should assume LTR for the content, since there is no bidi-override for
> >the <text> element itself.
> >
> >>>I don't see any prose in the spec. that says the existence of the
> >>><tspan> or the unicode-bidi:bidi-override etc. create a new text
> >>>chunk. So I think the re-order should happen on the entire text chunk
> >>>since the <tspan> does not introduce a new 'X' position or anything
> >>>else that could be considered a chunk maker. They are 2 'runs' of
> >>>text, but still one chunk I would have thought.
> >>
> >>Yes I agree with that.  (A “chunk maker” sounds like a particularly
> >>nasty combination of alcoholic beverages. ;))
> >
> >Indeed!
> >
> >>>Now as for the space - it's in the LTR content " cd" and since the
> >>>"AB" gets swapped across to the right side, the space leads the "cd"
> >>>and so there should be no space after the "cd". Are you sure Batik
> >>>stuck a space in there?
> >>
> >>Yeah: http://mcc.id.au/temp/bps-batik.png
> >
> >As I said above Batik is correct here, if analyzing just the concatenated
> >string.
> >
> >>If I construct the equivalent HTML example (without the positioning, and
> >>with a background colour on the RTL span):
> >>
> >> http://people.mozilla.org/~cmccormack/tests/bidi-simple.html
> >>
> >>then I find that browsers uniformly render it as “BA cd”.
> >
> >That's good to know. But are the browsers treating each <span>
> >as what we term a chunk? It would be good to stick some Arabic
> >characters and Latin in a single string to see what they do without
> >the explicit settings, then separate the Arabic with a <span> to
> >see if the BIDI analysis is being done entire line, or piece by piece.
> >
> >>>As for "current text position", I think what we're doing is wrong
> >>>here. From the text in the spec. I'd expect to see:
> >>>
> >>>"cdBA|"
> >>>
> >>>namely, that the first logical character (the start character) is
> >>>placed to the left of the starting position.
> >>
> >>OK.  I’ll wait to see your reasoning on the “cdBA” layout as opposed to
> >>“BAcd”, but if you are right then that does make sense.  If “BAcd” is
> >>the right layout (which is what I was assuming) then it’s trickier, and
> >>my questions from my original mail about what that x="100" actually
> >>means stand.
> >>
> >>>Anyway - more data for you, but the interoperability is a mess,
> >>>and the BIDI handling of the implementations more so. If that was
> >>>real Arabic, it would be totally unreadable in 4 out of the 6
> >>>implementations...
> >>
> >>Thanks for looking into it.  I agree this is a bit of a mess, but on the
> >>bright side, it gives us the opportunity to make changes for the better.
> >
> >"Why can't they just speak English":-)
> >
> >Alex
> >
> >>--
> >
> >>Cameron McCormack ≝ http://mcc.id.au/
> >>
> >>
> >>
> >
> >
> >
> >
> >
> >
> ><?xml version="1.0" encoding="utf-8"?>
> ><fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
> >  <fo:layout-master-set>
> >    <fo:simple-page-master master-name="simple" page-height="29.7cm" page-width="21cm" margin="1cm 2.5cm 2cm 2.5cm">
> >      <fo:region-body margin-top="3cm"/>
> >      <fo:region-before extent="3cm"/>
> >      <fo:region-after extent="1.5cm"/>
> >    </fo:simple-page-master>
> >  </fo:layout-master-set>
> >  <fo:page-sequence master-reference="simple">
> >    <fo:flow flow-name="xsl-region-body">
> >      <fo:block-container writing-mode="lr">
> >        <fo:block><fo:bidi-override unicode-bidi="bidi-override" direction="rtl">AB</fo:bidi-override> cd</fo:block>
> >      </fo:block-container>
> >    </fo:flow>
> >  </fo:page-sequence>
> ></fo:root>
> >
> ><?xml version="1.0" encoding="utf-8"?>
> ><fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
> >  <fo:layout-master-set>
> >    <fo:simple-page-master master-name="simple" page-height="29.7cm" page-width="21cm" margin="1cm 2.5cm 2cm 2.5cm">
> >      <fo:region-body margin-top="3cm"/>
> >      <fo:region-before extent="3cm"/>
> >      <fo:region-after extent="1.5cm"/>
> >    </fo:simple-page-master>
> >  </fo:layout-master-set>
> >  <fo:page-sequence master-reference="simple">
> >    <fo:flow flow-name="xsl-region-body">
> >      <fo:block-container writing-mode="rl">
> >        <fo:block><fo:bidi-override unicode-bidi="bidi-override" direction="rtl">AB</fo:bidi-override> cd</fo:block>
> >      </fo:block-container>
> >    </fo:flow>
> >  </fo:page-sequence>
> ></fo:root>
> >
> >
>
>

Received on Wednesday, 18 May 2011 05:27:05 UTC