- From: Glenn Adams <glenn@skynav.com>
- Date: Tue, 17 May 2011 23:26:15 -0600
- To: Alex Danilo <alex@abbra.com>
- Cc: Cameron McCormack <cam@mcc.id.au>, public-svg-wg@w3.org
- Message-ID: <BANLkTimsnBgdUB1NxZtUmN_PR3FQTrmTEQ@mail.gmail.com>
Alex,

Glad we are converging. By the way, attached is a more complex example of
vowelized Arabic, exhibiting a variety of advanced typographic features
driven by the OpenType GSUB/GPOS processing, which I perform directly in
FOP, including:

- obligatory ligatures
- conditional (font selected) ligatures
- combining mark ligatures (mark on mark)
- mark on base positioning
- mark on ligature positioning
- mark on mark positioning
- base on base positioning (end of ayah)

As you can see, there is some variation across the four fonts used in the
example. FOP merely applies the available GSUB/GPOS tables during the
character to glyph mapping (substitution) and positioning process, so it is
the font designer that is determining the result to a significant extent.

Cheers,
Glenn

On Tue, May 17, 2011 at 10:21 PM, Alex Danilo <alex@abbra.com> wrote:
> Hi Glenn,
>
> Doing more experiments I agree with what you're saying.
>
> At issue is the unicode-bidi property handling. We don't
> process it at all, and the lack of RLO and PDF behaviour results
> in what Batik (and we) do. So, given the text in CSS3 and another
> little test I wrote here we get the "BA cd" ordering as well.
>
> Thanks for running the tests.
>
> Alex
>
> --Original Message--:
> >
> >Alex,
> >
> >I'm afraid I don't agree with your statement that Batik is correct if it
> >produces "cd BA". [But read further in this message.] I have verified the
> >implementation I referred to in my previous email with the 216,357 test
> >sequences contained in the UAX#9 test suite [1], so I know that it is a
> >correct implementation. By the way, the presence of the space before "cd"
> >has no effect on the ordering.
> >
> >In this case, the resolved levels are:
> >
> >Input Text      : AB cd
> >Resolved Levels : 11000
> >After Reorder   : BA cd
> >
> >See the updated attachments and also the following updated log (with a
> >single SPACE added before "cd"):
> >
> >BD: RESOLVE: org.apache.fop.fo.pagination.PageSequence@1acc0e01[@id=]
> >BR[ 0, 4] : 1: SOR(R), EOR(R)
> >BR[ 4, 7] : 0: SOR(R), EOR(L)
> >RL: CC(7)
> >BD: default level(0)
> >‮ : RLO 1
> >A : L 1
> >B : L 1
> >‬ : PDF 1
> >  : WS 0
> >c : L 0
> >d : L 0
> >AL(1): B[0,1][0](1)
> >AL(1): T[1,3][1](1)
> >AL(1): B[3,4][3](1)
> >AL(0): T[4,7][4](0)
> >DR: block{ <‮AB‬ cd>, intervals <B[0,1][0](1),T[1,3][1](1),B[3,4][3](1),T[4,7][4](0)>}
> >BD: REORDER: INPUT:
> >RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 0, content = <AB> }
> >RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
> >RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
> >BD: REORDER: SPLIT INLINES:
> >RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 0, content = <AB> }
> >RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
> >RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
> >BD: REORDER: { min = 0, max = 1}
> >BD: REORDER: REORDERED RUNS:
> >RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 1, content = <AB> }
> >RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
> >RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
> >BD: REORDER: REORDERED WORDS:
> >RR: { type = 'W', levels = '11', min = 1, max = 1, reversals = 1, content = <BA> }
> >RR: { type = 'S', levels = '0', min = 0, max = 0, reversals = 0, content = < > }
> >RR: { type = 'W', levels = '00', min = 0, max = 0, reversals = 0, content = <cd> }
> >
> >Note well, however, that if the paragraph embedding level is changed to 1,
> >i.e., RTL, then you do in fact get "cd BA" as you suggest. In this case, one
> >would have the following:
> >
> >Input Text      : AB cd
> >Resolved Levels : 33122
> >After Reorder   : cd BA
> >
> >Perhaps Batik is not correctly assigning the paragraph embedding level.
> >Since SVG doesn't have a paragraph construct per se, I would expect the
> >direction property on the <text/> element to determine the paragraph
> >embedding level. In Cameron's example, that was "ltr", which would make the
> >paragraph embedding level 0, not 1.
> >
> >I have also added test XSL-FO and output PDF files showing the results if
> >the paragraph embedding level is changed to 1 (here, by adding a
> >writing-mode='rl' attribute on fo:block-container).
> >
> >Regards,
> >Glenn
> >
> >[1] http://www.unicode.org/Public/UNIDATA/BidiTest.txt
> >
> >On Tue, May 17, 2011 at 9:36 PM, Alex Danilo <alex@abbra.com> wrote:
> >
> >Hi Cam,
> >
> >--Original Message--:
> >>Hi Alex.
> >>
> >>Thanks for the reply. Text is hard. ;)
> >
> >Sure is!
> >
> >>Cameron McCormack:
> >>>>><style>
> >>>>>  text { direction: ltr }
> >>>>>  tspan { direction: rtl; unicode-bidi: bidi-override }
> >>>>></style>
> >>>>><text x="100"><tspan>AB</tspan> cd</text>
> >>>>>
> >>>>>The visual order of this is “BA cd”. The <text> has
> >>>>>text-anchor:start. Where is the text positioned?
> >>…
> >>>> Gecko: |BA cd (where “|” is the vertical line at x = 100)
> >>>> IE: |BA cd
> >>>> WebKit: BA cd|
> >>>> Opera: AB cd|
> >>>> Batik: |cd BA
> >>
> >>Alex Danilo:
> >>>OK, so our result (just to complicate things:-) is:
> >>>
> >>>Abbra: cd|BA
> >>>
> >>>The visual order of Batik is close to correct IMO.
> >
> >I just did a more detailed check with one of my BIDI implementations
> >that passes a bunch of tests and is independent of the SVG engine.
> >
> >Batik is correct with the space placement, so "cd BA" is the expected
> >result. I can't remember the detail why - UAX#9 has complex rules for
> >numbers, white-space etc. and that just falls out.
> >
> >>That’s surprising to me, although I still don’t understand all the
> >>intricacies of bidi layout so it could well be correct. Why isn’t
> >>“BAcd” the correct visual order? Does the direction:ltr on the <text>
> >>not make this an overall LTR chunk of text with an RTL run at the start
> >>of it? If I changed the example to
> >
> >This is where I'm not sure what we expect to happen.
> >
> >If you treat the "AB cd" as the thing you do BIDI processing on, then
> >the embed levels are 11100 - seems the white-space is white space and so
> >gets the RTL level (I think, without jumping in the debugger I'm not sure).
> >
> >> <text x="100">xy <tspan>AB</tspan> cd</text>
> >>
> >>I would expect the visual order to be “xy BA cd”, so I am confused as to
> >>why removing the “xy” should result in the “BA” going to the right side
> >>of the “cd”.
> >
> >Yes, the order is correct, since you have "xy AB cd" which gives a BIDI
> >level ordering of 00011000 or something like that, so because it's
> >embedded within 2 LTR runs it doesn't swap. The BIDI swapping happens
> >from the ends with the same embedding level and so, the cases are
> >different.
> >
> >It is possible to force the start embedding level - i.e. tell the BIDI code that
> >the starting level is '0' from the <text> and then the visual order would be
> >more like you expected but I'm not sure that will be possible for things
> >like Batik, etc. that rely on external libraries to handle the reordering.
> >
> >The question is: does the absence of any logical characters at the start
> >of a text chunk with given directionality affect BIDI processing?
> >
> >Have you tried setting RTL on the <text> and LTR on the <tspan> to see
> >what the implementations do? That might make another interesting
> >data point.
> >
> >Also - while writing this reply I forced the starting BIDI classification level
> >in my code to 0 (LTR) and it still spat out the same layout.
> >I don't want to
> >have to read UAX#9 again, not enough time right now but I expect the
> >RTL as the first actual content is dictating it get placed at the right side
> >of the line since it is the starting text.
> >
> >>>Existence of the <tspan> doesn't create a new text chunk, it's just
> >>>defining the directionality isn't it? If so, you are ordering the
> >>>string:
> >>>
> >>>"AB cd" where "AB" is considered to be RTL, i.e. a UAX#9 embedding
> >>>level of 1, whilst the " cd" has an embed level of 0. Running UAX#9
> >>>will swap the "AB" as "BA" across to the right _to be read_ as the
> >>>first string in the RTL line. The visual order can't start with BA,
> >>>that's just plain broken.
> >>
> >>Ah, so why is it an RTL line and not an LTR line? Is there a heuristic
> >>there based on the first logical character being RTL meaning that the
> >>line as a whole is considered RTL? If so, does the direction:ltr not
> >>override that?
> >
> >Maybe it should, but reading UAX#9 is better than guessing this one.
> >If you analyze the lines purely by the characteristics
> >of the logical characters you'll get what I was describing and Batik does
> >(and by implication Java Unicode handling I guess). If you split into chunks
> >at the <tspan> boundaries you'll get the alternate. I don't know if the spec.
> >mandates any sort of implicit embedding level from the 'LTR' on the <text>
> >but then again perhaps it should...
> >
> >In CSS3 Writing modes http://dev.w3.org/csswg/css3-writing-modes/
> >I see:
> >
> >"The ‘direction’ property has no effect on bidi reordering when specified
> >on inline elements whose ‘unicode-bidi’ property's value is ‘normal’"
> >
> >and
> >
> >"bidi-override For inline elements this creates an override. For block-container
> >elements this creates an override for inline-level descendants not within another
> >block container element.
> >This means that inside the element, reordering is
> >strictly in sequence according to the ‘direction’ property; the implicit part of the
> >bidirectional algorithm is ignored. This corresponds to adding a LRO (U+202D),
> >for ‘direction: ltr’, or RLO (U+202E), for ‘direction: rtl’, at the start of the element
> >and a PDF (U+202C) at the end of the element"
> >
> >So maybe we need to construct a test with RLO and PDF instead of the markup
> >you have and see what happens.
> >
> >The first comment about the direction property doesn't seem to indicate we should assume
> >LTR for the content, since there is no bidi-override for the <text> element itself.
> >
> >>>I don't see any prose in the spec. that says the existence of the
> >>><tspan> or the unicode-bidi:bidi-override etc. create a new text
> >>>chunk. So I think the re-order should happen on the entire text chunk
> >>>since the <tspan> does not introduce a new 'X' position or anything
> >>>else that could be considered a chunk maker. They are 2 'runs' of
> >>>text, but still one chunk I would have thought.
> >>
> >>Yes I agree with that. (A “chunk maker” sounds like a particularly
> >>nasty combination of alcoholic beverages. ;))
> >
> >Indeed!
> >
> >>>Now as for the space - it's in the LTR content " cd" and since the
> >>>"AB" gets swapped across to the right side, the space leads the "cd"
> >>>and so there should be no space after the "cd". Are you sure Batik
> >>>stuck a space in there?
> >>
> >>Yeah: http://mcc.id.au/temp/bps-batik.png
> >
> >As I said above Batik is correct here, if analyzing just the concatenated string.
> >
> >>If I construct the equivalent HTML example (without the positioning, and
> >>with a background colour on the RTL span):
> >>
> >> http://people.mozilla.org/~cmccormack/tests/bidi-simple.html
> >>
> >>then I find that browsers uniformly render it as “BA cd”.
> >
> >That's good to know. But are the browsers treating each <span>
> >as what we term a chunk?
> >It would be good to stick some Arabic
> >characters and Latin in a single string to see what they do without
> >the explicit settings, then separate the Arabic with a <span> to
> >see if the BIDI analysis is being done on the entire line, or piece by piece.
> >
> >>>As for "current text position", I think what we're doing is wrong
> >>>here. From the text in the spec. I'd expect to see:
> >>>
> >>>"cdBA|"
> >>>
> >>>namely, that the first logical character (the start character) is
> >>>placed to the left of the starting position.
> >>
> >>OK. I’ll wait to see your reasoning on the “cdBA” layout as opposed to
> >>“BAcd”, but if you are right then that does make sense. If “BAcd” is
> >>the right layout (which is what I was assuming) then it’s trickier, and
> >>my questions from my original mail about what that x="100" actually
> >>means stand.
> >>
> >>>Anyway - more data for you, but the interoperability is a mess,
> >>>and the BIDI handling of the implementations more so. If that was
> >>>real Arabic, it would be totally unreadable in 4 out of the 6
> >>>implementations...
> >>
> >>Thanks for looking into it. I agree this is a bit of a mess, but on the
> >>bright side, it gives us the opportunity to make changes for the better.
> >
> >"Why can't they just speak English":-)
> >
> >Alex
> >
> >>--
> >>Cameron McCormack ≝ http://mcc.id.au/
> >
> ><?xml version="1.0" encoding="utf-8"?>
> ><fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
> >  <fo:layout-master-set>
> >    <fo:simple-page-master master-name="simple" page-height="29.7cm" page-width="21cm" margin="1cm 2.5cm 2cm 2.5cm">
> >      <fo:region-body margin-top="3cm"/>
> >      <fo:region-before extent="3cm"/>
> >      <fo:region-after extent="1.5cm"/>
> >    </fo:simple-page-master>
> >  </fo:layout-master-set>
> >  <fo:page-sequence master-reference="simple">
> >    <fo:flow flow-name="xsl-region-body">
> >      <fo:block-container writing-mode="lr">
> >        <fo:block><fo:bidi-override unicode-bidi="bidi-override" direction="rtl">AB</fo:bidi-override> cd</fo:block>
> >      </fo:block-container>
> >    </fo:flow>
> >  </fo:page-sequence>
> ></fo:root>
> >
> ><?xml version="1.0" encoding="utf-8"?>
> ><fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
> >  <fo:layout-master-set>
> >    <fo:simple-page-master master-name="simple" page-height="29.7cm" page-width="21cm" margin="1cm 2.5cm 2cm 2.5cm">
> >      <fo:region-body margin-top="3cm"/>
> >      <fo:region-before extent="3cm"/>
> >      <fo:region-after extent="1.5cm"/>
> >    </fo:simple-page-master>
> >  </fo:layout-master-set>
> >  <fo:page-sequence master-reference="simple">
> >    <fo:flow flow-name="xsl-region-body">
> >      <fo:block-container writing-mode="rl">
> >        <fo:block><fo:bidi-override unicode-bidi="bidi-override" direction="rtl">AB</fo:bidi-override> cd</fo:block>
> >      </fo:block-container>
> >    </fo:flow>
> >  </fo:page-sequence>
> ></fo:root>
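
The reordering step that turns the resolved levels quoted in this thread into a visual order can be sketched in a few lines of Python. This is only a minimal sketch of UAX#9 rule L2 (reverse runs from the highest level down to the lowest odd level). It is not FOP's or Batik's code, the name `reorder_visual` is made up for illustration, the levels are taken as given rather than resolved from the text, and mirroring, the rule L1 whitespace resets and line breaking are all ignored.

```python
def reorder_visual(text, levels):
    """Reorder logical text into visual order per UAX#9 rule L2.

    `levels` holds one already-resolved embedding level per character.
    Hypothetical helper for illustration only.
    """
    if len(text) != len(levels):
        raise ValueError("need exactly one level per character")

    chars = list(text)
    lv = list(levels)
    odd_levels = [level for level in lv if level % 2 == 1]
    if not odd_levels:
        return text  # purely LTR levels: nothing to reverse

    highest = max(lv)
    lowest_odd = min(odd_levels)

    # From the highest level down to the lowest odd level, reverse every
    # contiguous run of characters whose level is >= the current level.
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(chars):
            if lv[i] >= level:
                j = i
                while j < len(chars) and lv[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]
                lv[i:j] = lv[i:j][::-1]  # keep levels attached to their chars
                i = j
            else:
                i += 1
    return "".join(chars)


def parse_levels(digits):
    """Turn a compact level string such as '11000' into a list of ints."""
    return [int(d) for d in digits]


if __name__ == "__main__":
    print(reorder_visual("AB cd", parse_levels("11000")))  # paragraph level 0
    print(reorder_visual("AB cd", parse_levels("33122")))  # paragraph level 1
```

With the levels reported in the thread, the sketch gives "BA cd" for 11000 (LTR paragraph level) and "cd BA" for 33122 (RTL paragraph level), which is why the choice of paragraph embedding level decides between the two renderings being compared.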
Attachments
- text/xml attachment: test.fo.xml
- application/pdf attachment: test.pdf
Received on Wednesday, 18 May 2011 05:27:05 UTC