Re: bidi and the initial current text position

Hi Cam,

--Original Message--:
>Hi Alex.
>
>Thanks for the reply.  Text is hard. ;)

Sure is! 

>Cameron McCormack:
>> > > <style>
>> > >   text { direction: ltr }
>> > >   tspan { direction: rtl; unicode-bidi: bidi-override }
>> > > </style>
>> > > <text x="100"><tspan>AB</tspan> cd</text>
>> > > 
>> > > The visual order of this is “BA cd”. The <text> has
>> > > text-anchor:start. Where is the text positioned?
>…
>> >  Gecko:        |BA cd  (where “|” is the vertical line at x = 100)
>> >  IE:           |BA cd
>> >  WebKit:  BA cd|
>> >  Opera:   AB cd|
>> >  Batik:        |cd BA
>
>Alex Danilo:
>> OK, so our result (just to complicate things:-) is:
>> 
>> Abbra: cd|BA
>> 
>> The visual order of Batik is close to correct IMO.

I just did a more detailed check with one of my BIDI implementations
that passes a bunch of tests and is independent of the SVG engine.

Batik is correct with the space placement, so "cd BA" is the expected
result. I can't remember the details of why; UAX#9 has complex rules for
numbers, white space, etc. and that result just falls out of them.

>That’s surprising to me, although I still don’t understand all the
>intricacies of bidi layout so it could well be correct.  Why isn’t
>“BAcd” the correct visual order?  Does the direction:ltr on the <text>
>not make this an overall LTR chunk of text with an RTL run at the start
>of it?  If I changed the example to

This is where I'm not sure what we expect to happen.

If you treat the "AB cd" as the thing you do BIDI processing on, then
the embedding levels are 11100. It seems the space is treated as neutral
white space and so picks up the RTL level (I think; without jumping into
the debugger I'm not sure).

>  <text x="100">xy <tspan>AB</tspan> cd</text>
>
>I would expect the visual order to be “xy BA cd”, so I am confused as to
>why removing the “xy” should result in the “BA” going to the right side
>of the “cd”.

Yes, that order is correct, since you have "xy AB cd" which gives BIDI
embedding levels of 00011000 or something like that. Because the RTL run
is embedded between 2 LTR runs, it doesn't swap sides. The BIDI swapping
reverses runs that share an embedding level, working in from the ends of
each run, and so the two cases are different.
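
To make the two cases concrete, here is a minimal Python sketch of just the
reordering step (rule L2 of UAX#9), with the embedding levels supplied by hand
rather than computed. The name reorder_line and the level strings are
illustrative only. The "11122" case is an assumption: it is what you would get
if the line's base level were RTL with the space resolved to that level, and it
is only a guess at what a library-driven implementation might be doing.

```python
def reorder_line(text, levels):
    """Reorder one line per UAX#9 rule L2, given one level digit per character."""
    lv = [int(c) for c in levels]
    if len(lv) != len(text):
        raise ValueError("need exactly one level per character")
    chars = list(text)
    odd = [l for l in lv if l % 2 == 1]
    if not odd:
        return text  # no RTL levels, nothing to reverse
    # From the highest level down to the lowest odd level, reverse every
    # contiguous run of characters at that level or higher.
    for level in range(max(lv), min(odd) - 1, -1):
        i = 0
        while i < len(chars):
            if lv[i] < level:
                i += 1
                continue
            j = i
            while j < len(chars) and lv[j] >= level:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            lv[i:j] = lv[i:j][::-1]  # levels travel with their characters
            i = j
    return "".join(chars)


print(repr(reorder_line("xy AB cd", "00011000")))  # 'xy BA cd'
print(repr(reorder_line("AB cd", "11100")))        # ' BAcd'
print(repr(reorder_line("AB cd", "11122")))        # 'cd BA'
```

Note that with a base level of 0 the levels 11100 give " BAcd", whereas
"cd BA" only appears once the base level is odd. So the interesting question
is which base level each implementation ends up using.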

It is possible to force the start embedding level - i.e. tell the BIDI code that
the starting level is '0' from the <text> and then the visual order would be
more like you expected but I'm not sure that will be possible for things
like Batik, etc. that rely on external libraries to handle the reordering.

The question is: does the absence of any logical characters at the start
of a text chunk with given directionality affect BIDI processing?

Have you tried setting RTL on the <text> and LTR on the <tspan> to see
what the implementations do? That might make another interesting
data point.

Also - while writing this reply I forced the starting BIDI classification level
in my code to 0 (LTR) and it still spat out the same layout. I don't have time
to re-read UAX#9 right now, but I expect that the RTL run, being the first
actual content, is what dictates that it gets placed at the right side of the
line.

>> Existence of the <tspan> doesn't create a new text chunk, it's just
>> defining the directionality isn't it? If so, you are ordering the
>> string:
>> 
>> "AB cd" where "AB" is considered to be RTL, i.e. a UAX#9 embedding
>> level of 1, whilst the " cd" has an embed level of 0. Running UAX#9
>> will swap the "AB" as "BA" across to the right _to be read_ as the
>> first string in the RTL line. The visual order can't start with BA,
>> that's just plain broken.
>
>Ah, so why is it an RTL line and not an LTR line?  Is there a heuristic
>there based on the first logical character being RTL meaning that the
>line as a whole is considered RTL?  If so, does the direction:ltr not
>override that?

Maybe it should, but reading UAX#9 is better than guessing on this one.
If you analyze the line purely by the characteristics of the logical
characters you'll get what I was describing and what Batik does (and by
implication Java's Unicode handling, I guess). If you split into chunks
at the <tspan> boundaries you'll get the alternative. I don't know if the spec
mandates any sort of implicit embedding level from the 'LTR' on the <text>,
but then again perhaps it should...
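
On the "first logical character" heuristic: UAX#9 rules P2 and P3 do choose
the paragraph level from the first strong character (class L, R or AL), and
clause HL1 lets a higher-level protocol set it explicitly instead. A minimal
sketch of that scan, using Python's standard unicodedata module; the function
name paragraph_level is made up for illustration, and the sketch ignores the
isolate handling of later UAX#9 revisions:

```python
import unicodedata


def paragraph_level(text):
    """Return 0 (LTR) or 1 (RTL) from the first strong character, per P2/P3."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return 0  # no strong character found: default to LTR


print(paragraph_level("AB cd"))            # Latin letters are class L
print(paragraph_level("\u05d0\u05d1 cd"))  # Hebrew alef, bet are class R
```

One thing this makes visible: Latin "A" and "B" are class L however they are
overridden, so a scan like this sees an LTR start for the literal "AB cd". An
implementation that decides the line is RTL must be getting that from
somewhere else, for example from the overridden classes or from the markup.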

In CSS3 Writing modes http://dev.w3.org/csswg/css3-writing-modes/
I see:

"The ‘direction’ property has no effect on bidi reordering when specified on inline elements whose ‘unicode-bidi’ property's value is ‘normal’"

and

"bidi-override For inline elements this creates an override. For block-container
elements this creates an override for inline-level descendants not within another
block container element. This means that inside the element, reordering is
strictly in sequence according to the ‘direction’ property; the implicit part of the
bidirectional algorithm is ignored. This corresponds to adding a LRO (U+202D),
for ‘direction: ltr’, or RLO (U+202E), for ‘direction: rtl’, at the start of the element
and a PDF (U+202C) at the end of the element"

So maybe we need to construct a test with RLO and PDF instead of the markup
you have and see what happens.
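
A minimal sketch of how such a test string could be built, in Python. The
helper name with_override and the generated <text> line are illustrative
assumptions, not taken from any existing test suite; only the code points
U+202D, U+202E and U+202C come from the quoted text.

```python
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def with_override(text, direction):
    """Wrap text in the control characters equivalent to bidi-override."""
    start = {"ltr": LRO, "rtl": RLO}[direction]
    return start + text + PDF


# Logical string equivalent to <tspan>AB</tspan> cd with the RTL override.
logical = with_override("AB", "rtl") + " cd"

# Emit the controls as numeric character references so they stay visible
# in the SVG source.
escaped = "".join(
    f"&#x{ord(c):04X};" if c in (LRO, RLO, PDF) else c for c in logical
)
svg_text = f'<text x="100">{escaped}</text>'
print(svg_text)
```

Rendering that <text> alongside the styled <tspan> version in each
implementation would show whether the two forms are treated the same.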

The first quote, about the direction property, doesn't seem to indicate we should assume
LTR for the content, since there is no bidi-override for the <text> element itself.

>> I don't see any prose in the spec. that says the existence of the
>> <tspan> or the unicode-bidi:bidi-override etc. create a new text
>> chunk. So I think the re-order should happen on the entire text chunk
>> since the <tspan> does not introduce a new 'X' position or anything
>> else that could be considered a chunk maker. They are 2 'runs' of
>> text, but still one chunk I would have thought.
>
>Yes I agree with that.  (A “chunk maker” sounds like a particularly
>nasty combination of alcoholic beverages. ;))

Indeed!

>> Now as for the space - it's in the LTR content " cd" and since the
>> "AB" gets swapped across to the right side, the space leads the "cd"
>> and so there should be no space after the "cd". Are you sure Batik
>> stuck a space in there?
>
>Yeah: http://mcc.id.au/temp/bps-batik.png

As I said above Batik is correct here, if analyzing just the concatenated string.

>If I construct the equivalent HTML example (without the positioning, and
>with a background colour on the RTL span):
>
>  http://people.mozilla.org/~cmccormack/tests/bidi-simple.html
>
>then I find that browsers uniformly render it as “BA cd”.

That's good to know. But are the browsers treating each <span>
as what we term a chunk? It would be good to stick some Arabic
characters and Latin in a single string to see what they do without
the explicit settings, then separate the Arabic with a <span> to
see if the BIDI analysis is being done on the entire line, or piece by piece.
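
When building a string like that it helps to check which bidi class each
character really has before blaming the layout engine. A minimal sketch using
Python's standard unicodedata module; the sample string (Arabic alef and beh
followed by " cd") and the name bidi_classes are just illustrative choices:

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


sample = "\u0627\u0628 cd"  # ARABIC LETTER ALEF, ARABIC LETTER BEH, space, c, d
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X} {cls}")
```

The Arabic letters report AL, the space WS and the Latin letters L, so the
string has a genuinely RTL start without any override being involved.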

>> As for "current text position", I think what we're doing is wrong
>> here. From the text in the spec. I'd expect to see:
>> 
>> " cdBA|"
>> 
>> namely, that the first logical character (the start character) is
>> placed to the left of the starting position.
>
>OK.  I’ll wait to see your reasoning on the “cdBA” layout as opposed to
>“BAcd”, but if you are right then that does make sense.  If “BAcd” is
>the right layout (which is what I was assuming) then it’s trickier, and
>my questions from my original mail about what that x="100" actually
>means stands.
>
>> Anyway - more data for you, but the interoperability is a mess,
>> and the BIDI handling of the implementations more so. If that was
>> real Arabic, it would be totally unreadable in 4 out of the 6
>> implementations...
>
>Thanks for looking into it.  I agree this is a bit of a mess, but on the
>bright side, it gives us the opportunity to make changes for the better.

"Why can't they just speak English":-)

Alex

>-- 
>Cameron McCormack ≝ http://mcc.id.au/
>
>
>

Received on Wednesday, 18 May 2011 03:37:09 UTC