RE: [bidi] Re: Special ordering for BIDI URLs from Shawn Steele on 2010-05-26 (public-iri@w3.org from May 2010)

From: Shawn Steele <Shawn.Steele@microsoft.com>
Date: Wed, 26 May 2010 00:33:47 +0000
To: Murray Sargent <murrays@exchange.microsoft.com>, "Phillips, Addison" <addison@lab126.com>, Mark Davis ☕ <mark@macchiato.com>
CC: "Aharon (Vladimir) Lanin" <aharon@google.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "public-iri@w3.org" <public-iri@w3.org>, "bidi@unicode.org" <bidi@unicode.org>, Nasser Kettani <Nasser.Kettani@microsoft.com>
Message-ID: <E14011F8737B524BB564B05FF748464A0D9E6B6F@TK5EX14MBXC139.redmond.corp.microsoft.>
Yes, it pretty much requires knowing that it’s a URI and doing a special case for that.   I realize that isn’t perfect, but I think it best reflects the user’s desired behavior.

To back up and be more general, “a problem” with the Unicode BIDI algorithm is that it makes presumptions about the format of strings that don’t necessarily apply in all contexts.  Adding override marks helps a little, but it still requires detecting contexts.  IRIs are a special case, but I think you could get these problems with other sets as well.  “(aardvark, BEE, CAT, dog, elephant)” would probably render in an unexpected way, for example.  I think the BIDI algorithm does ok for a general algorithm, but that’s why other contexts might need tailoring.

Note:  I’m getting dangerously out of my depth as well ☺

-Shawn

From: Murray Sargent
Sent: Tuesday, May 25, 2010 4:10 PM
To: Phillips, Addison; Shawn Steele; Mark Davis ☕
Cc: Aharon (Vladimir) Lanin; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org; Nasser Kettani
Subject: RE: [bidi] Re: Special ordering for BIDI URLs

I think what Shawn is recommending is the method I describe in Tailoring the Unicode Bidi Algorithm<http://blogs.msdn.com/b/murrays/archive/2010/04/07/tailoring-the-unicode-bidi-algorithm.aspx> in the section on IRIs. Namely we force the delimiters '#', '.', '/', ':', '?', '@', '[', ']' to follow the paragraph (or embedding) direction. This does require IRI recognition, but that’s pretty commonplace.
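
The delimiter rule described above could be modeled roughly as follows. This is a minimal sketch, not the implementation from the blog post: it shows only the class-assignment step, assuming the text has already been recognized as an IRI, and the names `IRI_DELIMITERS` and `tailored_bidi_classes` are hypothetical. A full renderer would still run the rest of the UBA on the resulting classes.

```python
import unicodedata

# The delimiters listed in the message above.
IRI_DELIMITERS = set("#./:?@[]")


def tailored_bidi_classes(iri: str, paragraph_rtl: bool) -> list[str]:
    """Return one bidi class per character of an already-recognized IRI.

    Delimiters are forced to the paragraph (or embedding) direction;
    every other character keeps its normal Unicode bidi class.
    """
    forced = "R" if paragraph_rtl else "L"
    return [
        forced if ch in IRI_DELIMITERS else unicodedata.bidirectional(ch)
        for ch in iri
    ]
```

For instance, the '.' between a Latin letter and a Hebrew letter would normally have the weak class CS; under this tailoring it takes the paragraph direction instead.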

Murray

From: Phillips, Addison [mailto:addison@lab126.com]
Sent: Tuesday, May 25, 2010 4:07 PM
To: Shawn Steele; Mark Davis ☕
Cc: Aharon (Vladimir) Lanin; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org; Murray Sargent; Nasser Kettani
Subject: RE: [bidi] Re: Special ordering for BIDI URLs

The labels? Or the URI?

When I expand your example I get:

http://apple.bee.CIBARA/dog.ear.WERBEH.html


and:

html.WERBEH.ear.dog/CIBARA.bee.apple//:http

The latter doesn’t seem at all possible without knowing that it is a URI and providing special processing. RLO reverses the left-to-right sequences entirely (“CIBARA.eeb.elppa//:ptth”) and RLE isn’t enough (“CIBARAhttp://apple.bee.”).

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.

From: Shawn Steele [mailto:Shawn.Steele@microsoft.com]
Sent: Tuesday, May 25, 2010 4:02 PM
To: Mark Davis ☕
Cc: Phillips, Addison; Aharon (Vladimir) Lanin; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org; Murray Sargent; Nasser Kettani
Subject: RE: [bidi] Re: Special ordering for BIDI URLs

I meant that in a LTR context it’d always be LTR.  In a bidi/RTL context the ordering would be RTL.  That could, perhaps, include LTR text being in a bidi mode if that was appropriate.

-Shawn

From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On Behalf Of Mark Davis ☕
Sent: Tuesday, May 25, 2010 3:43 PM
To: Shawn Steele
Cc: Phillips, Addison; Aharon (Vladimir) Lanin; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org; Murray Sargent; Nasser Kettani
Subject: Re: [bidi] Re: Special ordering for BIDI URLs

I agree that http and html are issues. http is probably not a big one; if the rest of the URL were ok, it wouldn't matter much if that were at the start or end. html (htm, pdf, ...) are more of a problem.

It is unclear whether you mean that the ordering of the labels is always LTR, or that the URL is treated as if it were in a LTR context.

Mark

— Il meglio è l’inimico del bene —
On Tue, May 25, 2010 at 13:42, Shawn Steele <Shawn.Steele@microsoft.com> wrote:
“http://” and “.html” would probably make pure RTL IRIs difficult.  Personally, I think a “if it has RTL, then render the pieces in RTL order” approach is simplest.  (so http://a.B.C/d.e.F.html would render as http://a.B.C/d.e.F.html or html.F.e.d/C.B.a//:http, but not some mixed form.)  I’m probably oversimplifying it though.
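
A rough sketch of the "render the pieces in RTL order" idea, using this thread's ASCII convention in which uppercase letters stand in for RTL text already shown in visual order. The function names and the `rtl_context` parameter are hypothetical, and real code would test Unicode bidi classes rather than letter case.

```python
import re

# A "piece" is either a run of letters/digits (a label) or a single delimiter.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]")


def has_rtl_label(visual: str) -> bool:
    """Thread convention (an assumption): uppercase ASCII represents RTL text."""
    return any(ch.isupper() for ch in visual)


def render_pieces(visual: str, rtl_context: bool = True) -> str:
    """Lay the pieces out right-to-left if the IRI has RTL text and the context allows."""
    if not (rtl_context and has_rtl_label(visual)):
        return visual
    # Reverse the order of the pieces, not the characters inside each piece.
    return "".join(reversed(TOKEN_RE.findall(visual)))


print(render_pieces("http://a.B.C/d.e.F.html"))  # html.F.e.d/C.B.a//:http
```

The output is always one of the two consistent orderings, never a mixed form, which is the point of the proposal.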

-Shawn

From: Phillips, Addison [mailto:addison@lab126.com]
Sent: Tuesday, May 25, 2010 12:17 PM
To: Mark Davis ☕; Aharon (Vladimir) Lanin
Cc: Shawn Steele; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org; Murray Sargent
Subject: RE: [bidi] Re: Special ordering for BIDI URLs


(chair hat off)



Adding RTL scheme identifiers is not going to be wholly effective. Only if you can have a completely pure RTL URI (all parts: path, query, scheme, etc.) can you completely avoid ambiguity in display of unadorned plain text URIs. But I don’t think that’s a reasonable approach: we don’t call them “bi-directional” languages for no reason. There is a lot of LTR data in the world that would like to be expressed in a URI.



I see Mark's point that requiring URI-awareness in plain text is a non-starter. I think limiting to unidirectional IRIs (either all LTR or all RTL) is also a non-starter: there is no migration except for *total* migration.



Thinking about “specialized bidi”, the simplest solution I can think of is: give URIs an inherent LTR directionality (which is implied, at least, by a strongly LTR scheme and the tendency of DNS names to be LTR). I think this is what Slim is suggesting. It means that you would need to insert a left-to-right override in front of a bidi URI in running plain text, or, in the case of things like address bars, behave with an inherent LTR reading order. As a rule this could be understandable to users, and, since URIs today are in the main ASCII it might be the "least surprising" to users as they migrate to placing RTL text into a URI.
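
Inserting the override in running plain text might look like the sketch below. The helper name `force_ltr` is hypothetical, and closing the override with U+202C (POP DIRECTIONAL FORMATTING) is an added assumption so that the override does not leak into the text that follows; the proposal itself only mentions the leading LRO.

```python
import unicodedata

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING (terminator; an assumption here)


def force_ltr(uri: str) -> str:
    """Wrap a URI so that it displays with a left-to-right reading order."""
    return f"{LRO}{uri}{PDF}"


wrapped = force_ltr("http://example.com/")
# The control characters are invisible but are part of the string.
print(len(wrapped) - len("http://example.com/"))  # 2
```

Because the controls travel with the string, any consumer that treats the text as an actual URI would have to decide whether to strip them first.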



Here's an experiment (although I used actual Arabic text, I present as ASCII for convenience here. I use <lro> for the Unicode character). I typed four URIs:



http://example.com/1CIBARA/2CIBARA


http://CIBARA.com/1CIBARA/2CIBARA


<lro>http://example.com/1CIBARA/2CIBARA

<lro>http://CIBARA.com/1CIBARA/2CIBARA




I typed the above into Notepad and set the reading order to right to left and saw:



2CIBARA/1CIBARA/http://example.com


2CIBARA/1CIBARA/com.CIBARA//:http

http://example.com/1CIBARA/2CIBARA


http://CIBARA.com/1CIBARA/2CIBARA




Note that the first two, in a left-to-right reading order, display as:



http://example.com/2CIBARA/1CIBARA


http://CIBARA.com/2CIBARA/1CIBARA




The LRO-bearing versions display the same in both RTL and LTR contexts, although the path element order appears backwards to RTL readers. The unadorned text versions display "normally" (to an RTL reader) only when they are predominantly right-to-left with isolated LTR runs. They look broken (I suspect even to RTL readers) when there are successive left-to-right runs.



One downside is that it doesn't work very well in a markup environment. Consider:



<a href="<lro>http://CIBARA.com">Is the LRO part of the uri?</a>



If we are to print URIs on the sides of buses or on napkins under our tea cups, I'm not sure if it would be that bad to require the left-to-right reading order inherent in URI today as a "carryover" to IRIs. While unnatural to RTL speakers in the abstract, perhaps in practice "//:http" would seem unnatural to users (because they never see URIs like that) and it doesn't require any knowledge of the interior structure of a URI to apply an overall reading order in many (but not all) contexts.



I also see the other side of this argument. I must admit that I am in agreement with the sentiments in the email John Klensin just sent [1]. I think I tend to favor a solution that is more universal over one that requires a lot of specialized handling for bidi, but in practice this ensnares us in the corner cases inherent in UBA and disadvantages, at least to some degree, speakers of languages written in RTL scripts.



Addison



[1] http://lists.w3.org/Archives/Public/public-iri/2010May/0039.html


Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.

From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf Of Mark Davis ☕
Sent: Tuesday, May 25, 2010 11:31 AM
To: Aharon (Vladimir) Lanin
Cc: Shawn Steele; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org; Murray Sargent
Subject: Re: [bidi] Re: Special ordering for BIDI URLs

It looks like we are having some useful discussions. Let me try to clarify a bit of what I said. My original message was getting longish, and I know people's eyes glaze when it gets too long, so I think I wasn't clear on a couple of matters.

At a high level, there are two choices (as far as I know):

1. Market Forces. Make it possible for URLs (actually IRIs) to be completely RTL, and push sites and programs to use them. Note that part of this can be adding mechanisms to URL-aware programs to flag to users when BIDI reordering is changing the order of labels, such as flagging them with a special format.

2. Specialized BIDI. Force a consistent order on URLs, using a higher-level protocol on top of the UBA.

You mention %, which is relevant to #1 and RLM/LRMs, which are relevant to #2.


A. As far as % goes, what that means is that every label can be constructed so as to contain no LTR characters. By "label", I mean in a broad sense, so each of the three-letter sequences below counts as a label.

http://abc.def.ghi/jkl/mno?pqr=stu&vwx=yza#bcd


(The scheme is an exception: it has problems that Martin and John point out, but if that alone is LTR, it is not too bad; people can handle that being reordered if it is limited to it.)

The % is an issue, although in an ideal world its use would be minimized in what the user sees. Although the characters have to be %-encoded or punycoded to go over the web, they can be restored for display to the user. That is, % would occur only in a label where a character has to be quoted so that it does not terminate the label. We can discuss how to handle the cases where it cannot be minimized: how sites can work around it, whether the remaining cases represent a significant problem, and if so, whether there is some alternative syntax that could be used.
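
The round trip between the wire form and the display form can be illustrated with a small sketch. It uses Python's standard `urllib.parse` functions and built-in `idna` codec (which implements the older IDNA 2003 rules); the helper names and the sample Arabic strings are assumptions made for illustration only.

```python
from urllib.parse import quote, unquote


def host_to_wire(host: str) -> str:
    """Punycode each label of a host name for transmission."""
    return host.encode("idna").decode("ascii")


def host_to_display(host: str) -> str:
    """Restore the original characters of a punycoded host name."""
    return host.encode("ascii").decode("idna")


def path_to_wire(path: str) -> str:
    """Percent-encode a path, leaving '/' as the label separator."""
    return quote(path, safe="/")


def path_to_display(path: str) -> str:
    """Undo percent-encoding for display to the user."""
    return unquote(path)


host, path = "\u0645\u062b\u0627\u0644.com", "/\u0645\u062b\u0627\u0644/a b"
assert host_to_display(host_to_wire(host)) == host
assert path_to_display(path_to_wire(path)) == path
```

Note that `unquote` restores every escape, including ones that would have to stay quoted to keep a label from being terminated, so a real display routine would need to be more selective than this.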

Where the query string contains LTR characters, there are a couple of choices. For most people, the query part is just technical gorp. And websites are able to put whatever they want into those strings; their interpretation is private to that site. So there are a couple of approaches (at least):

  *   Not really bother with it: if it contains LTR characters then it reorders in a funny way, but since it is technical gorp we don't care.
  *   Have some simple standardized way of mapping LTR characters in the query part into bidi characters that sites can use if they want to be wholly RTL.

B. As far as RLM/LRMs, they are relevant to the Specialized BIDI approach. (As I said before, I have doubts as to whether this approach is viable, but it is worth pursuing how it could be).

What we recommend in the UBA is that if people are going to override the BIDI algorithm for any purpose, that they effectively do so by the insertion of bidi controls (we should make that recommendation clearer, however). So how would this play out with URLs?

  1.  I type a URL into an address bar. Since the program is URL-aware*, it parses out the labels. Based on whatever standard mechanism is defined (e.g., the URL contains an RTL character), it is detected as a BIDI label, and ordered consistently. Effectively, that is done by inserting RLM at the start of each label that doesn't begin with an RTL character and at the end of each label that doesn't end with an RTL character. One could use the embedding codes, but they are more dangerous.
  2.  This is the display form: when the URL is looked up, the RLMs have to be stripped before it is transformed into punycode and %escaped.
  3.  If I cut or copy that URL, then the RLMs go with it into plain text on the clipboard.
  4.  When I paste that address into plain text, it then appears in the same order as it was in the address bar.
Take another case:

  1.  I see a URL in some plain text (whether or not it is consistently ordered), and cut and paste that plaintext URL into an address bar (or other URL-aware* program). In that case, the program renormalizes the URL. That is, it strips out all bidi controls, and then reapplies the BIDI detection and RLM insertion. I then end up with consistent ordering in the result.
Note that in no cases would we expect people to manually put in the RLMs.
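
The two cases above can be sketched as follows. This is only an illustration under stated assumptions: the set of delimiters that separate labels, the list of bidi controls to strip, the "contains an RTL character" detection rule, and all function names are hypothetical choices, since no standard mechanism has been defined. The sketch also treats the scheme like any other label.

```python
import re
import unicodedata

RLM = "\u200f"
# Bidi controls removed before lookup (an assumed list: LRM, RLM, LRE, RLE, PDF, LRO, RLO).
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"
_STRIP_TABLE = str.maketrans("", "", BIDI_CONTROLS)
# A label is a run of characters between these delimiters (an assumption).
LABEL_RE = re.compile(r"[^:/?#\[\]@.&=]+")


def is_rtl(ch: str) -> bool:
    return unicodedata.bidirectional(ch) in ("R", "AL")


def to_lookup_form(url: str) -> str:
    """Strip all bidi controls before punycoding and %-escaping."""
    return url.translate(_STRIP_TABLE)


def _mark_label(match: re.Match) -> str:
    label = match.group(0)
    prefix = "" if is_rtl(label[0]) else RLM
    suffix = "" if is_rtl(label[-1]) else RLM
    return prefix + label + suffix


def to_display_form(url: str) -> str:
    """Renormalize: strip existing controls, then re-insert RLMs if the URL is BIDI."""
    clean = to_lookup_form(url)
    if not any(is_rtl(ch) for ch in clean):
        return clean  # a pure LTR URL is left alone
    return LABEL_RE.sub(_mark_label, clean)


shown = to_display_form("http://\u05d0\u05d1.com/x")
assert to_lookup_form(shown) == "http://\u05d0\u05d1.com/x"
assert to_display_form(shown) == shown  # pasting it back in changes nothing
```

Stripping before re-inserting is what makes the paste case work: whatever controls arrive with the plain text, the result depends only on the underlying URL.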

By URL-aware*, I mean that not only is it able to parse out URLs, but it also applies the special ordering. Initially, there are no such programs. And there are many problems with this approach: the old URL-aware programs would choke on the RLMs; old programs would behave differently from new programs; &c.

Mark
Received on Wednesday, 26 May 2010 00:35:41 UTC