
RE: [bidi] Re: Special ordering for BIDI URLs

From: Shawn Steele <Shawn.Steele@microsoft.com>
Date: Tue, 25 May 2010 16:42:17 -0400
To: "Phillips, Addison" <addison@lab126.com>, Mark Davis ☕ <mark@macchiato.com>, "Aharon (Vladimir) Lanin" <aharon@google.com>
CC: Martin J. Dürst <duerst@it.aoyama.ac.jp>, "public-iri@w3.org" <public-iri@w3.org>, "bidi@unicode.org" <bidi@unicode.org>, Murray Sargent <murrays@exchange.microsoft.com>, "Nasser Kettani" <Nasser.Kettani@microsoft.com>
Message-ID: <E14011F8737B524BB564B05FF748464A0D9E6754@TK5EX14MBXC139.redmond.corp.microsoft.com>
http://” and “.html” would probably make pure RTL IRIs difficult.  Personally, I think an “if it has RTL, then render the pieces in RTL order” approach is simplest.  (So http://a.B.C/d.e.F.html would render as either http://a.B.C/d.e.F.html or html.F.e.d/C.B.a//:http, but not as some mixed form.)  I’m probably oversimplifying it, though.
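
A minimal sketch of the "if it has RTL, render the pieces in RTL order" rule, for illustration only. It assumes the ASCII convention that the example above appears to use (uppercase letters stand in for RTL letters), and it assumes that "pieces" are separated by "://", "." and "/". The function names and the `ascii_demo` flag are hypothetical, and the order of characters inside each piece is left to the renderer.

```python
import re
import unicodedata

DELIMS = ("://", ".", "/")  # assumed piece separators


def has_rtl(text, ascii_demo=False):
    """True if text contains an RTL character (uppercase in the ASCII demo)."""
    if ascii_demo:
        return any(ch.isupper() for ch in text)
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def visual_piece_order(iri, ascii_demo=False):
    """Return the pieces in the order they would appear on screen."""
    if not has_rtl(iri, ascii_demo):
        return iri  # pure LTR: leave the order alone
    tokens = re.split(r"(://|[./])", iri)
    # Reverse the pieces as whole units; mirror each delimiter so that
    # "://" reads "//:" when the entire line runs right to left.
    return "".join(t[::-1] if t in DELIMS else t for t in reversed(tokens))


print(visual_piece_order("http://a.B.C/d.e.F.html", ascii_demo=True))
```

With the example above, this produces the fully reversed form and never a mixed one; an IRI without RTL pieces is returned unchanged.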


From: Phillips, Addison [mailto:addison@lab126.com]
Sent: Tuesday, May 25, 2010 12:17 PM
To: Mark Davis ☕; Aharon (Vladimir) Lanin
Cc: Shawn Steele; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org; Murray Sargent
Subject: RE: [bidi] Re: Special ordering for BIDI URLs

(chair hat off)

Adding RTL scheme identifiers is not going to be wholly effective. Only if you can have a completely pure RTL URI (all parts: path, query, scheme, etc.) can you completely avoid ambiguity in display of unadorned plain text URIs. But I don’t think that’s a reasonable approach: we don’t call them “bi-directional” languages for no reason. There is a lot of LTR data in the world that would like to be expressed in a URI.

I see Mark's point that requiring URI-awareness in plain text is a non-starter. I think limiting to unidirectional IRIs (either all LTR or all RTL) is a non-starter: there is no migration except for *total* migration.

Thinking about “specialized bidi”, the simplest solution I can think of is: give URIs an inherent LTR directionality (which is implied, at least, by a strongly LTR scheme and the tendency of DNS names to be LTR). I think this is what Slim is suggesting. It means that you would need to insert a left-to-right override in front of a bidi URI in running plain text, or, in the case of things like address bars, behave with an inherent LTR reading order. As a rule this could be understandable to users, and, since URIs today are in the main ASCII it might be the "least surprising" to users as they migrate to placing RTL text into a URI.
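
As a rough sketch of what "insert a left-to-right override in front of a bidi URI" could look like in code: the helper names below are hypothetical, and closing the override with a pop-directional-formatting character (U+202C) is an added assumption so that the override does not leak into the text that follows the URI.

```python
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING (assumed terminator)


def force_ltr(uri, terminate=True):
    """Wrap a URI so it displays strictly left to right in plain text."""
    return LRO + uri + (PDF if terminate else "")


def strip_override(text):
    """Remove the override characters again, e.g. before resolving the URI."""
    return text.replace(LRO, "").replace(PDF, "")


uri = "http://\u05d0\u05d1\u05d2.com/\u05d3\u05d4\u05d5"  # Hebrew labels as sample data
wrapped = force_ltr(uri)
print(len(wrapped) - len(uri))  # two control characters were added
```

The override is part of the surrounding text, not of the URI itself, so any consumer that resolves the URI would need to remove it first.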

Here's an experiment (although I used actual Arabic text, I present it as ASCII here for convenience, and I use <lro> for the Unicode left-to-right override character). I typed four URIs:





I typed the above into Notepad and set the reading order to right to left and saw:





Note that the first two, in a left-to-right reading order, display as:



The LRO-bearing versions display the same in both RTL and LTR contexts, although the path element order appears backwards to RTL readers. The unadorned text versions display "normally" (to an RTL reader) only when they are predominantly right-to-left with isolated LTR runs. They look broken (I suspect even to RTL readers) when there are successive left-to-right runs.

One downside is that it doesn't work very well in a markup environment. Consider:

<a href="<lro>http://CIBARA.com">Is the LRO part of the uri?</a>

If we are to print URIs on the sides of buses or on napkins under our tea cups, I'm not sure if it would be that bad to require the left-to-right reading order inherent in URI today as a "carryover" to IRIs. While unnatural to RTL speakers in the abstract, perhaps in practice "//:http" would seem unnatural to users (because they never see URIs like that) and it doesn't require any knowledge of the interior structure of a URI to apply an overall reading order in many (but not all) contexts.

I also see the other side of this argument. I must admit that I am in agreement with the sentiments in the email John Klensin just sent [1]. I think I tend to favor a solution that is more universal over one that requires a lot of specialized handling for bidi, but in practice this ensnares us in the corner cases inherent in UBA and disadvantages, at least to some degree, speakers of languages written in RTL scripts.


[1] http://lists.w3.org/Archives/Public/public-iri/2010May/0039.html

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.

From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf Of Mark Davis ☕
Sent: Tuesday, May 25, 2010 11:31 AM
To: Aharon (Vladimir) Lanin
Cc: Shawn Steele; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org; Murray Sargent
Subject: Re: [bidi] Re: Special ordering for BIDI URLs

It looks like we are having some useful discussions. Let me try to clarify a bit of what I said. My original message was getting longish, and I know people's eyes glaze when it gets too long, so I think I wasn't clear on a couple of matters.

At a high level, there are two choices (as far as I know):

1. Market Forces. Make it possible for URLs (actually IRIs) to be completely RTL, and push sites and programs to use them. Note that part of this can be adding mechanisms to URL-aware programs to flag to users when BIDI reordering is changing the order of labels, such as flagging them with a special format.

2. Specialized BIDI. Force a consistent order on URLs, using a higher-level protocol on top of the UBA.

You mention %, which is relevant to #1 and RLM/LRMs, which are relevant to #2.

A. As far as % goes, what that means is that every label can be constructed so as to contain no LTR characters. By "label", I mean in a broad sense, so each of the three letter sequences below counts as a label.


(The scheme is an exception: it has problems that Martin and John point out, but if that alone is LTR, it is not too bad; people can handle that being reordered if it is limited to it.)
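
To make the "no LTR characters in any label" condition concrete, here is a small sketch that lists the labels of an IRI containing a strong left-to-right character, skipping the scheme. The function name and the set of label separators are assumptions for illustration. It also shows why % matters: the hex digits A-F in a percent escape are themselves strong LTR characters.

```python
import re
import unicodedata
from urllib.parse import unquote


def ltr_labels(iri):
    """Return the labels (scheme excluded) that contain a strong LTR character."""
    _scheme, sep, rest = iri.partition("://")
    if not sep:
        rest = iri  # no scheme present
    # "Label" in the broad sense: anything between the assumed separators.
    labels = [p for p in re.split(r"[./?#&=]", rest) if p]
    return [lab for lab in labels
            if any(unicodedata.bidirectional(ch) == "L" for ch in lab)]


pure = "http://\u05d0\u05d1\u05d2.\u05d3\u05d4\u05d5/\u05d6"
escaped = "http://\u05d0\u05d1\u05d2.\u05d3\u05d4\u05d5/%D7%96"

print(ltr_labels(pure))              # no offending labels
print(ltr_labels(escaped))           # the escaped label contains "D"
print(ltr_labels(unquote(escaped)))  # decoding for display removes it
```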

The % is an issue, although in an ideal world its use would be minimized in what the user sees. Although the characters have to be % encoded or punycoded to go over the web, they can be restored for display to the user. That is, a % would appear only in a label where the character has to be quoted so that it does not terminate the label. We can discuss how to handle the cases where they cannot be minimized: how sites can work around it, whether the remaining cases represent a significant problem, and if so, whether there is some alternative syntax that could be used.

Where the query string contains LTR characters, there are a couple of choices. For most people, the query part is just technical gorp. And websites are able to put whatever they want into those strings; their interpretation is private to that site. So there are a couple of approaches (at least):

  *   Not really bother with it: if it contains LTR characters then it reorders in a funny way, but since it is technical gorp we don't care.
  *   Have some simple standardized way of mapping LTR characters in the query part into bidi characters that sites can use if they want to be wholly RTL.

B. As far as RLM/LRMs, they are relevant to the Specialized BIDI approach. (As I said before, I have doubts as to whether this approach is viable, but it is worth pursuing how it could be).

What we recommend in the UBA is that if people are going to override the BIDI algorithm for any purpose, that they effectively do so by the insertion of bidi controls (we should make that recommendation clearer, however). So how would this play out with URLs?

  1.  I type a URL into an address bar. Since the program is URL-aware*, it parses out the labels. Based on whatever standard mechanism is defined (e.g. the URL contains an RTL character), it is detected as a BIDI URL, and ordered consistently. Effectively, that is done by inserting RLM at the start of each label that doesn't begin with an RTL character and at the end of each label that doesn't end with an RTL character. One could use the embedding codes, but they are more dangerous.
  2.  This is the display form: when the URL is looked up, the RLMs have to be stripped before it is transformed into punycode and %escaped.
  3.  If I cut or copy that URL, then the RLMs go with it into plain text on the clipboard.
  4.  When I paste that address into plain text, it then appears in the same order as it was in the address bar.
Take another case:

  1.  I see a URL in some plain text (whether or not it is consistently ordered), and cut and paste that plaintext URL into an address bar (or other URL-aware* program). In that case, the program renormalizes the URL. That is, it strips out all bidi controls, and then reapplies the BIDI detection and RLM insertion. I then end up with consistent ordering in the result.
Note that in no case would we expect people to manually put in the RLMs.
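
The two cases above can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed algorithm: the detection rule is simply "the URL contains an RTL character", labels are assumed to be separated by "://", ".", "/", "?", "#", "&" and "=", the scheme is treated like any other label, and all function names are hypothetical.

```python
import re
import unicodedata

RLM = "\u200F"
# LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_CONTROLS = "\u200E\u200F\u202A\u202B\u202C\u202D\u202E"
_DELIMS = re.compile(r"(://|[./?#&=])")  # assumed label separators


def is_rtl(ch):
    return unicodedata.bidirectional(ch) in ("R", "AL")


def strip_controls(url):
    """Lookup form: remove every bidi control before punycode / % escaping."""
    return url.translate({ord(c): None for c in BIDI_CONTROLS})


def display_form(url):
    """Display form: bracket each label with RLM where it lacks an RTL edge."""
    if not any(is_rtl(ch) for ch in url):
        return url  # not a BIDI URL, nothing to do
    out = []
    for i, part in enumerate(_DELIMS.split(url)):
        if i % 2 == 1 or not part:  # delimiter or empty label
            out.append(part)
            continue
        prefix = "" if is_rtl(part[0]) else RLM
        suffix = "" if is_rtl(part[-1]) else RLM
        out.append(prefix + part + suffix)
    return "".join(out)


def renormalize(pasted):
    """Pasted text: strip whatever controls came along, then re-mark."""
    return display_form(strip_controls(pasted))


url = "http://\u05d0\u05d1\u05d2.com/\u05d3\u05d4\u05d5"
shown = display_form(url)
print(strip_controls(shown) == url)  # the lookup form round-trips
```

Because `renormalize` always strips before re-marking, pasting an already marked URL, or one carrying other bidi controls, yields the same display form.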

By URL-aware*, I mean that not only is it able to parse out URLs, but it also applies the special ordering. Initially, there are no such programs. And there are many problems with this approach: the old URL-aware programs would choke on the RLMs; old programs would behave differently from new programs; &c.

Received on Tuesday, 25 May 2010 20:42:21 UTC
