Re: [bidi] Re: Special ordering for BIDI URLs from Mark Davis ☕ on 2010-05-25 (public-iri@w3.org from May 2010)

From: Mark Davis ☕ <mark@macchiato.com>
Date: Tue, 25 May 2010 15:42:39 -0700
To: Shawn Steele <Shawn.Steele@microsoft.com>
Cc: "Phillips, Addison" <addison@lab126.com>, "Aharon (Vladimir) Lanin" <aharon@google.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "public-iri@w3.org" <public-iri@w3.org>, "bidi@unicode.org" <bidi@unicode.org>, Murray Sargent <murrays@exchange.microsoft.com>, Nasser Kettani <Nasser.Kettani@microsoft.com>
Message-ID: <AANLkTilTw3rLrqN32H6mMKmsRkG9JUZxIwk6x8JmxBEh@mail.gmail.com>
I agree that http and html are issues. http is probably not a big one; if
the rest of the URL were ok, it wouldn't matter much if that were at the
start or end. html (htm, pdf, ...) are more of a problem.

It is unclear whether you mean that the ordering of the labels is always
LTR, or that the URL is treated as if it were in a LTR context.

Mark

— Il meglio è l’inimico del bene —


On Tue, May 25, 2010 at 13:42, Shawn Steele <Shawn.Steele@microsoft.com> wrote:

> “http://” and “.html” would probably make pure RTL IRIs difficult.
>  Personally, I think a “if it has RTL, then render the pieces in RTL order”
> approach is simplest.  (so http://a.B.C/d.e.F.html would render as
> http://a.B.C/d.e.F.html or html.F.e.d/C.B.a//:http, but not some mixed
> form.)  I’m probably oversimplifying it though.
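
A minimal sketch of the "render the pieces in RTL order" idea, using the same convention as the example above (uppercase stands in for RTL letters). The function names and the delimiter set are assumptions made for illustration; the sketch models only the order of the pieces, not the character order inside each piece.

```python
import re
import unicodedata


def is_rtl_char(ch):
    """True for characters with a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def render_pieces(url, is_rtl=is_rtl_char):
    """Return the pieces of `url` in display order.

    If any character is RTL, the whole sequence of pieces and delimiters
    is reversed; otherwise it is left alone. No mixed form is produced.
    """
    # Split on single-character delimiters and keep them as tokens.
    tokens = [t for t in re.split(r"([:/.?#&=])", url) if t]
    if any(is_rtl(ch) for ch in url):
        tokens.reverse()
    return "".join(tokens)


# Uppercase-as-RTL convention from the message: pass str.isupper as the test.
print(render_pieces("http://a.B.C/d.e.F.html", is_rtl=str.isupper))
```

With real text the default `is_rtl_char` test applies, so a URL containing any Hebrew or Arabic letter has its pieces reversed and a pure-ASCII URL is returned unchanged.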
>
>
>
> -Shawn
>
>
>
> *From:* Phillips, Addison [mailto:addison@lab126.com]
> *Sent:* Tuesday, May 25, 2010 12:17 PM
> *To:* Mark Davis ☕; Aharon (Vladimir) Lanin
>
> *Cc:* Shawn Steele; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org;
> Murray Sargent
> *Subject:* RE: [bidi] Re: Special ordering for BIDI URLs
>
>
>
> (chair hat off)
>
>
>
> Adding RTL scheme identifiers is not going to be wholly effective. Only if
> you can have a completely pure RTL URI (all parts: path, query, scheme,
> etc.) can you completely avoid ambiguity in display of unadorned plain text
> URIs. But I don’t think that’s a reasonable approach: we don’t call them
> “bi-directional” languages for no reason. There is a lot of LTR data in the
> world that would like to be expressed in a URI.
>
>
>
> I see Mark's point that requiring URI-awareness in plain text is a
> non-starter. I think limiting to unidirectional IRIs (either all LTR or all
> RTL) is a non-starter: there is no migration except for *total* migration.
>
>
>
> Thinking about “specialized bidi”, the simplest solution I can think of is:
> give URIs an inherent LTR directionality (which is implied, at least, by a
> strongly LTR scheme and the tendency of DNS names to be LTR). I think this
> is what Slim is suggesting. It means that you would need to insert a
> left-to-right override in front of a bidi URI in running plain text, or, in
> the case of things like address bars, behave with an inherent LTR reading
> order. As a rule this could be understandable to users, and, since URIs
> today are in the main ASCII it might be the "least surprising" to users as
> they migrate to placing RTL text into a URI.
>
>
>
> Here's an experiment (although I used actual Arabic text, I present as
> ASCII for convenience here. I use <lro> for the Unicode character). I typed
> four URIs:
>
>
>
> http://example.com/1CIBARA/2CIBARA
>
> http://CIBARA.com/1CIBARA/2CIBARA
>
> <lro>http://example.com/1CIBARA/2CIBARA
>
> <lro>http://CIBARA.com/1CIBARA/2CIBARA
>
>
>
> I typed the above into Notepad and set the reading order to right to left
> and saw:
>
>
>
> 2CIBARA/1CIBARA/http://example.com
>
> 2CIBARA/1CIBARA/com.CIBARA//:http
>
> http://example.com/1CIBARA/2CIBARA
>
> http://CIBARA.com/1CIBARA/2CIBARA
>
>
>
> Note that the first two, in a left-to-right reading order, display as:
>
>
>
> http://example.com/2CIBARA/1CIBARA
>
> http://CIBARA.com/2CIBARA/1CIBARA
>
>
>
> The LRO-bearing versions display the same in both RTL and LTR contexts,
> although the path element order appears backwards to RTL readers. The
> unadorned text versions display "normally" (to an RTL reader) only when they
> are predominantly right-to-left with isolated LTR runs. They look broken (I
> suspect even to RTL readers) when there are successive left to right runs.
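
The `<lro>` strings in the experiment can be produced with a few lines of code. This is only a sketch: `with_lro` and `without_lro` are hypothetical helper names, and the optional closing U+202C (POP DIRECTIONAL FORMATTING) is an assumption that the message itself does not mention.

```python
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE, written <lro> in the message
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING (assumption: optional terminator)


def with_lro(uri, terminate=False):
    """Prefix `uri` with LRO so that it displays left to right in plain text."""
    return LRO + uri + (PDF if terminate else "")


def without_lro(text):
    """Remove the override again, e.g. before handing the URI to a resolver."""
    return text.replace(LRO, "").replace(PDF, "")


plain = "http://example.com/path"
marked = with_lro(plain)
print(len(marked) - len(plain))  # one extra, invisible character
```

The override is invisible, so a program that receives the marked string has to strip it before treating the text as a URI.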
>
>
>
> One downside is that it doesn't work very well in a markup environment.
> Consider:
>
>
>
> <a href="<lro>http://CIBARA.com">Is the LRO
> part of the uri?</a>
>
>
>
> If we are to print URIs on the sides of buses or on napkins under our tea
> cups, I'm not sure if it would be that bad to require the left-to-right
> reading order inherent in URI today as a "carryover" to IRIs. While
> unnatural to RTL speakers in the abstract, perhaps in practice "//:http"
> would seem unnatural to users (because they never see URIs like that) and it
> doesn't require any knowledge of the interior structure of a URI to apply an
> overall reading order in many (but not all) contexts.
>
>
>
> I also see the other side of this argument. I must admit that I am in
> agreement with the sentiments in the email John Klensin just sent [1]. I
> think I tend to favor a solution that is more universal over one that
> requires a lot of specialized handling for bidi, but in practice this
> ensnares us in the corner cases inherent in UBA and disadvantages, at least
> to some degree, speakers of languages written in RTL scripts.
>
>
>
> Addison
>
>
>
> [1] http://lists.w3.org/Archives/Public/public-iri/2010May/0039.html
>
>
>
> Addison Phillips
>
> Globalization Architect (Lab126)
>
> Chair (W3C I18N, IETF IRI WGs)
>
>
>
> Internationalization is not a feature.
>
> It is an architecture.
>
>
>
> *From:* public-iri-request@w3.org [mailto:public-iri-request@w3.org] *On
> Behalf Of *Mark Davis ☕
> *Sent:* Tuesday, May 25, 2010 11:31 AM
> *To:* Aharon (Vladimir) Lanin
> *Cc:* Shawn Steele; Martin J. Dürst; public-iri@w3.org; bidi@unicode.org;
> Murray Sargent
> *Subject:* Re: [bidi] Re: Special ordering for BIDI URLs
>
>
>
> It looks like we are having some useful discussions. Let me try to clarify
> a bit of what I said. My original message was getting longish, and I know
> people's eyes glaze when it gets too long, so I think I wasn't clear on a
> couple of matters.
>
>
>
> At a high level, there are two choices (as far as I know):
>
>
>
> *1. Market Forces.* Make it possible for URLs (actually IRIs) to be
> completely RTL, and push sites and programs to use them. Note that part of
> this can be adding mechanisms to URL-aware programs to flag to users when
> BIDI reordering is changing the order of labels, such as flagging them with
> a special format.
>
>
>
> *2. Specialized BIDI. *Force a consistent order on URLs, using a
> higher-level protocol on top of the UBA.
>
>
>
> You mention %, which is relevant to #1 and RLM/LRMs, which are relevant to
> #2.
>
>
>
>
>
> *A. *As far as % goes, what that means is that every label can be
> constructed so as to contain no LTR characters. By "label", I mean in a
> broad sense, so each of the three letter sequences below counts as a label.
>
>
>
> http://abc.def.ghi/jkl/mno?pqr=stu&vwx=yza#bcd
>
>
>
> (The scheme is an exception: it has problems that Martin and John point
> out, but if that alone is LTR, it is not too bad; people can handle that
> being reordered if it is limited to it.)
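
To make the broad sense of "label" concrete, here is a rough sketch that splits the sample URL into its scheme and labels and checks whether a label contains any strongly LTR character. The helper names and the delimiter set are assumptions; ports, user info and other URL syntax are ignored.

```python
import re
import unicodedata


def split_labels(iri):
    """Return (scheme, labels), where labels are the runs between delimiters."""
    scheme, sep, rest = iri.partition("://")
    if not sep:  # no scheme present
        scheme, rest = "", iri
    labels = [p for p in re.split(r"[./?=&#]", rest) if p]
    return scheme, labels


def has_ltr(label):
    """True if the label contains a character with bidi class L."""
    return any(unicodedata.bidirectional(ch) == "L" for ch in label)


scheme, labels = split_labels("http://abc.def.ghi/jkl/mno?pqr=stu&vwx=yza#bcd")
print(scheme, labels)
print([label for label in labels if has_ltr(label)])
```

A URL would count as completely RTL in the sense of choice 1 when `has_ltr` is false for every label, with the scheme treated as the tolerated exception.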
>
>
>
> The % is an issue, although in an ideal world its use would be minimized in
> what the user sees. Although the characters have to be % encoded or
> punycoded to go over the web, they can be restored for display to the user.
> That is, a % escape would appear only in a label where the character has to
> be quoted so that it does not terminate the label. We can discuss how to
> handle the cases where they cannot be minimized; how sites can work around
> it, whether the remaining cases represent a significant problem, and if so,
> whether there is some alternative syntax that could be used.
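
The round trip between the form that goes over the web and the form shown to the user can be sketched with the Python standard library. The function names are hypothetical, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so this is an illustration rather than a recommendation for production use.

```python
from urllib.parse import quote, unquote


def wire_form(host_label, path_segment):
    """Encode one host label as Punycode and one path segment as % escapes."""
    return host_label.encode("idna").decode("ascii"), quote(path_segment)


def display_form(ascii_label, escaped_segment):
    """Restore the original characters for display to the user."""
    return ascii_label.encode("ascii").decode("idna"), unquote(escaped_segment)


label = "\u0645\u062B\u0627\u0644"  # an Arabic word, used here as sample data
wire_label, wire_path = wire_form(label, label)
print(wire_label, wire_path)
print(display_form(wire_label, wire_path) == (label, label))
```

Only the wire form contains `%` and `xn--`; the display form shows the original characters, which is the minimization described above.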
>
>
>
> Where the query string contains LTR characters, there are a couple of
> choices. For most people, the query part is just technical gorp. And
> websites are able to put whatever they want into those strings; their
> interpretation is private to that site. So there are a couple of approaches
> (at least):
>
>    - Not really bother with it: if it contains LTR characters then it
>    reorders in a funny way, but since it is technical gorp we don't care.
>    - Have some simple standardized way of mapping LTR characters in the
>    query part into bidi characters that sites can use if they want to be wholly
>    RTL.
>
>
>
> *B. *As far as RLM/LRMs, they are relevant to the Specialized BIDI
> approach. (As I said before, I have doubts as to whether this approach is
> viable, but it is worth pursuing how it could be).
>
>
>
> What we recommend in the UBA is that if people are going to override the
> BIDI algorithm for any purpose, that they effectively do so by the insertion
> of bidi controls (we should make that recommendation clearer, however). So
> how would this play out with URLs?
>
>    1. I type a URL into an address bar. Since the program is URL-aware*,
>    it parses out the labels. Based on whatever standard mechanism is defined
>    (eg the URL contains a RTL character), it is detected as a BIDI label, and
>    ordered consistently. Effectively, that is done by inserting RLM at the
>    start of each label that doesn't begin with a RTL character and at the end
>    of each label that doesn't end with a RTL character. One could use the
>    embedding codes, but they are more dangerous.
>    2. This is the display form: when the URL is looked up, the RLMs have
>    to be stripped before it is transformed into punycode and %escaped.
>    3. If I cut or copy that URL, then the RLMs go with it into plain text
>    on the clipboard.
>    4. When I paste that address into plain text, it then appears in the
>    same order as it was in the address bar.
>
> Take another case:
>
>    1. I see a URL in some plain text (whether or not it is consistently
>    ordered), and cut and paste that plaintext URL into an address bar (or other
>    URL-aware* program). In that case, the program *renormalizes* the URL.
>    That is, it strips out all bidi controls, and then reapplies the BIDI
>    detection and RLM insertion. I then end up with consistent ordering in the
>    result.
>
> Note that in no cases would we expect people to manually put in the RLMs.
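
The two scenarios above can be sketched as three small functions: one that inserts RLMs for display, one that strips bidi controls before lookup, and one that renormalizes pasted text. All names are hypothetical. The sketch assumes that "contains an RTL character" is the detection rule, that labels are the runs between a simple set of delimiters, and that the scheme is handled like any other label.

```python
import re
import unicodedata

RLM = "\u200F"
# LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_CONTROLS = "\u200E\u200F\u202A\u202B\u202C\u202D\u202E"


def is_rtl(ch):
    return unicodedata.bidirectional(ch) in ("R", "AL")


def strip_bidi_controls(url):
    """Lookup form: remove every bidi control before Punycode and % escaping."""
    return "".join(ch for ch in url if ch not in BIDI_CONTROLS)


def add_rlms(url):
    """Display form: wrap label edges that are not RTL in RLM characters."""
    if not any(is_rtl(ch) for ch in url):
        return url  # not a BIDI URL, leave untouched
    parts = re.split(r"([:/.?#&=])", url)  # labels at even indices
    for i in range(0, len(parts), 2):
        label = parts[i]
        if not label:
            continue
        if not is_rtl(label[0]):
            label = RLM + label
        if not is_rtl(label[-1]):
            label = label + RLM
        parts[i] = label
    return "".join(parts)


def renormalize(pasted):
    """Strip whatever controls came with the pasted text, then reapply."""
    return add_rlms(strip_bidi_controls(pasted))


url = "http://\u05D0\u05D1\u05D2.com/abc"
shown = add_rlms(url)
print(shown.count(RLM))
print(strip_bidi_controls(shown) == url)
```

Because `renormalize` always strips before inserting, pasting an already marked URL into a URL-aware program gives the same display form as pasting the bare one.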
>
>
>
> By URL-aware*, I mean that not only is it able to parse out URLs, but it
> also applies the special ordering. Initially, there are no such programs.
> And there are many problems with this approach: the old URL-aware programs
> would choke on the RLMs; old programs would behave differently from new
> programs; &c.
>
>
>
> Mark
>
Received on Tuesday, 25 May 2010 22:43:14 UTC