Re: [bidi] Re: Special ordering for BIDI URLs from Mark Davis ☕ on 2010-05-25 (public-iri@w3.org from May 2010)

From: Mark Davis ☕ <mark@macchiato.com>
Date: Tue, 25 May 2010 11:31:04 -0700
To: "Aharon (Vladimir) Lanin" <aharon@google.com>
Cc: Shawn Steele <Shawn.Steele@microsoft.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "public-iri@w3.org" <public-iri@w3.org>, "bidi@unicode.org" <bidi@unicode.org>, Murray Sargent <murrays@exchange.microsoft.com>
Message-ID: <AANLkTimcXSdQYBg_LYqqumCKshA4IOFGuRmwUCgggvXy@mail.gmail.com>

It looks like we are having some useful discussions. Let me try to clarify a
bit of what I said. My original message was getting longish, and I know
people's eyes glaze when it gets too long, so I think I wasn't clear on a
couple of matters.

At a high level, there are two choices (as far as I know):

*1. Market Forces.* Make it possible for URLs (actually IRIs) to be
completely RTL, and push sites and programs to use them. Note that part of
this can be adding mechanisms to URL-aware programs to flag to users when
BIDI reordering is changing the order of labels, such as flagging them with
a special format.

*2. Specialized BIDI.* Force a consistent order on URLs, using a
higher-level protocol on top of the UBA.

You mention %, which is relevant to #1, and RLM/LRMs, which are relevant to
#2.


*A.* As far as % goes, what that means is that every label can be
constructed so as to contain no LTR characters. By "label", I mean it in a
broad sense, so each of the three-letter sequences below counts as a label.

http://abc.def.ghi/jkl/mno?pqr=stu&vwx=yza#bcd

(The scheme is an exception: it has problems that Martin and John point out,
but if that alone is LTR, it is not too bad; people can handle that being
reordered if it is limited to it.)
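To make the broad sense of "label" concrete, here is a minimal sketch that splits the example URL and checks each label for strong LTR characters. The delimiter set and the names `labels` and `has_ltr` are assumptions for illustration; nothing in this thread defines them.

```python
import re
import unicodedata

# Assumed delimiter set: anything that separates labels in the example URL.
LABEL = re.compile(r"[^:/?#&=.]+")

def labels(url):
    """Return every label of the URL, in the broad sense used above."""
    return LABEL.findall(url)

def has_ltr(label):
    """True if the label contains a strong left-to-right character."""
    return any(unicodedata.bidirectional(ch) == "L" for ch in label)

url = "http://abc.def.ghi/jkl/mno?pqr=stu&vwx=yza#bcd"
for label in labels(url):
    print(label, has_ltr(label))
```

Note that the hex letters in a %-escape are themselves strong LTR characters, so a label such as `%D7%90` would be reported as containing LTR content.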

The % is an issue, although in an ideal world its use would be minimized in
what the user sees. Although the characters have to be %-encoded or
punycoded to go over the web, they can be restored for display to the user.
That is, a % would only occur in a label where a character has to be
quoted so that it does not terminate the label. We can discuss how to
handle the cases where the % cannot be minimized: how sites can work around
it, whether the remaining cases represent a significant problem, and if so,
whether there is some alternative syntax that could be used.
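As a rough sketch of restoring labels for display, the following decodes a punycoded host label and a %-encoded path segment, keeping a %-escape only where the decoded character would otherwise terminate the label. The function names are hypothetical, the set of characters kept escaped is taken from the RFC 3986 reserved set plus "%" as an assumption, and Python's built-in `idna` codec follows the older IDNA 2003 rules.

```python
from urllib.parse import unquote

# Assumption: keep RFC 3986 reserved characters (and "%" itself) escaped.
KEEP_ESCAPED = set(":/?#[]@!$&'()*+,;=%")

def display_host_label(ascii_label):
    """Turn a punycoded host label back into its Unicode form."""
    return ascii_label.encode("ascii").decode("idna")

def display_segment(segment):
    """Decode %-escapes in one label, except those guarding a delimiter."""
    decoded = unquote(segment)  # UTF-8 by default
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in KEEP_ESCAPED else ch
        for ch in decoded
    )

wire = "\u05d0\u05d1\u05d2".encode("idna").decode("ascii")  # over-the-wire form
print(wire, "->", display_host_label(wire))
print(display_segment("%D7%90%D7%91%2F%D7%92"))
```

In the last line the escaped slash stays as `%2F`, since showing it raw would split the label in two.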

Where the query string contains LTR characters, there are a couple of
choices. For most people, the query part is just technical gorp. And
websites are able to put whatever they want into those strings; their
interpretation is private to that site. So there are a couple of approaches
(at least):

   - Not really bother with it: if it contains LTR characters then it
   reorders in a funny way, but since it is technical gorp we don't care.
   - Have some simple standardized way of mapping LTR characters in the
   query part into bidi characters that sites can use if they want to be wholly
   RTL.


*B.* As far as RLM/LRMs go, they are relevant to the Specialized BIDI approach.
(As I said before, I have doubts as to whether this approach is viable, but
it is worth exploring how it could be made to work.)

What we recommend in the UBA is that if people are going to override the
BIDI algorithm for any purpose, they effectively do so by the insertion
of bidi controls (we should make that recommendation clearer, however). So
how would this play out with URLs?

   1. I type a URL into an address bar. Since the program is URL-aware*, it
   parses out the labels. Based on whatever standard mechanism is defined (e.g.,
   the URL contains an RTL character), it is detected as a BIDI URL, and
   ordered consistently. Effectively, that is done by inserting RLM at the
   start of each label that doesn't begin with an RTL character and at the end
   of each label that doesn't end with an RTL character. One could use the
   embedding codes, but they are more dangerous.
   2. This is the display form: when the URL is looked up, the RLMs have to
   be stripped before it is transformed into punycode and %-escaped.
   3. If I cut or copy that URL, then the RLMs go with it into plain text on
   the clipboard.
   3. If I cut or copy that URL, then the RLMs go with it into plain text on
   the clipboard.
   4. When I paste that address into plain text, it then appears in the same
   order as it was in the address bar.
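Steps 1 and 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the delimiter set, the "contains an RTL character" detection rule, and the names `to_display` and `to_lookup` are all hypothetical, and the scheme is treated like any other label.

```python
import re
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Assumed delimiter set; the capturing group keeps delimiters in the split.
DELIMS = re.compile(r"([:/?#&=.]+)")

def is_rtl(ch):
    """True for strong right-to-left characters (bidi classes R and AL)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")

def to_display(url):
    """Step 1: wrap labels in RLM where they do not start/end with RTL."""
    if not any(is_rtl(ch) for ch in url):
        return url  # assumed detection rule: not a BIDI URL, leave as-is
    out = []
    for i, part in enumerate(DELIMS.split(url)):
        if i % 2 == 1 or not part:  # a delimiter run, or an empty piece
            out.append(part)
            continue
        if not is_rtl(part[0]):
            part = RLM + part
        if not is_rtl(part[-1]):
            part = part + RLM
        out.append(part)
    return "".join(out)

def to_lookup(display_url):
    """Step 2: strip the RLMs before punycoding and %-escaping."""
    return display_url.replace(RLM, "")

url = "http://\u05d0\u05d1\u05d2.example/jkl"
shown = to_display(url)
print(shown.count(RLM), to_lookup(shown) == url)
```

Because RLM itself has bidi class R, running `to_display` on an already marked URL adds nothing further.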

Take another case:

   1. I see a URL in some plain text (whether or not it is consistently
   ordered), and cut and paste that plaintext URL into an address bar (or other
   URL-aware* program). In that case, the program *renormalizes* the URL.
   That is, it strips out all bidi controls, and then reapplies the BIDI
   detection and RLM insertion. I then end up with consistent ordering in the
   result.
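The renormalization step could look something like the sketch below. The list of bidi controls is explicit, and the RLM insertion is passed in as a callable (`insert_marks`, a hypothetical parameter) because this thread does not fix how that step is implemented.

```python
# Explicit bidi formatting characters: LRM, RLM, ALM, the embeddings and
# overrides with PDF, and the isolates. ALM and the isolates were added to
# Unicode after this message was written; including them is an assumption.
BIDI_CONTROLS = (
    "\u200e\u200f\u061c"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)
_STRIP = dict.fromkeys(map(ord, BIDI_CONTROLS))

def strip_bidi_controls(text):
    """Remove every bidi control, whatever the source text contained."""
    return text.translate(_STRIP)

def renormalize(pasted, insert_marks):
    """Strip all bidi controls, then reapply detection and RLM insertion."""
    return insert_marks(strip_bidi_controls(pasted))

# Stand-in for the real insertion step, just to show the flow.
mark = lambda s: "\u200f" + s
pasted = "\u202b\u200fhttp\u200f://abc\u200e.def/\u202c"
print(renormalize(pasted, mark) == renormalize("http://abc.def/", mark))
```

The point of stripping first is that the result depends only on the URL itself, not on whatever controls the surrounding plain text happened to carry.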

Note that in no case would we expect people to manually put in the RLMs.

By URL-aware*, I mean that not only is it able to parse out URLs, but it
also applies the special ordering. Initially, there are no such programs.
And there are many problems with this approach: the old URL-aware programs
would choke on the RLMs; old programs would behave differently from new
programs; &c.

Mark

Received on Tuesday, 25 May 2010 18:31:41 UTC