- From: Mark Davis ☕ <mark@macchiato.com>
- Date: Tue, 25 May 2010 11:31:04 -0700
- To: "Aharon (Vladimir) Lanin" <aharon@google.com>
- Cc: Shawn Steele <Shawn.Steele@microsoft.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "public-iri@w3.org" <public-iri@w3.org>, "bidi@unicode.org" <bidi@unicode.org>, Murray Sargent <murrays@exchange.microsoft.com>
- Message-ID: <AANLkTimcXSdQYBg_LYqqumCKshA4IOFGuRmwUCgggvXy@mail.gmail.com>
It looks like we are having some useful discussions. Let me try to clarify a bit of what I said. My original message was getting longish, and I know people's eyes glaze over when a message gets too long, so I think I wasn't clear on a couple of matters.

At a high level, there are two choices (as far as I know):

*1. Market Forces.* Make it possible for URLs (actually IRIs) to be completely RTL, and push sites and programs to use them. Note that part of this can be adding mechanisms to URL-aware programs to flag to users when BIDI reordering is changing the order of labels, such as showing those URLs in a special format.

*2. Specialized BIDI.* Force a consistent order on URLs, using a higher-level protocol on top of the UBA.

You mention %, which is relevant to #1, and RLM/LRMs, which are relevant to #2.

*A.* As far as % goes, what that means is that every label can be constructed so as to contain no LTR characters. I mean "label" in a broad sense, so each of the three-letter sequences below counts as a label.

http://abc.def.ghi/jkl/mno?pqr=stu&vwx=yza#bcd

(The scheme is an exception: it has problems that Martin and John point out, but if that alone is LTR, it is not too bad; people can handle the scheme being reordered if the reordering is limited to it.)

The % is an issue, although in an ideal world its use would be minimized in what the user sees. Although the characters have to be %-encoded or punycoded to go over the web, they can be restored for display to the user. That is, a % would occur only in a label where the character would have to be quoted in order to keep the label from being terminated. We can discuss how to handle the cases where they cannot be minimized: how sites can work around it, whether the remaining cases represent a significant problem, and if so, whether there is some alternative syntax that could be used.

Where the query string contains LTR characters, there are a couple of choices. For most people, the query part is just technical gorp.
And websites are able to put whatever they want into those strings; their interpretation is private to that site. So there are a couple of approaches (at least):

- Not really bother with it: if it contains LTR characters then it reorders in a funny way, but since it is technical gorp we don't care.
- Have some simple standardized way of mapping LTR characters in the query part into bidi characters that sites can use if they want to be wholly RTL.

*B.* As far as RLM/LRMs go, they are relevant to the Specialized BIDI approach. (As I said before, I have doubts as to whether this approach is viable, but it is worth pursuing how it could work.) What we recommend in the UBA is that if people are going to override the BIDI algorithm for any purpose, they effectively do so by inserting bidi controls (we should make that recommendation clearer, however).

So how would this play out with URLs?

1. I type a URL into an address bar. Since the program is URL-aware*, it parses out the labels. Based on whatever standard mechanism is defined (e.g. the URL contains an RTL character), it is detected as a BIDI label, and ordered consistently. Effectively, that is done by inserting RLM at the start of each label that doesn't begin with an RTL character and at the end of each label that doesn't end with an RTL character. One could use the embedding codes, but they are more dangerous.
2. This is the display form: when the URL is looked up, the RLMs have to be stripped before it is transformed into punycode and %-escaped.
3. If I cut or copy that URL, then the RLMs go with it into plain text on the clipboard.
4. When I paste that address into plain text, it then appears in the same order as it was in the address bar.

Take another case:

1. I see a URL in some plain text (whether or not it is consistently ordered), and cut and paste that plaintext URL into an address bar (or other URL-aware* program). In that case, the program *renormalizes* the URL.
That is, it strips out all bidi controls, and then reapplies the BIDI detection and RLM insertion. I then end up with consistent ordering in the result.

Note that in no case would we expect people to manually put in the RLMs.

By URL-aware*, I mean that the program is not only able to parse out URLs, but also applies the special ordering. Initially, there are no such programs. And there are many problems with this approach: the old URL-aware programs would choke on the RLMs; old programs would behave differently from new programs; &c.

Mark
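
The Specialized BIDI steps under B (detect, insert RLMs per label, strip before lookup, renormalize on paste) can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed standard: the function names, the delimiter set used to split labels, the exact set of bidi controls stripped, and the "contains any RTL character" detection rule are all hypothetical choices. The scheme is treated like any other label here, and punycode and %-escaping are left out.

```python
import re
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Assumed set of bidi controls to strip: LRM, RLM, ALM, the
# embedding/override codes, and the isolate codes.
BIDI_CONTROLS = re.compile("[\u200e\u200f\u061c\u202a-\u202e\u2066-\u2069]")

# Assumed label delimiters, matching the example URL in the message.
DELIMITERS = re.compile(r"(://|[./?&=#])")


def is_rtl(ch):
    """True for characters with strong RTL bidi class (R or AL)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def strip_bidi_controls(url):
    """Lookup form: drop all bidi controls (step 2 of the first case)."""
    return BIDI_CONTROLS.sub("", url)


def to_display_form(url):
    """Renormalize: strip existing controls, then reinsert RLMs per label."""
    url = strip_bidi_controls(url)
    # Hypothetical detection rule: any strong RTL character makes it BIDI.
    if not any(is_rtl(ch) for ch in url):
        return url
    pieces = DELIMITERS.split(url)  # odd indices hold the delimiters
    out = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1 or not piece:
            out.append(piece)  # delimiter or empty label: leave alone
            continue
        prefix = "" if is_rtl(piece[0]) else RLM
        suffix = "" if is_rtl(piece[-1]) else RLM
        out.append(prefix + piece + suffix)
    return "".join(out)
```

Because `to_display_form` always strips before inserting, applying it to an already-marked URL returns the same string, which is the property the paste case relies on. A URL with no RTL characters passes through unchanged.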
Received on Tuesday, 25 May 2010 18:31:41 UTC