- From: Aharon (Vladimir) Lanin <aharon@google.com>
- Date: Tue, 25 May 2010 14:14:44 -0400
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: Mark Davis ☕ <mark@macchiato.com>, public-iri@w3.org, bidi@unicode.org, Shawn Steele <Shawn.Steele@microsoft.com>, Murray Sargent <murrays@exchange.microsoft.com>
- Message-ID: <AANLkTiksYtj0nlzpKDvIWmZNYhak3boAtfyY0RFXPXJh@mail.gmail.com>
> When preparing the examples in the current
> IRI spec (RFC 3987), I noticed that the '%'
> character's behavior is indeed rather [quirky?]

You are indeed sadly correct. In an RTL context, "FOO.COM/%41%42" is
displayed as "%41%42/MOC.OOF" if FOO.COM is Hebrew and as
"42%41%/MOC.OOF" if FOO.COM is Arabic. The Hebrew variant suffers from
classic bidi-itis.

Thus %-escaping, even with the proposed addition of the first six Hebrew
and Arabic letters, is a problem. A new escaping scheme that does not use
% would have to be introduced to allow all-RTL URLs.

Aharon

On Tue, May 25, 2010 at 1:03 PM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:

> Hello Aharon,
>
> On 2010/05/25 17:07, Aharon (Vladimir) Lanin wrote:
>
>>> The best way to solve the problem that I can
>>> think of can be done right now. Any significant
>>> site that wants to support BIDI languages
>>> should provide for the ability to have IRIs
>>> with *all* RTL characters
>>
>> This does not seem to be practical under the current URL escaping
>> scheme, since the query string often needs to contain
>> arbitrary-language data, e.g. a search string. Let's say that data
>> happens to be Latin script. There is currently no way to encode it into
>> RTL characters. Thus, to stay uniform, the whole IRI has to become LTR.
>> This is probably a branding issue for the site, which prides itself on
>> its RTL domain name. And having the URL switch from all-RTL to all-LTR,
>> with a different domain name, when the user clicks on some link in the
>> page is probably quite confusing for the user. So, to truly allow for
>> all-RTL URIs, we need to extend URL escaping (%XX) to the RTL domain,
>> perhaps by somehow allowing decimal escapes in addition to hexadecimal
>> ones, or by allowing the first six letters of the Hebrew and Arabic
>> alphabets to be used to represent hexadecimal digits 10 through 15.
>
> When preparing the examples in the current IRI spec (RFC 3987), I noticed
> that the '%' character's behavior is indeed rather. I don't remember the
> details off the top of my head, but I would like to ask you to carefully
> check the idea with %-encoding and Arabic/Hebrew letters. It may work, but
> there might be some weird effects, so that it doesn't work as expected. If
> it turns out to work, it may be an interesting long-term addition. It
> would help getting over the problem, described in RFC 3987 as far as I
> remember, that if there's a single character in an RTL component that
> needs to be escaped, you may have to escape all characters.
>
>> But even if such hurdles were overcome, and it would become *possible*
>> for a site to phrase all the IRIs it requires without mixing LTR and RTL
>> characters, this would only reduce user confusion. Third-party documents
>> (including those originating with spoofers) would probably continue
>> formulating mixed-direction IRIs that would display differently in
>> different directional contexts, and sometimes seem like they belong to
>> the site when in fact they don't.
>>
>> So, should sites be encouraged to stop accepting mixed-direction IRIs,
>> so that they eventually become rare - and automatically suspect - on the
>> web?
>
> I personally don't think so. I think there is an important distinction
> between "behind the scenes" URIs/IRIs, long, complicated stuff that is
> difficult to fathom even in simple US-ASCII, and "front-side" URIs/IRIs,
> short things that get put on billboards and passed around on napkins,...
>
>> (For a site to detect mixed-direction IRIs would not be trivial. For
>> example, it will be hampered by having to include the domain name in the
>> check. Non-ASCII domain names are translated into (ASCII) Punycode
>> before they get to the site.
>> So, the site would have to first translate the Punycode back to the
>> original non-ASCII domain name before checking that the IRI does not
>> contain both LTR and RTL characters.)
>
> Given that in general, sites have to deal just with one or a few domain
> names (some exceptions such as blogspot.com and the like will prove this
> rule), this should be rather easy for most sites. On Apache, it may be
> possible to do it with a few well-crafted rewrite engine rules.
>
>>> Another alternative would be to use a limited
>>> set of markup within URLs so as to preserve the
>>> right ordering. It would suffice to allow RTM
>>> and LTM characters around the neutral
>>> characters.
>>
>> (The intent here is LRM and RLM, I think.)
>>
>> This approach requires a mechanism for determining which LRMs and RLMs
>> in an IRI are just optional "visual sugar" for the user, and should thus
>> be removed before further processing of the IRI, and which are an
>> integral part of the IRI.
>
> Currently, the IRI spec doesn't allow any LRMs or RLMs, so at least on the
> spec level, this isn't a problem. Any raw LRMs/RLMs would be "visual
> sugar", any that are not visual sugar would have to be escaped. We would
> have to check carefully to what extent this distinction survives various
> operations on IRIs.
>
> Regards, Martin.
>
>> Such a mechanism would have to deal with the different nature of
>> different parts of the IRI (e.g. domain name, path, and query string),
>> and would likely affect many of the layers involved in the processing of
>> an IRI: e.g. browsers (for LRMs and RLMs in the domain name before it is
>> translated into Punycode), HTTP web server software (for LRMs and RLMs
>> in the path), and the site's final code layers that process the query
>> string.
>>
>> Not trivial...
>>
>> Furthermore, we still have the same problem as above: that some
>> documents containing IRIs will bother to use LRMs and RLMs in them does
>> not mean that *all* documents will. (For example, it is difficult to
>> imagine a user manually typing an IRI into an e-mail with LRMs or RLMs.)
>> And thus, users will become used to seeing IRIs being displayed every
>> which way, making spoofing that much easier. It is not clear to me that
>> allowing the use of LRMs and RLMs in IRIs would reduce the problem or
>> make it even larger.
>>
>> Aharon
>>
>> On Tue, May 25, 2010 at 3:10 AM, Mark Davis ☕ <mark@macchiato.com> wrote:
>>
>>> There has been some discussion of having a special ordering for BIDI
>>> URLs so that they are more understandable to users. (I'll use URL in
>>> the broad sense, as including non-ASCII characters.) This is a
>>> complicated issue, and I can't claim to have all the answers, but here
>>> are some thoughts on the issue.
>>>
>>> In the Unicode consortium, we've been aware of this issue, and have
>>> considered options a number of times over the years. However, we have
>>> not yet heard a good case for how supporting uniform field direction in
>>> URLs can be done without significant compatibility and security
>>> problems. There are some big stumbling blocks:
>>>
>>> - Many clients that display URLs will either not be URL aware, or not
>>>   be aware of the latest standard, or not be able to parse out text as
>>>   definitively belonging to a URL.
>>> - The specs have no termination criteria for parsing URLs in plain
>>>   text. So http://abc.def#ghi could be "http://abc.def#ghi" or could be
>>>   "http://abc.def#ghi *could*", since fragments can include spaces.
>>>   (And in languages that don't use spaces to separate words, this is
>>>   further complicated.) Different applications have different
>>>   heuristics for this, but those heuristics don't always agree.
>>> - Many applications heuristically recognize fragment URLs, like
>>>   "google.com". So in a broad sense, people understand a URL as
>>>   "something that I could paste into an address bar in my browser and
>>>   will get me to a page", and have the expectation that they will order
>>>   similarly. That is, ordering "GOOGLE.COM" one way and
>>>   "http://GOOGLE.COM" another would be confusing.
>>>
>>> Why is ordering a problem? Suppose I have the URL http://ABC.DEF.
>>> Currently, any application that displays BIDI will do it as either
>>> http://FED.CBA (in a LTR environment) or FED.CBA://http in a RTL one.
>>> If an application starts to display it as http://CBA.FED, then it
>>> represents a significant security problem, since the user will think it
>>> is the different URL http://DEF.ABC. As long as there is a significant
>>> percentage of old applications, there will be the opportunity for that
>>> problem. The same goes for LTR URLs in a RTL environment.
>>>
>>> Moreover, if I paste text between applications, even where the
>>> paragraph direction is constant, then the labels can flip in arbitrary
>>> ways if some applications support uniform direction and some don't. The
>>> challenge is to get all applications to consistently (a) be URL aware,
>>> and (b) all switch to some new display order in unison. It might be
>>> that someone can come up with a way to handle this, but we haven't
>>> heard of one yet.
>>>
>>> (Had the importance of URL syntax been known at the time the consortium
>>> came up with the BIDI algorithm, and were the IRI syntax determinant
>>> enough that the termination could always be recognized, even in the
>>> midst of plain text, we'd be in a different world.)
>>>
>>> But we're not. The best way to solve the problem that I can think of
>>> can be done right now.
>>> Any significant site that wants to support BIDI languages should
>>> provide for the ability to have IRIs with *all* RTL characters: host
>>> name, path, query, fragment. If all the pieces are RTL text (or infixed
>>> neutrals), then the display has a consistent direction in both RTL and
>>> LTR environment, no matter whether the application is URL-aware or not,
>>> and users won't be confused. Now that the TLD can be RTL, I think there
>>> will be pressure for the sites to do that, since completely-RTL IRIs
>>> will work much better in all environments.
>>>
>>> [The one real remaining piece is the scheme; the IRI is still
>>> understandable (though ugly) if it has to be ASCII, but it would be
>>> somewhat better if it could have a RTL alias. (Pure digit fields like
>>> IP addresses are a bit ugly, but seldom used.)]
>>>
>>> Another alternative would be to use a limited set of markup within URLs
>>> so as to preserve the right ordering. It would suffice to allow RTM and
>>> LTM characters around the neutral characters. Any BIDI URL could be
>>> normalized so as to include these characters in all and only the right
>>> places, by a compliant implementation. And once this was done, then the
>>> text can be cut and copied between applications with no change in
>>> appearance.
>>>
>>> However, one would have to come up with sufficient constraints on the
>>> use of these characters so as to prevent *their* being used for
>>> spoofing, and could have a problem with breakage on older
>>> implementations. (Although in a way, breaking is better than sending
>>> people to the wrong place.)
>>>
>>> Mark
>>>
>>> — Il meglio è l’inimico del bene —
>
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
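
For illustration only, here is a rough Python sketch of the site-side check
discussed in this thread: drop raw LRM/RLM as "visual sugar", translate the
Punycode host back to Unicode, and then see whether the IRI contains both
LTR and RTL characters. The function names are made up for this example.
Treating only the bidi classes L, R and AL as directional, and ignoring the
scheme, are simplifying assumptions; this is not the full Unicode bidi
algorithm and not something RFC 3987 prescribes.

```python
import unicodedata
from urllib.parse import unquote, urlsplit

LRM, RLM = "\u200e", "\u200f"


def strip_raw_marks(iri: str) -> str:
    """Remove raw LRM/RLM characters; percent-escaped ones are left alone."""
    return iri.replace(LRM, "").replace(RLM, "")


def decode_host(host: str) -> str:
    """Translate an ASCII (xn--) host back to Unicode with the stdlib codec."""
    try:
        return host.encode("ascii").decode("idna")
    except UnicodeError:
        # Host was not plain ASCII Punycode; assume it is already Unicode.
        return host


def directions(text: str) -> set:
    """Collect the strong directions ("LTR", "RTL") present in text."""
    found = set()
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            found.add("LTR")
        elif bidi in ("R", "AL"):
            found.add("RTL")
    return found


def is_mixed_direction(iri: str) -> bool:
    """True if host, path, query and fragment mix LTR and RTL characters."""
    parts = urlsplit(strip_raw_marks(iri))
    host = decode_host(parts.hostname or "")
    # Undo %XX escapes so that "%41" is judged as the letter it stands for.
    rest = unquote(parts.path + "?" + parts.query + "#" + parts.fragment)
    return len(directions(host + rest)) == 2
```

Note that LRM and RLM themselves have bidi classes L and R, so raw marks
have to be stripped before the check, while an escaped mark (for example
%E2%80%8E) survives unquoting and counts as part of the IRI. A ".com" TLD
after a Hebrew label also makes the IRI count as mixed under this sketch.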
Received on Tuesday, 25 May 2010 18:14:48 UTC