- From: Aharon (Vladimir) Lanin <aharon@google.com>
- Date: Wed, 26 May 2010 15:20:50 +0300
- To: Shawn Steele <Shawn.Steele@microsoft.com>
- Cc: Martin J. Dürst <duerst@it.aoyama.ac.jp>, Mark Davis ☕ <mark@macchiato.com>, "public-iri@w3.org" <public-iri@w3.org>, "bidi@unicode.org" <bidi@unicode.org>, Murray Sargent <murrays@exchange.microsoft.com>
- Message-ID: <AANLkTimRe_OVruyvCk9ZCgxywi-DoeWGCggIlatu1sss@mail.gmail.com>
In an RTL context, http://foo.com?א/http://bar.com
<http://foo.com?%D7%90/http://bar.com> is displayed as
http://bar.com/א?http://foo.com

On Tue, May 25, 2010 at 7:26 PM, Shawn Steele <Shawn.Steele@microsoft.com>
wrote:

> I'd like to see such an example? Esp. if there were consistent bidi
> rules being applied.
>
> -Shawn
> ------------------------------
> *From:* Aharon (Vladimir) Lanin [aharon@google.com]
> *Sent:* Tuesday, May 25, 2010 5:31 AM
> *To:* Martin J. Dürst
> *Cc:* Mark Davis ☕; public-iri@w3.org; bidi@unicode.org; Shawn Steele;
> Murray Sargent
> *Subject:* Re: [bidi] Re: Special ordering for BIDI URLs
>
> > I personally don't think so. I think there is
> > an important distinction between "behind
> > the scenes" URIs/IRIs, long, complicated
> > stuff that is difficult to fathom even in
> > simple US-ASCII, and "front-side"
> > URIs/IRIs, short things that get put on
> > billboards and passed around on napkins
>
> True, but phishers don't make that distinction. It is all too easy to
> construct mixed-direction IRIs that at first glance look like they belong to
> a completely different domain. As long as users are used to IRIs doing funky
> things, they will not realize that there is something wrong when the IRI as
> it is displayed in the browser's address bar is radically different from the
> IRI as it was displayed in the spam e-mail they got. And ideally, we want
> such IRIs to be automatically flagged by browsers, etc.
>
> Aharon
>
>
> On Tue, May 25, 2010 at 1:03 PM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
> wrote:
>
>> Hello Aharon,
>>
>>
>> On 2010/05/25 17:07, Aharon (Vladimir) Lanin wrote:
>>
>>>> The best way to solve the problem that I can
>>>> think of can be done right now.
>>>> Any significant
>>>> site that wants to support BIDI languages
>>>> should provide for the ability to have IRIs
>>>> with *all* RTL characters
>>>>
>>>
>>> This does not seem to be practical under the current URL escaping scheme,
>>> since the query string often needs to contain arbitrary-language data, e.g.
>>> a search string. Let's say that data happens to be Latin script. There is
>>> currently no way to encode it into RTL characters. Thus, to stay uniform,
>>> the whole IRI has to become LTR. This is probably a branding issue for the
>>> site, which prides itself on its RTL domain name. And having the URL switch
>>> from all-RTL to all-LTR, with a different domain name, when the user clicks
>>> on some link in the page is probably quite confusing for the user. So, to
>>> truly allow for all-RTL URIs, we need to extend URL escaping (%XX) to the
>>> RTL domain, perhaps by somehow allowing decimal escapes in addition to
>>> hexadecimal ones, or by allowing the first six letters of the Hebrew and
>>> Arabic alphabets to represent hexadecimal digits 10 through 15.
>>>
>>
>> When preparing the examples in the current IRI spec (RFC 3987), I noticed
>> that the '%' character's behavior is indeed rather. I don't remember the
>> details off the top of my head, but I would like to ask you to carefully
>> check the idea with %-encoding and Arabic/Hebrew letters. It may work, but
>> there might be some weird effects, so that it doesn't work as expected. If
>> it turns out to work, it may be an interesting long-term addition. It would
>> help getting over the problem, described in RFC 3987 as far as I remember,
>> that if there's a single character in an RTL component that needs to be
>> escaped, you may have to escape all characters.
>>
>>
>>
>>> But even if such hurdles were overcome, and it would become *possible* for
>>> a site to phrase all the IRIs it requires without mixing LTR and RTL
>>> characters, this would only reduce user confusion. Third-party documents
>>> (including those originating with spoofers) would probably continue
>>> formulating mixed-direction IRIs that would display differently in
>>> different directional contexts, and sometimes seem like they belong to the
>>> site when in fact they don't.
>>>
>>> So, should sites be encouraged to stop accepting mixed-direction IRIs, so
>>> that they eventually become rare - and automatically suspect - on the web?
>>>
>>
>> I personally don't think so. I think there is an important distinction
>> between "behind the scenes" URIs/IRIs, long, complicated stuff that is
>> difficult to fathom even in simple US-ASCII, and "front-side" URIs/IRIs,
>> short things that get put on billboards and passed around on napkins,...
>>
>>
>>
>>> (For a site to detect mixed-direction IRIs would not be trivial. For
>>> example, it will be hampered by having to include the domain name in the
>>> check. Non-ASCII domain names are translated into (ASCII) Punycode before
>>> they get to the site. So, the site would have to first translate the
>>> Punycode back to the original non-ASCII domain name before checking that
>>> the IRI does not contain both LTR and RTL characters.)
>>>
>>
>> Given that in general, sites have to deal just with one or a few domain
>> names (some exceptions such as blogspot.com and the like will prove this
>> rule), this should be rather easy for most sites. On Apache, it may be
>> possible to do it with a few well-crafted rewrite engine rules.
>>
>>
>>
>>>> Another alternative would be to use a limited
>>>> set of markup within URLs so as to preserve the
>>>> right ordering. It would suffice to allow RTM
>>>> and LTM characters around the neutral
>>>> characters.
>>>>
>>>
>>> (The intent here is LRM and RLM, I think.)
>>>
>>> This approach requires a mechanism for determining which LRMs and RLMs in
>>> an IRI are just optional "visual sugar" for the user, and should thus be
>>> removed before further processing of the IRI, and which are an integral
>>> part of the IRI.
>>>
>>
>> Currently, the IRI spec doesn't allow any LRMs or RLMs, so at least on
>> the spec level, this isn't a problem. Any raw LRMs/RLMs would be "visual
>> sugar", any that are not visual sugar would have to be escaped. We would
>> have to check carefully to what extent this distinction survives various
>> operations on IRIs.
>>
>>
>> Regards,   Martin.
>>
>>
>>
>>> Such a mechanism would have to deal with the different nature of
>>> different parts of the IRI (e.g. domain name, path, and query string), and
>>> would likely affect many of the layers involved in the processing of an
>>> IRI: e.g. browsers (for LRMs and RLMs in the domain name before it is
>>> translated into Punycode), HTTP web server software (for LRMs and RLMs in
>>> the path), and the site's final code layers that process the query string.
>>>
>>> Not trivial...
>>>
>>> Furthermore, we still have the same problem as above: that some documents
>>> containing IRIs will bother to use LRMs and RLMs in them does not mean that
>>> *all* documents will. (For example, it is difficult to imagine a user
>>> manually typing an IRI into an e-mail with LRMs or RLMs.) And thus, users
>>> will become used to seeing IRIs being displayed every which way, making
>>> spoofing that much easier. It is not clear to me whether allowing the use
>>> of LRMs and RLMs in IRIs would reduce the problem or make it even larger.
>>>
>>> Aharon
>>>
>>> On Tue, May 25, 2010 at 3:10 AM, Mark Davis ☕ <mark@macchiato.com>
>>> wrote:
>>>
>>>> There has been some discussion of having a special ordering for BIDI
>>>> URLs so that they are more understandable to users.
>>>> (I'll use URL in the broad sense, as including non-ASCII characters.)
>>>> This is a complicated issue, and I can't claim to have all the answers,
>>>> but here are some thoughts on the issue.
>>>>
>>>> In the Unicode consortium, we've been aware of this issue, and have
>>>> considered options a number of times over the years. However, we have not
>>>> yet heard a good case for how supporting uniform field direction in URLs
>>>> can be done without significant compatibility and security problems.
>>>> There are some big stumbling blocks:
>>>>
>>>>    - Many clients that display URLs will either not be URL aware, or not
>>>>    be aware of the latest standard, or not be able to parse out text as
>>>>    definitively belonging to a URL.
>>>>    - The specs have no termination criteria for parsing URLs in plain
>>>>    text. So http://abc.def#ghi could be "http://abc.def#ghi" or could be
>>>>    "http://abc.def#ghi* could*", since fragments can include spaces. (And
>>>>    in languages that don't use spaces to separate words, this is further
>>>>    complicated.) Different applications have different heuristics for
>>>>    this, but those heuristics don't always agree.
>>>>    - Many applications heuristically recognize fragment URLs, like
>>>>    "google.com". So in a broad sense, people understand a URL as
>>>>    "something that I could paste into an address bar in my browser and
>>>>    will get me to a page", and have the expectation that they will order
>>>>    similarly. That is, ordering "GOOGLE.COM" one way and
>>>>    "http://GOOGLE.COM" another would be confusing.
>>>>
>>>> Why is ordering a problem? Suppose I have the URL http://ABC.DEF.
>>>> Currently, any application that displays BIDI will do it as either
>>>> http://FED.CBA (in a LTR environment) or FED.CBA://http in a RTL one.
>>>> If an application starts to display it as http://CBA.FED, then it
>>>> represents a significant security problem, since the user will think it
>>>> is the different URL http://DEF.ABC. As long as there is a significant
>>>> percentage of old applications, there will be the opportunity for that
>>>> problem. The same goes for LTR URLs in a RTL environment.
>>>>
>>>> Moreover, if I paste text between applications, even where the paragraph
>>>> direction is constant, then the labels can flip in arbitrary ways if some
>>>> applications support uniform direction and some don't. The challenge is
>>>> to get all applications to consistently (a) be URL aware, and (b) all
>>>> switch to some new display order in unison. It might be that someone can
>>>> come up with a way to handle this, but we haven't heard of one yet.
>>>>
>>>> (Had the importance of URL syntax been known at the time the consortium
>>>> came up with the BIDI algorithm, and were the IRI syntax determinate
>>>> enough that the termination could always be recognized, even in the midst
>>>> of plain text, we'd be in a different world.)
>>>>
>>>> But we're not. The best way to solve the problem that I can think of can
>>>> be done right now. Any significant site that wants to support BIDI
>>>> languages should provide for the ability to have IRIs with *all* RTL
>>>> characters: host name, path, query, fragment. If all the pieces are RTL
>>>> text (or infixed neutrals), then the display has a consistent direction
>>>> in both RTL and LTR environments, no matter whether the application is
>>>> URL-aware or not, and users won't be confused. Now that the TLD can be
>>>> RTL, I think there will be pressure for the sites to do that, since
>>>> completely-RTL IRIs will work much better in all environments.
>>>>
>>>> [The one real remaining piece is the scheme; the IRI is still
>>>> understandable (though ugly) if it has to be ASCII, but it would be
>>>> somewhat better if it could have a RTL alias. (Pure digit fields like IP
>>>> addresses are a bit ugly, but seldom used.)]
>>>>
>>>> Another alternative would be to use a limited set of markup within URLs
>>>> so as to preserve the right ordering. It would suffice to allow RTM and
>>>> LTM characters around the neutral characters. Any BIDI URL could be
>>>> normalized so as to include these characters in all and only the right
>>>> places, by a compliant implementation. And once this was done, then the
>>>> text can be cut and copied between applications with no change in
>>>> appearance. However, one would have to come up with sufficient
>>>> constraints on the use of these characters so as to prevent *their* being
>>>> used for spoofing, and could have a problem with breakage on older
>>>> implementations. (Although in a way, breaking is better than sending
>>>> people to the wrong place.)
>>>>
>>>> Mark
>>>>
>>>> — Il meglio è l’inimico del bene —
>>>>
>>>>
>>>
>> --
>> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
>> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
>>
>
>
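
The site-side check discussed in the thread (translate the Punycode host
name back to its non-ASCII form, then see whether the IRI contains both LTR
and RTL characters) can be sketched roughly as below. This is only a minimal
Python sketch under stated assumptions, not something proposed by anyone in
the thread: the function names are hypothetical, "mixed-direction" is assumed
to mean that strong LTR (bidi class L) and strong RTL (bidi class R or AL)
characters both occur, the ASCII scheme is ignored by default, and Python's
standard-library "idna" codec (IDNA 2003) stands in for whatever conversion a
real site would use.

```python
import unicodedata
from urllib.parse import unquote, urlsplit

# Assumption: only strong bidi classes count toward "direction".
RTL_CLASSES = {"R", "AL"}  # Hebrew-like and Arabic-like letters
LTR_CLASSES = {"L"}        # Latin-like letters


def decode_host(host: str) -> str:
    """Turn an ASCII (Punycode) host name back into its Unicode form."""
    try:
        return host.encode("ascii").decode("idna")
    except UnicodeError:
        # Host was already non-ASCII, or is not valid IDNA: keep it as given.
        return host


def strong_directions(text: str) -> set:
    """Return the set of strong directions ("LTR", "RTL") found in text."""
    found = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            found.add("RTL")
        elif bidi_class in LTR_CLASSES:
            found.add("LTR")
    return found


def is_mixed_direction(iri: str, ignore_scheme: bool = True) -> bool:
    """Hypothetical check: does the IRI mix strong LTR and RTL characters?"""
    parts = urlsplit(iri)
    host = decode_host(parts.hostname or "")
    # Undo %XX escaping so that escaped RTL letters are counted as well.
    rest = unquote(parts.path + "?" + parts.query + "#" + parts.fragment)
    text = host + rest
    if not ignore_scheme:
        text = parts.scheme + text
    return strong_directions(text) == {"LTR", "RTL"}
```

With these assumptions, is_mixed_direction("http://foo.com?א/http://bar.com")
is true, as is the same check on an IRI whose only RTL letter is escaped, such
as "http://foo.com/%D7%90". Ignoring the scheme by default follows the remark
in the thread that the scheme is the one piece that may have to stay ASCII.
The sketch does not treat raw LRM or RLM characters specially.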
Received on Wednesday, 26 May 2010 12:21:41 UTC