
Re: [bidi] Re: Special ordering for BIDI URLs

From: Aharon (Vladimir) Lanin <aharon@google.com>
Date: Wed, 26 May 2010 15:20:50 +0300
Message-ID: <AANLkTimRe_OVruyvCk9ZCgxywi-DoeWGCggIlatu1sss@mail.gmail.com>
To: Shawn Steele <Shawn.Steele@microsoft.com>
Cc: Martin J. Dürst <duerst@it.aoyama.ac.jp>, Mark Davis ☕ <mark@macchiato.com>, "public-iri@w3.org" <public-iri@w3.org>, "bidi@unicode.org" <bidi@unicode.org>, Murray Sargent <murrays@exchange.microsoft.com>
In an RTL context,
http://foo.com?א/http://bar.com is
displayed as
http://bar.com/א?http://foo.com
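A minimal sketch of how an IRI like the one above could be flagged as mixed-direction, assuming Python's standard `unicodedata` bidi classes are an acceptable approximation. The function names are hypothetical, the scheme is deliberately left out of the check, and any userinfo part is ignored. Punycode host labels are decoded first, since that is the form in which a site sees a non-ASCII domain name.

```python
import unicodedata
from urllib.parse import unquote, urlsplit

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes (Hebrew, Arabic)


def decode_host(host: str) -> str:
    """Turn Punycode ("xn--") labels back into Unicode; leave others alone."""
    labels = []
    for label in host.split("."):
        if label.lower().startswith("xn--"):
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


def strong_directions(iri: str) -> set:
    """Return the strong directions ("LTR", "RTL") found after the scheme."""
    parts = urlsplit(iri)
    # Percent-escapes are decoded so that %D7%90 counts as the letter alef.
    text = decode_host(parts.hostname or "") + unquote(
        parts.path + parts.query + parts.fragment
    )
    found = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            found.add("LTR")
        elif bidi_class in RTL_CLASSES:
            found.add("RTL")
    return found


def is_mixed_direction(iri: str) -> bool:
    """True when the IRI contains both strong LTR and strong RTL characters."""
    return strong_directions(iri) == {"LTR", "RTL"}


print(is_mixed_direction("http://foo.com?א/http://bar.com"))
```

This only detects the mixture; it does not compute the visual order, which would require a full implementation of the Unicode bidirectional algorithm.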

On Tue, May 25, 2010 at 7:26 PM, Shawn Steele <Shawn.Steele@microsoft.com> wrote:

> I'd like to see such an example, especially if consistent bidi rules
> were being applied.
>
> -Shawn
>  ------------------------------
> *From:* Aharon (Vladimir) Lanin [aharon@google.com]
> *Sent:* Tuesday, May 25, 2010 5:31 AM
> *To:* Martin J. Dürst
> *Cc:* Mark Davis ☕; public-iri@w3.org; bidi@unicode.org; Shawn Steele;
> Murray Sargent
> *Subject:* Re: [bidi] Re: Special ordering for BIDI URLs
>
> > I personally don't think so. I think there is
> > an important distinction between "behind
> > the scenes" URIs/IRIs, long, complicated
> > stuff that is difficult to fathom even in
> > simple US-ASCII, and "front-side"
> > URIs/IRIs, short things that get put on
> > billboards and passed around on napkins
>
> True, but phishers don't make that distinction. It is all too easy to
> construct mixed-direction IRIs that at first glance look like they belong to
> a completely different domain. As long as users are used to IRIs doing funky
> things, they will not realize that there is something wrong when the IRI as
> it is displayed in the browser's address bar is radically different from the
> IRI as it was displayed in the spam e-mail they got. And ideally, we want
> such IRIs to be automatically flagged by browsers, etc.
>
> Aharon
>
>
> On Tue, May 25, 2010 at 1:03 PM, "Martin J. Dürst"
> <duerst@it.aoyama.ac.jp> wrote:
>
>> Hello Aharon,
>>
>>
>> On 2010/05/25 17:07, Aharon (Vladimir) Lanin wrote:
>>
>>>> The best way to solve the problem that I can
>>>> think of can be done right now. Any significant
>>>> site that wants to support BIDI languages
>>>> should provide for the ability to have IRIs
>>>> with *all* RTL characters
>>>>
>>>
>>> This does not seem to be practical under the current URL escaping scheme,
>>> since the query string often needs to contain arbitrary-language data,
>>> e.g. a search string. Let's say that data happens to be in Latin script.
>>> There is currently no way to encode it into RTL characters. Thus, to stay
>>> uniform, the whole IRI has to become LTR. This is probably a branding
>>> issue for the site, which prides itself on its RTL domain name. And having
>>> the URL switch from all-RTL to all-LTR, with a different domain name, when
>>> the user clicks on some link in the page is probably quite confusing for
>>> the user. So, to truly allow for all-RTL URIs, we need to extend URL
>>> escaping (%XX) to the RTL domain, perhaps by somehow allowing decimal
>>> escapes in addition to hexadecimal ones, or by allowing the first six
>>> letters of the Hebrew and Arabic alphabets to be used to represent
>>> hexadecimal digits 10 through 15.
>>>
>>
>>  When preparing the examples in the current IRI spec (RFC 3987), I noticed
>> that the '%' character's behavior is indeed rather. I don't remember the
>> details off the top of my head, but I would like to ask you to carefully check the idea
>> with %-encoding and Arabic/Hebrew letters. It may work, but there might be
>> some weird effects, so that it doesn't work as expected. If it turns out to
>> work, it may be an interesting long-term addition. It would help getting
>> over the problem, described in RFC 3987 as far as I remember, that if
>> there's a single character in an RTL component that needs to be escaped, you
>> may have to escape all characters.
>>
>>
>>
>>> But even if such hurdles were overcome, and it became *possible* for a
>>> site to phrase all the IRIs it requires without mixing LTR and RTL
>>> characters, this would only reduce user confusion. Third-party documents
>>> (including those originating with spoofers) would probably continue
>>> formulating mixed-direction IRIs that would display differently in
>>> different directional contexts, and sometimes seem like they belong to the
>>> site when in fact they don't.
>>>
>>> So, should sites be encouraged to stop accepting mixed-direction IRIs, so
>>> that they eventually become rare - and automatically suspect - on the web?
>>>
>>
>>  I personally don't think so. I think there is an important distinction
>> between "behind the scenes" URIs/IRIs, long, complicated stuff that is
>> difficult to fathom even in simple US-ASCII, and "front-side" URIs/IRIs,
>> short things that get put on billboards and passed around on napkins,...
>>
>>
>>
>>> (For a site to detect mixed-direction IRIs would not be trivial. For
>>> example, it will be hampered by having to include the domain name in the
>>> check. Non-ASCII domain names are translated into (ASCII) Punycode before
>>> they get to the site. So, the site would have to first translate the
>>> Punycode back to the original non-ASCII domain name before checking that
>>> the IRI does not contain both LTR and RTL characters.)
>>>
>>
>>  Given that in general, sites have to deal just with one or a few domain
>> names (some exceptions such as blogspot.com and the like will prove this
>> rule), this should be rather easy for most sites. On Apache, it may be
>> possible to do it with a few well-crafted rewrite engine rules.
>>
>>
>>
>>>> Another alternative would be to use a limited
>>>> set of markup within URLs so as to preserve the
>>>> right ordering. It would suffice to allow RTM
>>>> and LTM characters around the neutral
>>>> characters.
>>>>
>>>
>>> (The intent here is LRM and RLM, I think.)
>>>
>>> This approach requires a mechanism for determining which LRMs and RLMs in
>>> an IRI are just optional "visual sugar" for the user, and should thus be
>>> removed before further processing of the IRI, and which are an integral
>>> part of the IRI.
>>>
>>
>>  Currently, the IRI spec doesn't allow any LRMs or RLMs, so at least on
>> the spec level, this isn't a problem. Any raw LRMs/RLMs would be "visual
>> sugar", any that are not visual sugar would have to be escaped. We would
>> have to check carefully to what extent this distinction survives various
>> operations on IRIs.
>>
>>
>> Regards,    Martin.
>>
>>
>>
>>> Such a mechanism would have to deal with the different nature of
>>> different parts of the IRI (e.g. domain name, path, and query string), and
>>> would likely affect many of the layers involved in the processing of an
>>> IRI: e.g. browsers (for LRMs and RLMs in the domain name before it is
>>> translated into Punycode), HTTP web server software (for LRMs and RLMs in
>>> the path), and the site's final code layers that process the query string.
>>>
>>> Not trivial...
>>>
>>> Furthermore, we still have the same problem as above: that some documents
>>> containing IRIs will bother to use LRMs and RLMs in them does not mean
>>> that *all* documents will. (For example, it is difficult to imagine a user
>>> manually typing an IRI into an e-mail with LRMs or RLMs.) And thus, users
>>> will become used to seeing IRIs being displayed every which way, making
>>> spoofing that much easier. It is not clear to me whether allowing the use
>>> of LRMs and RLMs in IRIs would reduce the problem or make it even larger.
>>>
>>> Aharon
>>>
>>> On Tue, May 25, 2010 at 3:10 AM, Mark Davis ☕ <mark@macchiato.com> wrote:
>>>
>>>> There has been some discussion of having a special ordering for BIDI URLs
>>>> so that they are more understandable to users. (I'll use URL in the broad
>>>> sense, as including non-ASCII characters.) This is a complicated issue,
>>>> and I can't claim to have all the answers, but here are some thoughts on
>>>> the issue.
>>>>
>>>> In the Unicode consortium, we've been aware of this issue, and have
>>>> considered options a number of times over the years. However, we have not
>>>> yet heard a good case for how supporting uniform field direction in URLs
>>>> can be done without significant compatibility and security problems.
>>>> There are some big stumbling blocks:
>>>>
>>>>    - Many clients that display URLs will either not be URL aware, or not
>>>>    be aware of the latest standard, or not be able to parse out text as
>>>>    definitively belonging to a URL.
>>>>    - The specs have no termination criteria for parsing URLs in plain
>>>>    text. So http://abc.def#ghi could be "http://abc.def#ghi" or could be
>>>>    "http://abc.def#ghi *could*", since fragments can include spaces. (And
>>>>    in languages that don't use spaces to separate words, this is further
>>>>    complicated.) Different applications have different heuristics for
>>>>    this, but those heuristics don't always agree.
>>>>    - Many applications heuristically recognize fragment URLs, like
>>>>    "google.com". So in a broad sense, people understand a URL as
>>>>    "something that I could paste into an address bar in my browser and
>>>>    will get me to a page", and have the expectation that they will order
>>>>    similarly. That is, ordering "GOOGLE.COM" one way and
>>>>    "http://GOOGLE.COM" another would be confusing.
>>>>
>>>> Why is ordering a problem? Suppose I have the URL http://ABC.DEF.
>>>> Currently, any application that displays BIDI will do it as either
>>>> http://FED.CBA (in an LTR environment) or FED.CBA://http (in an RTL one).
>>>> If an application starts to display it as http://CBA.FED, then it
>>>> represents a significant security problem, since the user will think it
>>>> is the different URL http://DEF.ABC. As long as there is a significant
>>>> percentage of old applications, there will be the opportunity for that
>>>> problem. The same goes for LTR URLs in an RTL environment.
>>>>
>>>> Moreover, if I paste text between applications, even where the paragraph
>>>> direction is constant, then the labels can flip in arbitrary ways if some
>>>> applications support uniform direction and some don't. The challenge is
>>>> to get all applications to consistently (a) be URL aware, and (b) switch
>>>> to some new display order in unison. It might be that someone can come up
>>>> with a way to handle this, but we haven't heard of one yet.
>>>>
>>>> (Had the importance of URL syntax been known at the time the consortium
>>>> came up with the BIDI algorithm, and were the IRI syntax determinate
>>>> enough that the termination could always be recognized, even in the midst
>>>> of plain text, we'd be in a different world.)
>>>>
>>>> But we're not. The best way to solve the problem that I can think of can
>>>> be done right now. Any significant site that wants to support BIDI
>>>> languages should provide for the ability to have IRIs with *all* RTL
>>>> characters: host name, path, query, fragment. If all the pieces are RTL
>>>> text (or infixed neutrals), then the display has a consistent direction
>>>> in both RTL and LTR environments, no matter whether the application is
>>>> URL-aware or not, and users won't be confused. Now that the TLD can be
>>>> RTL, I think there will be pressure for the sites to do that, since
>>>> completely-RTL IRIs will work much better in all environments.
>>>>
>>>> [The one real remaining piece is the scheme; the IRI is still
>>>> understandable (though ugly) if it has to be ASCII, but it would be
>>>> somewhat better if it could have an RTL alias. (Pure digit fields like IP
>>>> addresses are a bit ugly, but seldom used.)]
>>>>
>>>> Another alternative would be to use a limited set of markup within URLs
>>>> so as to preserve the right ordering. It would suffice to allow RTM and
>>>> LTM characters around the neutral characters. Any BIDI URL could be
>>>> normalized by a compliant implementation so as to include these
>>>> characters in all and only the right places. And once this was done, the
>>>> text could be cut and pasted between applications with no change in
>>>> appearance. However, one would have to come up with sufficient
>>>> constraints on the use of these characters so as to prevent *their* being
>>>> used for spoofing, and could have a problem with breakage on older
>>>> implementations. (Although in a way, breaking is better than sending
>>>> people to the wrong place.)
>>>>
>>>> Mark
>>>>
>>>> — Il meglio è l’inimico del bene —
>>>>
>>>>
>>>
>>   --
>> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
>> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
>>
>
>
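A rough sketch of the rule Martin describes in the thread above, under which raw LRMs/RLMs are "visual sugar" and any significant ones have to be percent-escaped. The function names are hypothetical, and this is only an illustration of the idea, not anything specified by RFC 3987.

```python
from urllib.parse import quote, unquote

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def strip_visual_sugar(iri: str) -> str:
    """Drop raw LRM/RLM characters; percent-escaped ones are left untouched."""
    return iri.replace(LRM, "").replace(RLM, "")


def escape_mark(mark: str) -> str:
    """Percent-escape a mark that is meant to be part of the IRI itself."""
    return quote(mark, safe="")


# A displayed IRI with one raw LRM (sugar) and one escaped RLM (significant).
displayed = "http://foo.com/" + LRM + "a" + escape_mark(RLM) + "b"
processed = strip_visual_sugar(displayed)
print(processed)
```

The order of operations matters here: if a layer percent-decodes the IRI before stripping, the escaped mark becomes indistinguishable from the sugar, which is the kind of survival problem the thread says would need careful checking.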
Received on Wednesday, 26 May 2010 12:21:41 GMT
