Re: [bidi] Re: Special ordering for BIDI URLs from Aharon (Vladimir) Lanin on 2010-05-25 (public-iri@w3.org from May 2010)

From: Aharon (Vladimir) Lanin <aharon@google.com>
Date: Tue, 25 May 2010 14:14:57 -0400
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Mark Davis ☕ <mark@macchiato.com>, public-iri@w3.org, bidi@unicode.org, Shawn Steele <Shawn.Steele@microsoft.com>, Murray Sargent <murrays@exchange.microsoft.com>
Message-ID: <AANLkTilt5LROn_EpcIxf34ngT0UsTeeWnlCQMMVGHp7B@mail.gmail.com>
> I personally don't think so. I think there is
> an important distinction between "behind
> the scenes" URIs/IRIs, long, complicated
> stuff that is difficult to fathom even in
> simple US-ASCII, and "front-side"
> URIs/IRIs, short things that get put on
> billboards and passed around on napkins

True, but phishers don't make that distinction. It is all too easy to
construct mixed-direction IRIs that at first glance look like they belong to
a completely different domain. As long as users are used to IRIs doing funky
things, they will not realize that there is something wrong when the IRI as
it is displayed in the browser's address bar is radically different from the
IRI as it was displayed in the spam e-mail they got. And ideally, we want
such IRIs to be automatically flagged by browsers, etc.

Aharon


On Tue, May 25, 2010 at 1:03 PM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp>wrote:

> Hello Aharon,
>
>
> On 2010/05/25 17:07, Aharon (Vladimir) Lanin wrote:
>
>> The best way to solve the problem that I can
>>> think of can be done right now. Any significant
>>> site that wants to support BIDI languages
>>> should provide for the ability to have IRIs
>>> with *all *RTL characters
>>>
>>
>> This does not seem to be practical under the current URL escaping scheme,
>> since the query string often needs to contain arbitrary-language data,
>> e.g.
>> a search string. Let's say that data happens to be Latin script. There is
>> currently no way to encode it into RTL characters. Thus, to stay uniform,
>> the whole IRI has to become LTR. This is probably a branding issue for the
>> site, which prides itself on its RTL domain name. And having the URL
>> switch
>> from all-RTL to all-LTR, with a different domain name, when the user
>> clicks
>> on some link in the page is probably quite confusing for the user. So, to
>> truly allow for all-RTL URIs, we need to extend URL escaping (%XX) to the
>> RTL domain, perhaps by somehow allowing decimal escapes in addition to
>> hexadecimal ones, or by allowing using the first six letters of the Hebrew
>> and Arabic alphabets to be used to represent hexadecimal digits 10 through
>> 15.
>>
>
> When preparing the examples in the current IRI spec (RFC 3987), I noticed
> that the '%' character's behavior is indeed rather. I don't remember the
> details off my head, but I would like to ask you to carefully check the idea
> with %-encoding and Arabic/Hebrew letters. It may work, but there might be
> some weird effects, so that it doesn't work as expected. If it turns out to
> work, it may be an interesting long-term addition. It would help getting
> over the problem, described in RFC 3987 as far as I remember, that if
> there's a single character in an RTL component that needs to be escaped, you
> may have to escape all characters.
>
>
>
>  But even if such hurdles were overcome, and it would become *possible* for
>> a
>> site to phrase all the IRIs it requires without mixing LTR and RTL
>> characters, this would only reduce user confusion. Third-party documents
>> (including those originating with spoofers) would probably continue
>> formulating mixed-direction IRIs that would display differently in
>> different
>> directional contexts, and sometimes seem like they belong to the site when
>> in fact they don't.
>>
>> So, should sites be encouraged to stop accepting mixed-direction IRIs, so
>> that they eventually become rare - and automatically suspect - on the web?
>>
>
> I personally don't think so. I think there is an important distinction
> between "behind the scenes" URIs/IRIs, long, complicated stuff that is
> difficult to fathom even in simple US-ASCII, and "front-side" URIs/IRIs,
> short things that get put on billboards and passed around on napkins,...
>
>
>
>  (For a site to detect mixed-direction IRIs would not be trivial. For
>> example, it will be hampered by having to include the domain name in the
>> check. Non-ASCII domain names are translated into (ASCII) punicode before
>> they get to the site. So, the site would have to first translate the
>> punicode back to the original non-ASCII domain name before checking that
>> the
>> IRI does not contain both LTR and RTL characters.)
>>
>
> Given that in general, sites have to deal just with one or a few domain
> names (some exceptions such as blogspot.com and the like will prove this
> rule), this should be rather easy for most sites. On Apache, it may be
> possible to do it with a few well-crafted rewrite engine rules.
>
>
>
>  Another alternative would be to use a limited
>>> set of markup within URLs so as to preserve the
>>> right ordering. It would suffice to allow RTM
>>> and LTM characters around the neutral
>>> characters.
>>>
>>
>> (The intent here is LRM and RLM, I think.)
>>
>> This approach requires a mechanism for determining which LRMs and RLMs in
>> an
>> IRI are just optional "visual sugar" for the user, and should thus be
>> removed before further processing of the IRI, and which are an integral
>> part
>> of the IRI.
>>
>
> Currently, the IRI spec doesn't allow any LRMs or RLMs, so at least on the
> spec level, this isn't a problem. Any raw LRMs/RLMs would be "visual sugar",
> any that are not visual sugar would have to be escaped. We would have to
> check carefully to what extent this distinction survives various operations
> on IRIs.
>
>
> Regards,    Martin.
>
>
>
>  Such a mechanism would have to deal with the different nature of
>> different parts of the IRI (e.g. domain name, path, and query string), and
>> would likely affect many of the layers involved in the processing of an
>> IRI:
>> e.g. browsers (for LRMs and RLMs in the domain name before it is
>> translated
>> into punicode), HTTP web server software (for LRMs and RLMs in the path),
>> and the site's final code layers that process the query string.
>>
>> Not trivial...
>>
>> Furthermore, we still have the same problem as above: that some documents
>> containing IRIs will bother to use LRMs and RLMs in them does not mean
>> that
>> *all* documents will. (For example, it is difficult to imagine a user
>> manually typing an IRI into an e-mail with LRMs or RLMs.) And thus, users
>> will become used to seeing IRIs being displayed every which way, making
>> spoofing that much easier. It is not clear to me that allowing the use of
>> LRMs and RLMs in IRIs would reduce the problem or make it even larger.
>>
>> Aharon
>>
>> On Tue, May 25, 2010 at 3:10 AM, Mark Davis ☕<mark@macchiato.com>  wrote:
>>
>>  There has been some discussion of having a special ordering for BIDI URLs
>>> so that they are more understandable to users. (I'll use URL in the broad
>>> sense, as including non-ASCII characters.) This is a complicated issue,
>>> and
>>> I can't claim to have all the answers, but here are some thoughts on the
>>> issue.
>>>
>>> In the Unicode consortium, we've been aware of this issue, and have
>>> considered options a number of times over the years. However, we have not
>>> yet heard a good case for how supporting uniform field direction in URLs
>>> can
>>> be done without significant compatibility and security problems. There
>>> are
>>> some big stumbling blocks:
>>>
>>>    - Many clients that display URLs will either not be URL aware, or not
>>>    be aware of the latest standard, or not be able to parse out text as
>>>    definitively belonging to a URL.
>>>    - The specs have no termination criteria for parsing URLs in plain
>>>    text. So http://abc.def#ghi could be "http://abc.def#ghi" or could be
>>> "
>>>    http://abc.def#ghi* could*", since fragments can include spaces. (And
>>>    in languages that don't use spaces to separate words, this is further
>>>    complicated.) Different applications have different heuristics for
>>> this, but
>>>    those heuristics don't always agree.
>>>    - Many applications heuristically recognize fragment URLs, like "
>>>    google.com". So in a broad sense, people understand a URL as
>>> "something
>>>    that I could paste into an address bar in my browser and will get me
>>> to a
>>>    page", and have the expectation that they will order similarly. That
>>> is,
>>>    ordering "GOOGLE.COM" one way and "http://GOOGLE.COM" another would
>>> be
>>>    confusing.
>>>
>>> Why is ordering a problem? Suppose I have the URL http://ABC.DEF.
>>> Currently, any application that displays BIDI will do it as either
>>> http://FED.CBA ( in a LTR environment) or FED.CBA://http in a RTL one.
>>> If
>>> an application starts to display it as http://CBA.FED, then it
>>> represents
>>> a significant security problem, since the user will think it is the
>>> different URL http://DEF.ABC. As long as there is significant percentage
>>> of old applications, there will be the opportunity for that problem. The
>>> same goes for LTR URLs in a RTL environment.
>>>
>>> Moreover, if I paste text between applications, even where the paragraph
>>> direction is constant, then the labels can flip in arbitrary ways if some
>>> applications support uniform direction and some don't. The challenge is
>>> to
>>> get all applications to consistently (a) be URL aware, and (b) all switch
>>> to
>>> some new display order in unison. It might be that someone can come up
>>> with
>>> a way to handle this, but we haven't heard of one yet.
>>>
>>> (Had the importance of URL syntax been known at the time the consortium
>>> came up with the BIDI algorithm, and were the IRI syntax determinant
>>> enough
>>> that the termination could always be recognized, even in the midst of
>>> plain
>>> text, we'd be in a different world.)
>>>
>>> But we're not. The best way to solve the problem that I can think of can
>>> be
>>> done right now. Any significant site that wants to support BIDI languages
>>> should provide for the ability to have IRIs with *all *RTL characters:
>>> host name, path, query, fragment. If all the pieces are RTL text (or
>>> infixed
>>> neutrals), than the display has a consistent direction in both RTL and
>>> LTR
>>> environment, no matter whether the application is URL-aware or not, and
>>> users won't be confused. Now that the TLD can be RTL, I think there will
>>> be
>>> pressure for the sites to do that, since completely-RTL IRIs will work
>>> much
>>> better in all environments.
>>>
>>> [The one real remaining piece is the scheme; the IRI is still
>>> understandable (though ugly) if it has to be ASCII, but it would be
>>> somewhat
>>> better if it could have a RTL alias.  (Pure digit fields like IP
>>> addresses
>>> are a bit ugly, but seldom used.)]
>>>
>>> Another alternative would be to use a limited set of markup within URLs
>>> so
>>> as to preserve the right ordering. It would suffice to allow RTM and LTM
>>> characters around the neutral characters. Any BIDI URL could be
>>> normalized
>>> so as to include these characters in all and only the right places, by a
>>> compliant implementation. And once this was done, then the text can be
>>> cut
>>> and copies between applications with no change in appearance.
>>> However, one would come up with sufficient constraints on the use of
>>> these
>>> characters so as to prevent *their* being used for spoofing, and could
>>> have a problem with breakage on older implementations. (Although in a
>>> way,
>>> breaking is better than sending people to the wrong place.)
>>>
>>> Mark
>>>
>>> — Il meglio è l’inimico del bene —
>>>
>>>
>>
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
>
Received on Tuesday, 25 May 2010 18:15:00 UTC