Re: [bidi] Re: Special ordering for BIDI URLs from Aharon (Vladimir) Lanin on 2010-05-25 (public-iri@w3.org from May 2010)

From: Aharon (Vladimir) Lanin <aharon@google.com>
Date: Tue, 25 May 2010 14:14:44 -0400
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Mark Davis ☕ <mark@macchiato.com>, public-iri@w3.org, bidi@unicode.org, Shawn Steele <Shawn.Steele@microsoft.com>, Murray Sargent <murrays@exchange.microsoft.com>
Message-ID: <AANLkTiksYtj0nlzpKDvIWmZNYhak3boAtfyY0RFXPXJh@mail.gmail.com>
> When preparing the examples in the current
> IRI spec (RFC 3987), I noticed that the '%'
> character's behavior is indeed rather [quirky?]

You are indeed sadly correct.

In an RTL context, "FOO.COM/%41%42"
is displayed as "%41%42/MOC.OOF" if FOO.COM is Hebrew
and as "42%41%/MOC.OOF" if FOO.COM is Arabic.

The Hebrew variant suffers from classic bidi-itis.
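
A minimal sketch, assuming Python's standard unicodedata module, of the
character classes behind this difference. Going by UAX #9 (rule W2), European
digits that follow an Arabic letter are treated as Arabic numbers, and the '%'
terminators then no longer attach to the digits (rule W5) and fall back to
neutrals; after a Hebrew letter the digits stay European and the '%' signs join
them. The sample letters and the helper name bidi_classes are illustrative
choices only, and the sketch reports classes without running the full bidi
algorithm.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, bidi class) pairs from Python's bundled Unicode data."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Stand-ins for one letter of a Hebrew and of an Arabic "FOO.COM" (illustrative).
HEBREW_ALEF = "\u05D0"
ARABIC_ALEF = "\u0627"

for label, letter in (("Hebrew", HEBREW_ALEF), ("Arabic", ARABIC_ALEF)):
    # The class of the preceding strong letter decides how the digits resolve.
    print(label, unicodedata.bidirectional(letter))

# '/' is a common separator, '%' a European terminator, digits European numbers.
print(bidi_classes("/%41%42"))
```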

Thus %-escaping, even with the proposed addition of the first six Hebrew and
Arabic letters, is a problem. A new escaping scheme that does not use %
would have to be introduced to allow all-RTL URLs.
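
To make the objection concrete, here is a sketch of what the proposed escaping
could look like, assuming (hypothetically) that Hebrew alef through vav stand
for hexadecimal digits 10 through 15. The function names and the digit table
are invented for illustration; no such scheme is defined anywhere. The point is
only that the escaped output still contains '%' (class ET) and, for most bytes,
ASCII digits (class EN), which are the characters involved in the reordering
described above.

```python
import unicodedata

# Hypothetical digit table: 0-9 as usual, then Hebrew alef..vav for 10..15.
HEBREW_HEX_DIGITS = "0123456789\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5"

def hebrew_percent_escape(data: bytes) -> str:
    """Escape every byte as '%' plus two digits from the hypothetical table."""
    return "".join(
        "%" + HEBREW_HEX_DIGITS[byte >> 4] + HEBREW_HEX_DIGITS[byte & 0x0F]
        for byte in data
    )

def bidi_class_set(text: str) -> set:
    """Collect the distinct bidi classes that occur in the text."""
    return {unicodedata.bidirectional(ch) for ch in text}

escaped = hebrew_percent_escape("J".encode("ascii"))  # single byte 0x4A
print(escaped)
print(sorted(bidi_class_set(escaped)))
```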

Aharon

On Tue, May 25, 2010 at 1:03 PM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:

> Hello Aharon,
>
>
> On 2010/05/25 17:07, Aharon (Vladimir) Lanin wrote:
>
>>> The best way to solve the problem that I can
>>> think of can be done right now. Any significant
>>> site that wants to support BIDI languages
>>> should provide for the ability to have IRIs
>>> with *all* RTL characters
>>>
>>
>> This does not seem to be practical under the current URL escaping scheme,
>> since the query string often needs to contain arbitrary-language data, e.g.
>> a search string. Let's say that data happens to be Latin script. There is
>> currently no way to encode it into RTL characters. Thus, to stay uniform,
>> the whole IRI has to become LTR. This is probably a branding issue for the
>> site, which prides itself on its RTL domain name. And having the URL switch
>> from all-RTL to all-LTR, with a different domain name, when the user clicks
>> on some link in the page is probably quite confusing for the user. So, to
>> truly allow for all-RTL URIs, we need to extend URL escaping (%XX) to the
>> RTL domain, perhaps by somehow allowing decimal escapes in addition to
>> hexadecimal ones, or by allowing the first six letters of the Hebrew and
>> Arabic alphabets to represent hexadecimal digits 10 through 15.
>>
>
> When preparing the examples in the current IRI spec (RFC 3987), I noticed
> that the '%' character's behavior is indeed rather. I don't remember the
> details off the top of my head, but I would like to ask you to carefully check the idea
> with %-encoding and Arabic/Hebrew letters. It may work, but there might be
> some weird effects, so that it doesn't work as expected. If it turns out to
> work, it may be an interesting long-term addition. It would help getting
> over the problem, described in RFC 3987 as far as I remember, that if
> there's a single character in an RTL component that needs to be escaped, you
> may have to escape all characters.
>
>
>
>> But even if such hurdles were overcome, and it would become *possible* for a
>> site to phrase all the IRIs it requires without mixing LTR and RTL
>> characters, this would only reduce user confusion. Third-party documents
>> (including those originating with spoofers) would probably continue
>> formulating mixed-direction IRIs that would display differently in different
>> directional contexts, and sometimes seem like they belong to the site when
>> in fact they don't.
>>
>> So, should sites be encouraged to stop accepting mixed-direction IRIs, so
>> that they eventually become rare - and automatically suspect - on the web?
>>
>
> I personally don't think so. I think there is an important distinction
> between "behind the scenes" URIs/IRIs, long, complicated stuff that is
> difficult to fathom even in simple US-ASCII, and "front-side" URIs/IRIs,
> short things that get put on billboards and passed around on napkins,...
>
>
>
>> (For a site to detect mixed-direction IRIs would not be trivial. For
>> example, it will be hampered by having to include the domain name in the
>> check. Non-ASCII domain names are translated into (ASCII) Punycode before
>> they get to the site. So, the site would have to first translate the
>> Punycode back to the original non-ASCII domain name before checking that the
>> IRI does not contain both LTR and RTL characters.)
>>
>
> Given that in general, sites have to deal just with one or a few domain
> names (some exceptions such as blogspot.com and the like will prove this
> rule), this should be rather easy for most sites. On Apache, it may be
> possible to do it with a few well-crafted rewrite engine rules.
>
>
>
>>> Another alternative would be to use a limited
>>> set of markup within URLs so as to preserve the
>>> right ordering. It would suffice to allow RTM
>>> and LTM characters around the neutral
>>> characters.
>>>
>>
>> (The intent here is LRM and RLM, I think.)
>>
>> This approach requires a mechanism for determining which LRMs and RLMs in an
>> IRI are just optional "visual sugar" for the user, and should thus be
>> removed before further processing of the IRI, and which are an integral part
>> of the IRI.
>>
>
> Currently, the IRI spec doesn't allow any LRMs or RLMs, so at least on the
> spec level, this isn't a problem. Any raw LRMs/RLMs would be "visual sugar",
> any that are not visual sugar would have to be escaped. We would have to
> check carefully to what extent this distinction survives various operations
> on IRIs.
>
>
> Regards,    Martin.
>
>
>
>> Such a mechanism would have to deal with the different nature of
>> different parts of the IRI (e.g. domain name, path, and query string), and
>> would likely affect many of the layers involved in the processing of an IRI:
>> e.g. browsers (for LRMs and RLMs in the domain name before it is translated
>> into Punycode), HTTP web server software (for LRMs and RLMs in the path),
>> and the site's final code layers that process the query string.
>>
>> Not trivial...
>>
>> Furthermore, we still have the same problem as above: that some documents
>> containing IRIs will bother to use LRMs and RLMs in them does not mean that
>> *all* documents will. (For example, it is difficult to imagine a user
>> manually typing an IRI into an e-mail with LRMs or RLMs.) And thus, users
>> will become used to seeing IRIs being displayed every which way, making
>> spoofing that much easier. It is not clear to me that allowing the use of
>> LRMs and RLMs in IRIs would reduce the problem or make it even larger.
>>
>> Aharon
>>
>> On Tue, May 25, 2010 at 3:10 AM, Mark Davis ☕ <mark@macchiato.com> wrote:
>>
>>> There has been some discussion of having a special ordering for BIDI URLs
>>> so that they are more understandable to users. (I'll use URL in the broad
>>> sense, as including non-ASCII characters.) This is a complicated issue, and
>>> I can't claim to have all the answers, but here are some thoughts on the
>>> issue.
>>>
>>> In the Unicode consortium, we've been aware of this issue, and have
>>> considered options a number of times over the years. However, we have not
>>> yet heard a good case for how supporting uniform field direction in URLs can
>>> be done without significant compatibility and security problems. There are
>>> some big stumbling blocks:
>>>
>>>    - Many clients that display URLs will either not be URL aware, or not
>>>    be aware of the latest standard, or not be able to parse out text as
>>>    definitively belonging to a URL.
>>>    - The specs have no termination criteria for parsing URLs in plain
>>>    text. So http://abc.def#ghi could be "http://abc.def#ghi" or could be
>>>    "http://abc.def#ghi could", since fragments can include spaces. (And
>>>    in languages that don't use spaces to separate words, this is further
>>>    complicated.) Different applications have different heuristics for this,
>>>    but those heuristics don't always agree.
>>>    - Many applications heuristically recognize fragment URLs, like
>>>    "google.com". So in a broad sense, people understand a URL as "something
>>>    that I could paste into an address bar in my browser and will get me to a
>>>    page", and have the expectation that they will order similarly. That is,
>>>    ordering "GOOGLE.COM" one way and "http://GOOGLE.COM" another would be
>>>    confusing.
>>>
>>> Why is ordering a problem? Suppose I have the URL http://ABC.DEF.
>>> Currently, any application that displays BIDI will do it as either
>>> http://FED.CBA (in an LTR environment) or FED.CBA://http in an RTL one. If
>>> an application starts to display it as http://CBA.FED, then it represents
>>> a significant security problem, since the user will think it is the
>>> different URL http://DEF.ABC. As long as there is a significant percentage
>>> of old applications, there will be the opportunity for that problem. The
>>> same goes for LTR URLs in an RTL environment.
>>>
>>> Moreover, if I paste text between applications, even where the paragraph
>>> direction is constant, then the labels can flip in arbitrary ways if some
>>> applications support uniform direction and some don't. The challenge is to
>>> get all applications to consistently (a) be URL aware, and (b) all switch to
>>> some new display order in unison. It might be that someone can come up with
>>> a way to handle this, but we haven't heard of one yet.
>>>
>>> (Had the importance of URL syntax been known at the time the consortium
>>> came up with the BIDI algorithm, and were the IRI syntax deterministic enough
>>> that the termination could always be recognized, even in the midst of plain
>>> text, we'd be in a different world.)
>>>
>>> But we're not. The best way to solve the problem that I can think of can be
>>> done right now. Any significant site that wants to support BIDI languages
>>> should provide for the ability to have IRIs with *all* RTL characters:
>>> host name, path, query, fragment. If all the pieces are RTL text (or infixed
>>> neutrals), then the display has a consistent direction in both RTL and LTR
>>> environments, no matter whether the application is URL-aware or not, and
>>> users won't be confused. Now that the TLD can be RTL, I think there will be
>>> pressure for the sites to do that, since completely-RTL IRIs will work much
>>> better in all environments.
>>>
>>> [The one real remaining piece is the scheme; the IRI is still
>>> understandable (though ugly) if it has to be ASCII, but it would be somewhat
>>> better if it could have an RTL alias. (Pure digit fields like IP addresses
>>> are a bit ugly, but seldom used.)]
>>>
>>> Another alternative would be to use a limited set of markup within URLs so
>>> as to preserve the right ordering. It would suffice to allow RTM and LTM
>>> characters around the neutral characters. Any BIDI URL could be normalized
>>> so as to include these characters in all and only the right places, by a
>>> compliant implementation. And once this was done, then the text can be cut
>>> and copied between applications with no change in appearance.
>>> However, one would have to come up with sufficient constraints on the use of
>>> these characters so as to prevent *their* being used for spoofing, and could
>>> have a problem with breakage on older implementations. (Although in a way,
>>> breaking is better than sending people to the wrong place.)
>>>
>>> Mark
>>>
>>> — Il meglio è l’inimico del bene —
>>>
>>>
>>
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
>
Received on Tuesday, 25 May 2010 18:14:48 UTC