Special ordering for BIDI URLs from Mark Davis ☕ on 2010-05-25 (public-iri@w3.org from May 2010)

From: Mark Davis ☕ <mark@macchiato.com>
Date: Mon, 24 May 2010 17:10:53 -0700
To: public-iri@w3.org, bidi@unicode.org, Shawn Steele <Shawn.Steele@microsoft.com>, Murray Sargent <murrays@exchange.microsoft.com>, aharon@google.com
Message-ID: <AANLkTimzIMbwCX4oOOQ2pz2wR5JWKk6AGboGgbn5Swc3@mail.gmail.com>
There has been some discussion of having a special ordering for BIDI URLs so
that they are more understandable to users. (I'll use URL in the broad
sense, as including non-ASCII characters.) This is a complicated issue, and
I can't claim to have all the answers, but here are some thoughts on the
issue.

In the Unicode consortium, we've been aware of this issue, and have
considered options a number of times over the years. However, we have not
yet heard a good case for how supporting uniform field direction in URLs can
be done without significant compatibility and security problems. There are
some big stumbling blocks:

   - Many clients that display URLs will either not be URL aware, or not be
   aware of the latest standard, or not be able to parse out text as
   definitively belonging to a URL.
   - The specs have no termination criteria for parsing URLs in plain
   text. So the text "http://abc.def#ghi could" might contain the URL
   "http://abc.def#ghi" or the URL "http://abc.def#ghi could", since
   fragments can include spaces. (And in languages that don't use spaces to
   separate words, this is further complicated.) Different applications have
   different heuristics for this, but those heuristics don't always agree.
   - Many applications heuristically recognize partial URLs, like
   "google.com". So in a broad sense, people understand a URL as "something
   that I could paste into an address bar in my browser and will get me to a
   page", and have the expectation that all such strings will be ordered
   similarly. That is, ordering "GOOGLE.COM" one way and "http://GOOGLE.COM"
   another would be confusing.
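
To make the termination point concrete, here is a minimal sketch of two
URL-detection heuristics that disagree on the same plain text. Both regular
expressions and both function names are hypothetical, invented for this
illustration; neither is taken from any spec or real application.

```python
import re

# Hypothetical heuristic A: a URL ends at the first whitespace character.
STOP_AT_WHITESPACE = re.compile(r"https?://\S+")

# Hypothetical heuristic B: spaces are allowed inside the fragment, which
# runs until a full stop or the end of the line.
SPACES_IN_FRAGMENT = re.compile(r"https?://[^\s#]+(?:#[^.\n]*)?")


def stop_at_whitespace(text):
    """Return the first URL found by heuristic A, or None."""
    match = STOP_AT_WHITESPACE.search(text)
    return match.group(0) if match else None


def allow_spaces_in_fragment(text):
    """Return the first URL found by heuristic B, or None."""
    match = SPACES_IN_FRAGMENT.search(text)
    return match.group(0) if match else None


text = "Visit http://abc.def#ghi could"
print(repr(stop_at_whitespace(text)))        # 'http://abc.def#ghi'
print(repr(allow_spaces_in_fragment(text)))  # 'http://abc.def#ghi could'
```

Two applications using these two rules would disagree about where the URL
ends, and so would apply any URL-specific ordering to different spans of text.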

Why is ordering a problem? Suppose I have the URL http://ABC.DEF, where
uppercase stands for RTL letters. Currently, any application that displays
BIDI will do it as either http://FED.CBA (in an LTR environment) or
FED.CBA://http in an RTL one. If an application starts to display it as
http://CBA.FED, then it represents a significant security problem, since the
user will think it is the different URL http://DEF.ABC. As long as there is
a significant percentage of old applications, there will be the opportunity
for that problem. The same goes for LTR URLs in an RTL environment.
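
The spoofing risk can be sketched with a toy model. This is a heavy
simplification and not the Unicode BIDI algorithm: it assumes uppercase means
an RTL letter, lowercase means an LTR letter, everything else is neutral, and
it ignores digits, explicit embeddings and mirroring. The function names,
including uniform_display for the hypothetical "uniform field order", are
made up for this example.

```python
def bidi_class(ch):
    """Toy classes: uppercase = RTL letter, lowercase = LTR letter."""
    if ch.isupper():
        return "R"
    if ch.islower():
        return "L"
    return "N"  # neutral, e.g. ':', '/', '.'


def resolve(text, base):
    """Neutrals take their neighbours' direction if both agree, else the base."""
    classes = [bidi_class(c) for c in text]
    resolved = list(classes)
    i = 0
    while i < len(classes):
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < len(classes) and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else base
        after = classes[j] if j < len(classes) else base
        fill = before if before == after else base
        for k in range(i, j):
            resolved[k] = fill
        i = j
    return resolved


def display(text, base):
    """Visual left-to-right order of `text` in a paragraph of direction `base`."""
    if base == "L":
        levels = [0 if d == "L" else 1 for d in resolve(text, base)]
    else:
        levels = [1 if d == "R" else 2 for d in resolve(text, base)]
    chars = list(text)
    # Reverse runs from the highest level down to level 1.
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)


def uniform_display(url):
    """Hypothetical uniform order: labels stay left to right, RTL labels reversed."""
    scheme, _, host = url.partition("://")
    labels = [lab[::-1] if lab.isupper() else lab for lab in host.split(".")]
    return scheme + "://" + ".".join(labels)


print(display("http://ABC.DEF", "L"))     # http://FED.CBA
print(display("http://ABC.DEF", "R"))     # FED.CBA//:http
print(uniform_display("http://ABC.DEF"))  # http://CBA.FED
print(display("http://DEF.ABC", "L"))     # http://CBA.FED
```

The last two lines are the problem: the uniform display of http://ABC.DEF is
the same string that an older application shows for http://DEF.ABC. (The toy
model reverses the "://" separator character by character in the RTL case, so
it prints "//:".)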

Moreover, if I paste text between applications, even where the paragraph
direction is constant, then the labels can flip in arbitrary ways if some
applications support uniform direction and some don't. The challenge is to
get all applications to consistently (a) be URL aware, and (b) all switch to
some new display order in unison. It might be that someone can come up with
a way to handle this, but we haven't heard of one yet.

(Had the importance of URL syntax been known at the time the consortium came
up with the BIDI algorithm, and were the IRI syntax determinate enough that
the termination could always be recognized, even in the midst of plain text,
we'd be in a different world.)

But we're not. The best way to solve the problem that I can think of can be
done right now. Any significant site that wants to support BIDI languages
should provide for the ability to have IRIs with *all* RTL characters: host
name, path, query, fragment. If all the pieces are RTL text (or infixed
neutrals), then the display has a consistent direction in both RTL and LTR
environments, no matter whether the application is URL-aware or not, and
users won't be confused. Now that the TLD can be RTL, I think there will be
pressure for the sites to do that, since completely-RTL IRIs will work much
better in all environments.

[The one real remaining piece is the scheme; the IRI is still understandable
(though ugly) if it has to be ASCII, but it would be somewhat better if it
could have a RTL alias.  (Pure digit fields like IP addresses are a bit
ugly, but seldom used.)]
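
As a rough illustration of what "all RTL" could mean in practice, the sketch
below checks that everything after the scheme contains RTL strong characters
and no LTR strong characters. It uses Python's standard unicodedata module;
the function names and the decision to skip the scheme are assumptions made
for this example, not part of any IRI specification.

```python
import unicodedata

STRONG = {"L", "R", "AL"}  # strong bidirectional classes


def strong_classes(text):
    """Return the set of strong bidi classes that occur in `text`."""
    return {unicodedata.bidirectional(c) for c in text} & STRONG


def is_all_rtl(iri):
    """True if the part after the scheme has RTL letters and no LTR letters.

    Neutrals and digits are allowed; the ASCII scheme is ignored (assumption).
    """
    scheme, sep, rest = iri.partition("://")
    body = rest if sep else iri
    classes = strong_classes(body)
    return bool(classes) and "L" not in classes


# Hebrew letters in host and path: uniformly RTL apart from the scheme.
print(is_all_rtl("http://\u05d0\u05d1\u05d2.\u05d3\u05d4\u05d5/\u05d6\u05d7"))  # True
# RTL host label with an ASCII TLD: mixed direction.
print(is_all_rtl("http://\u05d0\u05d1\u05d2.com"))  # False
```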

Another alternative would be to use a limited set of markup within URLs so
as to preserve the right ordering. It would suffice to allow RLM and LRM
characters around the neutral characters. Any BIDI URL could be normalized
by a compliant implementation so as to include these characters in all and
only the right places. And once this was done, the text could be cut and
pasted between applications with no change in appearance.
However, one would have to come up with sufficient constraints on the use of
these characters so as to prevent *their* being used for spoofing, and there
could be a problem with breakage on older implementations. (Although in a
way, breaking is better than sending people to the wrong place.)
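
One possible reading of such a normalization, purely as a sketch: remove any
directional marks already present, then wrap each run of separators in LRM so
that the separator resolves as left-to-right between its neighbours. The
separator set, the choice of LRM rather than RLM, and the function names are
all assumptions; which placement actually gives the desired visual order, and
which constraints prevent spoofing, are left open here.

```python
import re

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Assumed set of neutral field separators; a real rule would need a spec.
SEPARATOR_RUN = re.compile(r"[.:/?#]+")


def strip_marks(url):
    """Remove every LRM and RLM, wherever it was placed."""
    return url.replace(LRM, "").replace(RLM, "")


def normalize(url):
    """Put marks in all and only the chosen places: around separator runs."""
    bare = strip_marks(url)
    return SEPARATOR_RUN.sub(lambda m: LRM + m.group(0) + LRM, bare)


clean = normalize("http://ABC.DEF")
# A mark smuggled in elsewhere does not survive normalization.
tampered = normalize("http://ABC" + RLM + ".DEF")
print(clean == tampered)                       # True
print(strip_marks(clean) == "http://ABC.DEF")  # True
```

Because normalization starts by stripping, it is idempotent, and two URLs that
differ only in stray marks normalize to the same string.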

Mark

— Il meglio è l’inimico del bene —
Received on Tuesday, 25 May 2010 00:11:31 UTC