
RE: [bidi] Special ordering for BIDI URLs

From: Shawn Steele <Shawn.Steele@microsoft.com>
Date: Wed, 26 May 2010 15:08:13 +0000
To: Adil Allawi <adil@diwan.com>, Mark Davis ☕ <mark@macchiato.com>
CC: "public-iri@w3.org" <public-iri@w3.org>, "bidi@unicode.org" <bidi@unicode.org>, Murray Sargent <murrays@exchange.microsoft.com>, "aharon@google.com" <aharon@google.com>, Nasser Kettani <Nasser.Kettani@microsoft.com>
Message-ID: <E14011F8737B524BB564B05FF748464A0D9E71B0@TK5EX14MBXC139.redmond.corp.microsoft.com>
I think the behavior expresses my desired output, but I don't think you can do it with new characters :)  Way too much would break.  (Like DNS, which invented punycode to work around updating all the servers to UTF-8 :))

-Shawn
________________________________
From: Adil Allawi [adil@diwan.com]
Sent: Wednesday, May 26, 2010 1:20 AM
To: Mark Davis ☕
Cc: public-iri@w3.org; bidi@unicode.org; Shawn Steele; Murray Sargent; aharon@google.com
Subject: Re: [bidi] Special ordering for BIDI URLs

Hi Mark,

I have been thinking about this issue for a while. I have a proposal that has been inspired by Aharon's <bdi> tag and the work that I have done on encoding the Arabic mathematical symbols.

My suggestion is as follows:

1. Unicode defines new characters that match the common IRI delimiters, namely, from RFC 3987:

   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   and "."


2. These characters would have unique names, properties and a unique Unicode Bidi class - IRI Separator.

3. The Unicode Bidi algorithm would be extended to have special processing for IRI Separator class.

4. The IRI delimiters will be required any time that an rtl character is encoded in an IRI - or, if that is too limiting, any time there is mixed rtl, ltr or number characters in an IRI. This will be a requirement that will be enforced by several systems used in processing IRIs to prevent spoofing with ASCII characters - e.g. name servers, HTTP servers, Browsers, etc.

5. For older systems the IRI separators would fall back to their ASCII equivalents when converted to URIs - but only for the purposes of compatibility not display.

6. The Unicode Bidi algorithm will treat the IRI Separator class as a block separator, or a separate embedding level. So groups of characters that are between IRI Separators will order separately from the IRI Separator and the IRI Separator/character groups will be ordered in the main direction of the line. In this way it is possible to have IRIs rendered right-to-left or left-to-right depending on the reading direction of the human viewing it. There would also need to be special processing for the bracket delimiters.

So for the following example IRI (capitals are rtl) that is in logical order:

http://ABC.def#GHI


will be processed as the following sequence of bidi runs:

<"http"><":"><"/"><"/"><"ABC"><"."><"def"><"#"><"GHI">

and rendered LTR as:

http://CBA.def#IHG


and RTL as:

IHG#def.CBA//:http
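
The run-splitting and the two renderings above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the proposal: it uses the plain ASCII delimiters in place of the new characters that would have to be encoded, it keeps the convention that capitals stand for rtl letters, and the names split_runs, is_rtl and render are made up for the example.

```python
import re

# Delimiters named in the proposal: RFC 3987 gen-delims, sub-delims, and ".".
# ASCII stand-ins are used here; the proposal would encode new characters.
DELIMS = ":/?#[]@!$&'()*+,;=."
_SPLIT = re.compile("([" + re.escape(DELIMS) + "])")


def split_runs(iri):
    """Split a logical-order IRI into runs, one run per delimiter or field."""
    return [part for part in _SPLIT.split(iri) if part]


def is_rtl(run):
    """Toy convention from the example: capitals stand in for rtl letters."""
    return run.isupper()


def render(iri, base="ltr"):
    """Return the visual order (left to right on screen) of the IRI.

    Each rtl field is reversed on its own; the sequence of runs follows
    the base direction of the line.
    """
    shown = [run[::-1] if is_rtl(run) else run for run in split_runs(iri)]
    if base == "rtl":
        shown.reverse()
    return "".join(shown)


print(split_runs("http://ABC.def#GHI"))
print(render("http://ABC.def#GHI", "ltr"))  # http://CBA.def#IHG
print(render("http://ABC.def#GHI", "rtl"))  # IHG#def.CBA//:http
```

The special handling for bracket delimiters mentioned in point 6 is not modelled here.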

As the separators would have unique Unicode values there would be no ambiguity with the bidi ordering, and the use of the ASCII delimiters can be prevented. If the program displaying this IRI does not have a Bidi algorithm that can handle IRI-Separator characters, then the characters would default to being strong LTR, which should still generate acceptable results. I think this proposal would also negate the need for the special bidi processing that RFC 3987 recommends.

I generally believe that new characters should not be encoded where alternatives exist, and that more classes should not be added to the Unicode Bidi Algorithm, but this is a special case. There are serious security issues and this would correct a universal problem for the Internet.

This follows the precedent of the Arabic maths symbols, where characters have been encoded even though there is a visual equivalent elsewhere in Unicode, because they carry a different meaning. There are other examples in Unicode, e.g. U+060D ARABIC DATE SEPARATOR.

I have not discussed this idea before, so I wait for this proposal to be generally shot down for missing some blindingly obvious problem.

regards

Adil Allawi

On 25/05/2010 01:10, Mark Davis ☕ wrote:
There has been some discussion of having a special ordering for BIDI URLs so that they are more understandable to users. (I'll use URL in the broad sense, as including non-ASCII characters.) This is a complicated issue, and I can't claim to have all the answers, but here are some thoughts on the issue.

In the Unicode consortium, we've been aware of this issue, and have considered options a number of times over the years. However, we have not yet heard a good case for how supporting uniform field direction in URLs can be done without significant compatibility and security problems. There are some big stumbling blocks:

  *   Many clients that display URLs will either not be URL aware, or not be aware of the latest standard, or not be able to parse out text as definitively belonging to a URL.
  *   The specs have no termination criteria for parsing URLs in plain text. So http://abc.def#ghi could be "http://abc.def#ghi" or could be "http://abc.def#ghi could", since fragments can include spaces. (And in languages that don't use spaces to separate words, this is further complicated.) Different applications have different heuristics for this, but those heuristics don't always agree.
  *   Many applications heuristically recognize fragment URLs, like "google.com". So in a broad sense, people understand a URL as "something that I could paste into an address bar in my browser and will get me to a page", and have the expectation that they will order similarly. That is, ordering "GOOGLE.COM" one way and "http://GOOGLE.COM" another would be confusing.
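
The termination problem in the second bullet can be shown with two made-up heuristics that disagree on the same text. Both regular expressions are assumptions for illustration only; neither is taken from any spec or real application.

```python
import re

# Heuristic A: a URL ends at the first whitespace character.
STOP_AT_SPACE = re.compile(r"https?://\S+")

# Heuristic B (hypothetical): a fragment may contain spaces and runs on
# until sentence punctuation or a line break.
FRAGMENT_WITH_SPACES = re.compile(r"https?://[^\s#]+(?:#[^.,;!?\n]*)?")


def find_url(pattern, text):
    """Return the first URL the given heuristic finds, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


text = "Try http://abc.def#ghi could, or not."
url_a = find_url(STOP_AT_SPACE, text)
url_b = find_url(FRAGMENT_WITH_SPACES, text)
print(repr(url_a))  # 'http://abc.def#ghi'
print(repr(url_b))  # 'http://abc.def#ghi could'
```

Two applications using these two rules would apply any URL-specific ordering to different stretches of the same text.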

Why is ordering a problem? Suppose I have the URL http://ABC.DEF. Currently, any application that displays BIDI will do it as either http://FED.CBA (in a LTR environment) or FED.CBA://http in a RTL one. If an application starts to display it as http://CBA.FED, then it represents a significant security problem, since the user will think it is the different URL http://DEF.ABC. As long as there is a significant percentage of old applications, there will be the opportunity for that problem. The same goes for LTR URLs in a RTL environment.
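
A toy model makes the collision concrete for the LTR case. It is a deliberate simplification of the real BIDI algorithm, assuming capitals stand for RTL letters and that punctuation sandwiched between two RTL stretches joins them into one RTL run; the function names are invented for the example.

```python
import re


def plain_bidi_ltr(text):
    """Toy model of today's display in an LTR paragraph.

    A stretch of capitals, together with any punctuation sandwiched
    between capitals, is treated as a single RTL run and reversed whole.
    """
    rtl_run = r"[A-Z]+(?:[^A-Za-z0-9]+[A-Z]+)*"
    return re.sub(rtl_run, lambda m: m.group(0)[::-1], text)


def uniform_fields_ltr(text):
    """Hypothetical 'uniform field direction' display: each label is
    reversed on its own and the labels keep their logical order."""
    return re.sub(r"[A-Z]+", lambda m: m.group(0)[::-1], text)


old_app = plain_bidi_ltr("http://ABC.DEF")
new_app = uniform_fields_ltr("http://ABC.DEF")
print(old_app)  # http://FED.CBA
print(new_app)  # http://CBA.FED

# What a new application shows for ABC.DEF is exactly what an old
# application shows for the different URL DEF.ABC.
print(new_app == plain_bidi_ltr("http://DEF.ABC"))  # True
```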

Moreover, if I paste text between applications, even where the paragraph direction is constant, then the labels can flip in arbitrary ways if some applications support uniform direction and some don't. The challenge is to get all applications to consistently (a) be URL aware, and (b) all switch to some new display order in unison. It might be that someone can come up with a way to handle this, but we haven't heard of one yet.

(Had the importance of URL syntax been known at the time the consortium came up with the BIDI algorithm, and were the IRI syntax determinate enough that the termination could always be recognized, even in the midst of plain text, we'd be in a different world.)

But we're not. The best way to solve the problem that I can think of can be done right now. Any significant site that wants to support BIDI languages should provide for the ability to have IRIs with all RTL characters: host name, path, query, fragment. If all the pieces are RTL text (or infixed neutrals), then the display has a consistent direction in both RTL and LTR environments, no matter whether the application is URL-aware or not, and users won't be confused. Now that the TLD can be RTL, I think there will be pressure for the sites to do that, since completely-RTL IRIs will work much better in all environments.

[The one real remaining piece is the scheme; the IRI is still understandable (though ugly) if it has to be ASCII, but it would be somewhat better if it could have a RTL alias.  (Pure digit fields like IP addresses are a bit ugly, but seldom used.)]

Another alternative would be to use a limited set of markup within URLs so as to preserve the right ordering. It would suffice to allow RLM and LRM characters around the neutral characters. Any BIDI URL could be normalized so as to include these characters in all and only the right places, by a compliant implementation. And once this was done, then the text can be cut and copied between applications with no change in appearance.
However, one would need to come up with sufficient constraints on the use of these characters so as to prevent their being used for spoofing, and could have a problem with breakage on older implementations. (Although in a way, breaking is better than sending people to the wrong place.)
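
As a rough sketch of what such a normalization could look like: the code below wraps every run of delimiter characters in a directional mark, LRM (U+200E) by default. The placement rule, the delimiter list and the function names add_marks and strip_marks are all assumptions made for illustration; which placement is actually "right", and which constraints would prevent spoofing, is left open above.

```python
import re

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Assumed set of neutral delimiters (RFC 3987 gen-delims, sub-delims, ".").
DELIMS = ":/?#[]@!$&'()*+,;=."
_DELIM_RUN = re.compile("[" + re.escape(DELIMS) + "]+")


def strip_marks(url):
    """Remove any LRM/RLM characters, recovering the bare URL."""
    return url.replace(LRM, "").replace(RLM, "")


def add_marks(url, mark=LRM):
    """Wrap each run of delimiters in the given directional mark.

    Existing marks are stripped first, so applying the function twice
    gives the same result as applying it once.
    """
    bare = strip_marks(url)
    return _DELIM_RUN.sub(lambda m: mark + m.group(0) + mark, bare)


marked = add_marks("http://ABC.DEF")
print(ascii(marked))
print(strip_marks(marked))  # http://ABC.DEF
```

Stripping the marks gives back the original string, which is what an older implementation or a comparison step would presumably need to do before resolving the URL.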

Mark

— Il meglio è l’inimico del bene —
Received on Wednesday, 26 May 2010 15:11:17 GMT
