RE: BIDI : tackling the delimiter weirdness

If there is any way that the IRI document itself could avoid recommending visual display methods, that would be better.  I think this is a topic that could really get in the way of quick progress.

What’s the minimum we need to actually accomplish in the main IRI document? It’s likely that the “best practice for display” would be non-normative in any case.

I’d also like to look for ways to avoid getting too deeply into the “spoofing” issues, since spoofing is a more general topic which also involves URIs that are all ASCII.

It may be that the best advice for IRIs is to note that opportunities for spoofing exist for URIs (MlCR0S0FT.COM) as well as IRIs, though they are much worse for IRIs, and to defer to a separate, lengthier discussion (in preparation) of how URIs and IRIs are displayed and how users reach conclusions by visual comparison.

Separating these topics would also allow evolution of “best practices”  based on the kind of usability studies Mark is proposing, while leaving the base protocol element definition alone.

Do you think this approach would be helpful in making more rapid progress on the core deliverables?


From: Mark Davis
Sent: Wednesday, January 27, 2010 6:06 PM
To: Shawn Steele
Cc: Slim Amamou
Subject: Re: BIDI : tackling the delimiter weirdness

Various people have considered a special reordering of labels before. The problem is that while in the address bar or other special locations one could have a special handling for the order of the labels, it is really bad if the labels aren't in the same order everywhere the URL could appear - the spoofing possibilities are unpleasant. And everywhere means in everyone's address bar in every browser, and in plaintext, and in emailers, and so on.


On Wed, Jan 27, 2010 at 15:44, Shawn Steele wrote:
I'm not sure that solves the problem.  Specifically with these examples:

  Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
  Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"

"Real users" seem to get confused by the HGF.EDC behavior, and instead expect the hierarchy to remain in a consistent direction, e.g.: http://ab.EDC.HGF/ij/kl/mn/op.html seems to be the expected behavior.  As far as I can tell, the swapping of the two labels in the hierarchy is not intuitive to those who don't understand the Unicode bidi algorithm.  There also seems to be little variation in users' expectations in this respect.  This conclusion comes from feedback from the Saudi gov't on IE's IDN behavior, and from some casual user feedback.  We have yet to conduct more formal usability testing.
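
To make the two orderings above concrete, here is a minimal Python sketch. It assumes the draft's example convention that uppercase ASCII stands for RTL letters and lowercase for LTR letters. It is not the Unicode bidi algorithm: it only covers letters and neutral delimiters in an LTR paragraph, rejects digits, and ignores mirroring. The function names are hypothetical.

```python
import re

# Assumption: uppercase ASCII = RTL letters, lowercase ASCII = LTR letters.
# An RTL span starts and ends on an RTL letter; delimiters caught between
# two RTL letters take the RTL direction and travel with the span.
_RTL_SPAN = re.compile(r"[A-Z](?:[^a-z]*[A-Z])?")
# A single RTL label: a run of RTL letters with no delimiter inside.
_RTL_LABEL = re.compile(r"[A-Z]+")


def _reject_digits(logical: str) -> None:
    # Digits need extra bidi rules that this sketch does not model.
    if any(ch.isdigit() for ch in logical):
        raise ValueError("digits are outside the scope of this sketch")


def bidi_style_order(logical: str) -> str:
    """Reverse each whole RTL span, delimiters included (the HGF.EDC result)."""
    _reject_digits(logical)
    return _RTL_SPAN.sub(lambda m: m.group(0)[::-1], logical)


def per_label_order(logical: str) -> str:
    """Reverse each RTL label in place, keeping label order (the EDC.HGF result)."""
    _reject_digits(logical)
    return _RTL_LABEL.sub(lambda m: m.group(0)[::-1], logical)


if __name__ == "__main__":
    url = "http://ab.CDE.FGH/ij/kl/mn/op.html"
    print(bidi_style_order(url))  # http://ab.HGF.EDC/ij/kl/mn/op.html
    print(per_label_order(url))   # http://ab.EDC.HGF/ij/kl/mn/op.html
```

The only difference between the two functions is whether the "." between adjacent RTL labels is reversed along with them, which is exactly the label swap that users reported as confusing.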

Furthermore, it isn't clear to me that users in a bidi context would really prefer the individual labels/elements to be represented with the hierarchy reading LTR.  This is less clear, though.  Specifically, it was suggested that the elements render RTL if they include RTL elements, e.g.: html.op/mn/kl/ij/HGF.CDE.ab//:http  -  As I said, this expectation seems less certain.  Furthermore, some users expressed a desire for even ASCII URLs to read in RTL order when displayed on an RTL machine with RTL UI, eg:

I don't think that it's appropriate for the WG as "engineers" to state what's best here; I believe we have biases and an understanding of computers not available to the average user.  I'd prefer a real usability study to determine the user expectations:

* Validate that the list/hierarchy model fits user expectations.
* Determine whether that list should be displayed in LTR or RTL when the list contains elements that are RTL.
* Determine if there are times when a general LTR or RTL directionality of the list elements is unexpected (e.g., all-ASCII, but on an RTL system).


Windows UX

-----Original Message-----
From: Slim Amamou
Sent: Wednesday, January 27, 2010 9:53 AM
Subject: BIDI : tackling the delimiter weirdness

Hello everybody,
congratulations on the WG.

Sometimes BIDI IRIs look really weird. For instance, the most advanced examples in section 4.4, beginning with example 5, are really confusing for an Arabic script reader like me. But I have had time to think about it since 2007, when the IDN wiki first started, and I think I have nailed the problem and have a proposal.

section 4.2.  Bidi IRI Structure
>   (...) some restrictions on bidirectional IRIs
>   are necessary.  These restrictions are given in terms of delimiters
>   (structural characters, mostly punctuation such as "@", ".", ":", and
>   "/") and components (usually consisting mostly of letters and
>   digits).

Delimiters are at the core of the issue. I suggest a more in-depth explanation of their usage in conjunction with components. For most IRI schemes, delimiters define a relationship between their left component and their right component. Most of the time this relationship is hierarchical.

ex. for http: the "/" defines a hierarchy between the path components, where A/B/C actually means: A includes B, which in turn includes C.
Note here that the inclusion relationship is *directional*: the left component includes the right component, and thus the "/" delimiter in the http: scheme has an LTR "directionality". It is this directionality that is broken by the examples in the IRI draft and that creates confusion.

Another ex.: in domain names, the "." delimiter also defines a hierarchy, but this time the directionality is RTL (the right component includes the left one).

I think the IRI draft should state that scheme definitions MUST define their delimiters' relationships and directionality. That would solve the problem.
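
As an illustration only, a per-scheme declaration of delimiter directionality might look like the Python sketch below. Everything in it is hypothetical (the `DelimiterRule` type, the `HTTP_RULES` table, the `hierarchy` helper); no IRI or URI specification defines such a structure. It simply encodes the two cases described above: "/" in an http: path, where the left component contains the right one, and "." in a domain name, where the right component contains the left one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelimiterRule:
    """Hypothetical record of what a scheme would declare for one delimiter."""
    delimiter: str
    container_side: str  # "left": left contains right; "right": right contains left


# Assumed rules for http:, taken from the two examples in this message.
HTTP_RULES = {
    "host": DelimiterRule(".", "right"),
    "path": DelimiterRule("/", "left"),
}


def hierarchy(component: str, rule: DelimiterRule) -> list[str]:
    """Return the parts ordered from outermost container to innermost item."""
    parts = [p for p in component.split(rule.delimiter) if p]
    return parts if rule.container_side == "left" else parts[::-1]


if __name__ == "__main__":
    print(hierarchy("ab.CDE.FGH", HTTP_RULES["host"]))        # ['FGH', 'CDE', 'ab']
    print(hierarchy("ij/kl/mn/op.html", HTTP_RULES["path"]))  # ['ij', 'kl', 'mn', 'op.html']
```

With such a declaration, a display layer could in principle choose an ordering for each component from its declared hierarchy rather than from the bidi class of its neighbours; whether that matches user expectations is a separate question.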

section 4.4.  Examples
> (...)
>   Example 5: Example 2, applied to components of different kinds:
>   Logical representation: ""
>   Visual representation: ""
>   The inversion of the domain name label and the path component may be
>   unexpected, but it is consistent with other bidi behavior.  For
>   reassurance that the domain component really is "", it may be
>   helpful to read aloud the visual representation following the bidi
>   algorithm.  After "" one reads the RTL block
>   "E-F-slash-G-H", which corresponds to the logical representation.
>   Example 6: Same as Example 5, with more rtl components:
>   Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
>   Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
>   The inversion of the domain name labels and the path components may
>   be easier to identify because the delimiters also move.
>   Example 7: A single rtl component includes digits:
>   Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
>   Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
>   Numbers are written ltr in all cases but are treated as an additional
>   embedding inside a run of rtl characters.  This is completely
>   consistent with usual bidirectional text.
>   Example 8 (not allowed): Numbers are at the start or end of an rtl
>   component:
>   Logical representation: ""
>   Visual representation: ""
>   The sequence "1/2" is interpreted by the bidi algorithm as a
>   fraction, fragmenting the components and leading to confusion.  There
>   are other characters that are interpreted in a special way close to
>   numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".
>   Example 9 (not allowed): The numbers in the previous example are
>   percent-encoded:
>   Logical representation: "",
>   Visual representation: ""
>   Example 10 (allowed but not recommended):
>   Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
>   Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
>   Components consisting of only numbers are allowed (it would be rather
>   difficult to prohibit them), but these may interact with adjacent
>   components in ways that are not easy to predict.
>   Example 11 (allowed but not recommended):
>   Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"
>   Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html"
>   Components consisting of numbers and left-to-right characters are
>   allowed, but these may interact with adjacent RTL components in ways
>   that are not easy to predict.
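
The special treatment near numbers mentioned in the quoted Example 8 comes from the bidi class the Unicode Character Database assigns to each character. A small Python sketch using the standard `unicodedata` module shows how to look those classes up for the delimiters discussed here; the helper name `bidi_classes` is only for this illustration, and the values printed depend on the Unicode version shipped with the Python build.

```python
import unicodedata


def bidi_classes(chars: str) -> dict[str, str]:
    """Map each character to its Unicode bidi class abbreviation."""
    return {ch: unicodedata.bidirectional(ch) for ch in chars}


if __name__ == "__main__":
    # Characters the draft lists as special near numbers, plus "/" and "@",
    # a digit, and an LTR letter for comparison.
    for ch, cls in bidi_classes("+-#$%,.:/@1a").items():
        print(repr(ch), cls)
```

In the output, digits report as European numbers (EN), while most of the listed punctuation reports as a number separator or terminator class (ES, CS, ET) rather than as a plain neutral, which is why these characters can attach to adjacent digits instead of acting as component boundaries.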

Slim Amamou | سليم عمامو

Received on Thursday, 28 January 2010 02:27:35 UTC