BIDI : tackling the delimiter weirdness

Hello everybody,
congratulations on the WG.

Sometimes BIDI IRIs look really weird. For instance, the more advanced
examples in section 4.4, beginning with example 5, are really
confusing for an Arabic script reader like me. I have had time to think
about this since 2007, when the IDN wiki first started, and I think I
have nailed the problem, so I am coming with a proposal.

http://www.ietf.org/id/draft-duerst-iri-bis-07.txt

section 4.2.  Bidi IRI Structure
>
>   (...) some restrictions on bidirectional IRIs
>   are necessary.  These restrictions are given in terms of delimiters
>   (structural characters, mostly punctuation such as "@", ".", ":", and
>   "/") and components (usually consisting mostly of letters and
>   digits).

Delimiters are at the core of the issue. I suggest a more in-depth
explanation of their usage in conjunction with components. For most
IRI schemes, a delimiter defines a relationship between the component
on its left and the component on its right. Most of the time this
relationship is hierarchical.

For example, in http: the "/" defines a hierarchy between the path
components: A/B/C actually means that A includes B, which in turn
includes C. Note that the inclusion relationship is *directional*: the
left component includes the right component, and thus the "/"
delimiter in the http: scheme has an LTR "directionality". It is this
directionality that is broken by the examples in the IRI draft, and
that is what creates the confusion.

Another example: in domain names, the "." delimiter also defines a
hierarchy, but this time the directionality is RTL (the right
component includes the left component).

I think the IRI draft should state that scheme definitions MUST define
the relationships and the directionality of their delimiters. That
would solve the problem.
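
To make the idea more concrete, here is a minimal sketch in Python of
what such a declaration could look like. All names in it
(DelimiterRule, RULES, containment_chain) are hypothetical and only
for illustration; nothing like this appears in the draft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelimiterRule:
    """Hypothetical per-delimiter declaration for a scheme."""
    delimiter: str
    relationship: str  # e.g. "hierarchy"
    direction: str     # "LTR": left includes right; "RTL": right includes left


# Assumed declarations for the two delimiters discussed above.
RULES = {
    "path": DelimiterRule("/", "hierarchy", "LTR"),
    "host": DelimiterRule(".", "hierarchy", "RTL"),
}


def containment_chain(text, rule):
    """Return the components ordered from outermost to innermost."""
    parts = [p for p in text.split(rule.delimiter) if p]
    return parts if rule.direction == "LTR" else parts[::-1]


print(containment_chain("A/B/C", RULES["path"]))
print(containment_chain("www.example.org", RULES["host"]))
```

With these declarations, "A/B/C" is read as A, then B, then C, while
"www.example.org" is read as org, then example, then www.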

section 4.4.  Examples
> (...)
>   Example 5: Example 2, applied to components of different kinds:
>   Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
>   Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
>   The inversion of the domain name label and the path component may be
>   unexpected, but it is consistent with other bidi behavior.  For
>   reassurance that the domain component really is "ab.cd.EF", it may be
>   helpful to read aloud the visual representation following the bidi
>   algorithm.  After "http://ab.cd." one reads the RTL block
>   "E-F-slash-G-H", which corresponds to the logical representation.
>
>   Example 6: Same as Example 5, with more rtl components:
>   Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
>   Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
>   The inversion of the domain name labels and the path components may
>   be easier to identify because the delimiters also move.
>
>   Example 7: A single rtl component includes digits:
>   Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
>   Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
>   Numbers are written ltr in all cases but are treated as an additional
>   embedding inside a run of rtl characters.  This is completely
>   consistent with usual bidirectional text.
>
>   Example 8 (not allowed): Numbers are at the start or end of an rtl
>   component:
>   Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
>   Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
>   The sequence "1/2" is interpreted by the bidi algorithm as a
>   fraction, fragmenting the components and leading to confusion.  There
>   are other characters that are interpreted in a special way close to
>   numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".
>
>   Example 9 (not allowed): The numbers in the previous example are
>   percent-encoded:
>   Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html",
>   Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html"
>
>   Example 10 (allowed but not recommended):
>   Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
>   Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
>   Components consisting of only numbers are allowed (it would be rather
>   difficult to prohibit them), but these may interact with adjacent RTL
>   components in ways that are not easy to predict.
>
>   Example 11 (allowed but not recommended):
>   Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"
>   Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html"
>   Components consisting of numbers and left-to-right characters are
>   allowed, but these may interact with adjacent RTL components in ways
>   that are not easy to predict.
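
To see how the delimiters get displaced in Examples 5 and 6, here is a
rough sketch in Python. It is NOT the Unicode bidi algorithm: it
assumes the draft's notation (uppercase stands for RTL letters,
lowercase for LTR letters), treats a delimiter as RTL only when it
sits between two RTL runs, and refuses input containing digits. The
function name visual_order is hypothetical.

```python
import re

# An RTL run: uppercase letters, optionally joined by delimiters that
# have uppercase letters on both sides.
RTL_RUN = re.compile(r"[A-Z]+(?:[^A-Za-z0-9]+[A-Z]+)*")


def visual_order(logical):
    """Simplified reordering; only meant for cases like Examples 5 and 6."""
    if any(ch.isdigit() for ch in logical):
        raise ValueError("digits are not handled by this simplified sketch")
    # Reverse each RTL run, delimiters included.
    return RTL_RUN.sub(lambda m: m.group(0)[::-1], logical)


logical = "http://ab.CD.EF/GH/IJ/kl.html"
visual = visual_order(logical)
print(visual)

# What sits between "//" and the next "/" in each representation:
print(logical.split("/")[2])
print(visual.split("/")[2])
```

For Example 6, the text between "//" and the first "/" is "ab.CD.EF"
in the logical representation but "ab.JI" in the visual one, so the
"/" no longer marks where the domain name ends and the path begins.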


--
Slim Amamou | سليم عمامو
http://alixsys.com

Received on Wednesday, 27 January 2010 17:53:54 UTC