Re: BIDI?

Mark

*— Il meglio è l’inimico del bene —*


On Wed, Apr 27, 2011 at 01:38, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
>
> [Please note that Mark's document, as far as the IETF is concerned, does
> not have any standing at all. Sending the actual text to the mailing list,
> or submitting the document as an Internet-Draft would fix that.]


Here you go.

BIDI URL Display

M. Davis

Rough Draft 2011-04-26

Contents

   - BIDI URL Display
   - Recognizing IRIs in Plaintext
   - Open issues
   - Applying the BIDI algorithm
   - ELM

The Unicode Bidirectional Algorithm (UBA) was designed for handling normal
text, and predated the rise of the web. Unfortunately, URLs* are not normal
text; they are syntactically complex in ways that don’t work well with the
UBA. That causes URLs to appear jumbled to bidi users (Arabic/Hebrew/…).
[*Formally speaking, what we are talking about are IRIs, although most end
users know them as “URLs”; see idn-and-iri
<http://www.w3.org/International/articles/idn-and-iri/>.]

People have been looking for an extension to the Unicode Bidirectional
Algorithm (UBA) that handles IRIs in a more consistent way for bidi users.
The general goal would be for the “fields” of an IRI to flow in a consistent
direction.

This requires consistency in usage across different applications. For
example, when someone copies the contents of an address bar into an email,
we don’t want all the fields in the URL to switch around. Such consistency
would require a general extension to the UBA to indicate how IRIs should be
displayed, both in the limited context of an address bar and in other
contexts.

The challenges are:

   1. There will be a long migration period, so the negative effects during
   that period need to be mitigated as much as possible.
   2. For the purpose of the UBA, there needs to be a simple, comprehensible
   way to recognize IRIs in plaintext.


Recognizing IRIs in Plaintext

Here are rough thoughts about the latter task (recognizing IRIs in
plaintext) for use with the UBA.

There are two problems:

   1. A formal reading of the spec allows almost anything in fields, so it
   is hard to test for termination.
   2. The URL may not be headed by a scheme in plaintext (e.g., google.com
   should be recognized).
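
For the second problem, one possible check is to require that a scheme-less
candidate ends in a known TLD, which is what the domain2 production further
down does. The following is only a minimal sketch under stated assumptions:
the function name is hypothetical, KNOWN_TLDS is a tiny illustrative sample
rather than the IANA list, and labels are not validated against UTS46.

```python
# Separators treated as IDNSep in the draft: U+002E, U+FF0E, U+3002, U+FF61.
IDN_SEPS = '.\uFF0E\u3002\uFF61'

# Hypothetical sample only; the draft refers to the IANA root zone database.
KNOWN_TLDS = {'com', 'org', 'net'}

def looks_like_schemeless_domain(candidate):
    """Return True if candidate has the shape: domain IDNSep TLD IDNSep?"""
    # Map every IDNSep variant to a plain full stop.
    for sep in IDN_SEPS[1:]:
        candidate = candidate.replace(sep, '.')
    # Allow one optional trailing separator.
    if candidate.endswith('.'):
        candidate = candidate[:-1]
    labels = candidate.split('.')
    # Need at least one label before the TLD, and no empty labels.
    if len(labels) < 2 or any(not label for label in labels):
        return False
    return labels[-1].lower() in KNOWN_TLDS
```

With this sample set, "google.com" and "google.com." are accepted, while
"example" and "foo.bar" are not.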



While in theory almost anything can occur in fields, in practice many
Unicode characters never or very rarely occur in these contexts. So one
approach is to have a simplified syntax that could be easily recognized. Any
characters that were legal but outside of that syntax would need to be
represented with % escapes if they are to be handled by the UBA.

bidiIri := ((scheme “://” domain) | domain2) (“/” path)? (“?” query)? (“#”
fragment)?

domain := UTS46Chars+ (IDNSep UTS46Chars+)* IDNSep?

domain2 := domain IDNSep TLD IDNSep?

path := (char - “?” - “#”)*

query := (char - “#”)*

fragment := char*

IDNSep := [\u002E \uFF0E \u3002 \uFF61] // see
http://unicode.org/reports/tr46/#Notation

TLD := <list on http://www.iana.org/domains/root/db/>

char := percentEncodedUTF8

         | [[:L:][:N:][:M:][:S:][:Pd:][:Pc:][:Cf:] inclusionChar -
exclusionChar]

inclusionChar :=

U+0021 <http://unicode.org/cldr/utility/character.jsp?a=0021> ( ! )
EXCLAMATION MARK

U+0022 <http://unicode.org/cldr/utility/character.jsp?a=0022> ( " )
QUOTATION MARK

U+0023 <http://unicode.org/cldr/utility/character.jsp?a=0023> ( # ) NUMBER
SIGN

U+0025 <http://unicode.org/cldr/utility/character.jsp?a=0025> ( % ) PERCENT
SIGN

U+0026 <http://unicode.org/cldr/utility/character.jsp?a=0026> ( & )
AMPERSAND

U+0027 <http://unicode.org/cldr/utility/character.jsp?a=0027> ( ' )
APOSTROPHE

U+002A <http://unicode.org/cldr/utility/character.jsp?a=002A> ( * ) ASTERISK

U+002C <http://unicode.org/cldr/utility/character.jsp?a=002C> ( , ) COMMA

U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E> ( . ) FULL
STOP

U+002F <http://unicode.org/cldr/utility/character.jsp?a=002F> ( / ) SOLIDUS

U+003A <http://unicode.org/cldr/utility/character.jsp?a=003A> ( : ) COLON

U+003B <http://unicode.org/cldr/utility/character.jsp?a=003B> ( ; )
SEMICOLON

U+003F <http://unicode.org/cldr/utility/character.jsp?a=003F> ( ? ) QUESTION
MARK

U+0040 <http://unicode.org/cldr/utility/character.jsp?a=0040> ( @ )
COMMERCIAL AT

U+005C <http://unicode.org/cldr/utility/character.jsp?a=005C> ( \ ) REVERSE
SOLIDUS

U+00A1 <http://unicode.org/cldr/utility/character.jsp?a=00A1> ( ¡ ) INVERTED
EXCLAMATION MARK

U+00B7 <http://unicode.org/cldr/utility/character.jsp?a=00B7> ( · ) MIDDLE
DOT

U+00BF <http://unicode.org/cldr/utility/character.jsp?a=00BF> ( ¿ ) INVERTED
QUESTION MARK

exclusionChar :=

U+003C <http://unicode.org/cldr/utility/character.jsp?a=003C> ( < )
LESS-THAN SIGN

U+003E <http://unicode.org/cldr/utility/character.jsp?a=003E> ( > )
GREATER-THAN SIGN

In addition, a final inclusionChar is excluded from the bidiIri when parsing
(it is possible to capture this with the BNF, but it makes the formulation
much less readable).
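
As a rough illustration of the char production and the final-inclusionChar
rule, here is a minimal Python sketch. The function names are hypothetical,
the general categories come from whatever Unicode version Python's
unicodedata module ships, and percent escapes are not validated (they pass
character by character, since "%" is an inclusionChar and hex digits are
letters or numbers).

```python
import unicodedata

# inclusionChar and exclusionChar, copied from the lists above.
INCLUSION_CHARS = set('!"#%&\'*,./:;?@\\\u00A1\u00B7\u00BF')
EXCLUSION_CHARS = set('<>')

# Categories admitted by `char`: all of L, N, M, S, plus Pd, Pc, Cf.
ALLOWED_MAJOR = {'L', 'N', 'M', 'S'}
ALLOWED_EXACT = {'Pd', 'Pc', 'Cf'}

def is_bidi_iri_char(ch):
    """Approximate the `char` production for a single code point."""
    if ch in EXCLUSION_CHARS:
        return False
    if ch in INCLUSION_CHARS:
        return True
    cat = unicodedata.category(ch)
    return cat[0] in ALLOWED_MAJOR or cat in ALLOWED_EXACT

def scan_bidi_iri_end(text, start=0):
    """Return the index just past the candidate bidiIri beginning at start."""
    end = start
    while end < len(text) and is_bidi_iri_char(text[end]):
        end += 1
    # Drop one final inclusionChar, e.g. the full stop that ends a sentence.
    if end > start and text[end - 1] in INCLUSION_CHARS:
        end -= 1
    return end
```

For example, scanning "See http://example.com/a?b=c. Thanks" from index 4
stops before the sentence-final full stop, so the candidate is
"http://example.com/a?b=c". The sketch only finds where a candidate ends; it
does not check that the candidate matches the bidiIri production.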

For the full IRI syntax, see:
http://rfc-ref.org/RFC-TEXTS/3987/chapter2.html

   - TBD: add userinfo, port, IP, etc.
   - TBD: include some statistics on the character usage in URLs.


Termination

Thus the bidi IRI basically terminates with:

   1. Unassigned, surrogates, private-use, control codes
   2. Whitespace
   3. Open, close or most ‘other’ punctuation.



For ASCII and Latin1, the items in #3 are:

U+003C <http://unicode.org/cldr/utility/character.jsp?a=003C> ( < )
LESS-THAN SIGN

U+003E <http://unicode.org/cldr/utility/character.jsp?a=003E> ( > )
GREATER-THAN SIGN

U+0028 <http://unicode.org/cldr/utility/character.jsp?a=0028> ( ( ) LEFT
PARENTHESIS

U+0029 <http://unicode.org/cldr/utility/character.jsp?a=0029> ( ) ) RIGHT
PARENTHESIS

U+005B <http://unicode.org/cldr/utility/character.jsp?a=005B> ( [ ) LEFT
SQUARE BRACKET

U+005D <http://unicode.org/cldr/utility/character.jsp?a=005D> ( ] ) RIGHT
SQUARE BRACKET

U+007B <http://unicode.org/cldr/utility/character.jsp?a=007B> ( { ) LEFT
CURLY BRACKET

U+007D <http://unicode.org/cldr/utility/character.jsp?a=007D> ( } ) RIGHT
CURLY BRACKET

U+00AB <http://unicode.org/cldr/utility/character.jsp?a=00AB> ( « )
LEFT-POINTING DOUBLE ANGLE QUOTATION MARK

U+00BB <http://unicode.org/cldr/utility/character.jsp?a=00BB> ( » )
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

Open issues:

   - Should [:Cf:] (or some of them) also require percent encoding?
   - The inclusionChars are from ASCII/Latin1. Do we want to tweak any of
   the ASCII/Latin1 exceptions to move them to terminating characters?


Applying the BIDI algorithm

Here’s how we’d apply the bidi algorithm extension (this is just a strawman,
and needs more discussion).

Given a bidiIri:

A separator is defined as any instance of the quoted strings in the bidiIri
line above, or:

   - IDNSep in a domain
   - “/” in a path
   - “=” or “&” in a query



A field is the text between two separators, or between a separator and the
start or end of the bidiIri.

If a bidiIri is recognized, then it is handled by the UBA as if each field
is surrounded by LRM.
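
The field and separator definitions can be sketched as follows. This is only
an emulation under assumptions: it inserts literal U+200E characters around
each field, whereas the strawman only says the UBA treats fields as if they
were surrounded by LRM. The regex decomposition and the function names are
hypothetical, and userinfo, port, and IP literals are not handled.

```python
import re

LRM = '\u200E'
IDN_SEPS = '.\uFF0E\u3002\uFF61'

# Hypothetical decomposition mirroring the bidiIri production.
BIDI_IRI = re.compile(
    r'^(?:(?P<scheme>[A-Za-z][A-Za-z0-9+.\-]*)://)?'
    r'(?P<domain>[^/?#]+)'
    r'(?:/(?P<path>[^?#]*))?'
    r'(?:\?(?P<query>[^#]*))?'
    r'(?:#(?P<fragment>.*))?$')

def wrap_fields(text, seps):
    """Wrap each non-empty field of text in LRM, keeping the separators."""
    parts = re.split('([' + re.escape(seps) + '])', text)
    # Even indices are fields, odd indices are the captured separators.
    return ''.join(
        LRM + part + LRM if i % 2 == 0 and part else part
        for i, part in enumerate(parts))

def mark_fields(iri):
    """Return iri with every field surrounded by LRM, or iri if no match."""
    m = BIDI_IRI.match(iri)
    if m is None:
        return iri
    out = []
    if m.group('scheme') is not None:
        out.append(LRM + m.group('scheme') + LRM + '://')
    out.append(wrap_fields(m.group('domain'), IDN_SEPS))
    if m.group('path') is not None:
        out.append('/' + wrap_fields(m.group('path'), '/'))
    if m.group('query') is not None:
        out.append('?' + wrap_fields(m.group('query'), '=&'))
    if m.group('fragment') is not None:
        fragment = m.group('fragment')
        out.append('#' + (LRM + fragment + LRM if fragment else ''))
    return ''.join(out)
```

Removing the LRM characters from the result gives back the original string,
so the marking affects display order only.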

Received on Wednesday, 27 April 2011 14:36:41 UTC