- From: Mark Davis ☕ <mark@macchiato.com>
- Date: Wed, 27 Apr 2011 07:36:08 -0700
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: Slim Amamou <slim@alixsys.com>, Shawn Steele <Shawn.Steele@microsoft.com>, "public-iri@w3.org" <public-iri@w3.org>, bidi@unicode.org
- Message-ID: <BANLkTimsQOsbJxDQcP5YozBoZHOb--7oWQ@mail.gmail.com>
Mark

*— Il meglio è l’inimico del bene —*

On Wed, Apr 27, 2011 at 01:38, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:

> [Please note that Mark's document, as far as the IETF is concerned, does
> not have any standing at all. Sending the actual text to the mailing list,
> or submitting the document as an Internet-Draft would fix that.]

Here you go.

BIDI URL Display
M. Davis
Rough Draft 2011-04-26

Contents

- BIDI URL Display <https://docs.google.com/document/d/1c8-svx7og0qBUfGBobw7LYfOcNeDVPYbNVMNpSqYCFo/edit?hl=en&pli=1#heading=h.m18nbt5lll2y>
- Recognizing IRIs in Plaintext <https://docs.google.com/document/d/1c8-svx7og0qBUfGBobw7LYfOcNeDVPYbNVMNpSqYCFo/edit?hl=en&pli=1#heading=h.py5tagyznmja>
- Open issues: <https://docs.google.com/document/d/1c8-svx7og0qBUfGBobw7LYfOcNeDVPYbNVMNpSqYCFo/edit?hl=en&pli=1#heading=h.r4fahox3hb2n>
- Applying the BIDI algorithm <https://docs.google.com/document/d/1c8-svx7og0qBUfGBobw7LYfOcNeDVPYbNVMNpSqYCFo/edit?hl=en&pli=1#heading=h.rapo13ddq165>
- ELM <https://docs.google.com/document/d/1c8-svx7og0qBUfGBobw7LYfOcNeDVPYbNVMNpSqYCFo/edit?hl=en&pli=1#heading=h.hmjer2p3qt90>

The Unicode Bidirectional Algorithm (UBA) was designed for handling normal text, and predated the rise of the web. Unfortunately, URLs* are not normal text; they are syntactically complex in ways that don’t work well with the UBA. That causes URLs to appear jumbled to bidi users (Arabic/Hebrew/…).

[*Formally speaking, what we are talking about are IRIs, although most end users know them as “URLs”; see idn-and-iri <http://www.w3.org/International/articles/idn-and-iri/>.]

People have been looking for an extension to the UBA that handles IRIs in a more consistent way for bidi users. The general goal would be for the “fields” of an IRI to flow in a consistent direction. This requires consistency in usage across different applications.
For example, when someone copies the contents of an address bar into an email, we don’t want all the fields in the URL to switch around. Such consistency would require a general extension to the UBA to indicate how IRIs should be displayed, both in the limited context of an address bar and in other contexts.

The challenges are:

1. There will be a long migration period, so the negative effects need to be mitigated as much as possible.
2. For the purpose of the UBA, we need a simple, comprehensible way to recognize IRIs in plaintext.

Recognizing IRIs in Plaintext

Here are rough thoughts about the latter task (recognizing IRIs in plaintext) for use with the UBA. There are two problems:

1. A formal reading of the spec allows almost anything in fields, so it is hard to test for termination.
2. The URL may not be headed by a scheme in plaintext (e.g., google.com should be recognized).

While in theory almost anything can occur in fields, in practice many Unicode characters never or very rarely occur in these contexts. So one approach is to have a simplified syntax that could be easily recognized. Any characters that were legal but outside of that syntax would need to be represented with % escapes if they are to be handled by the UBA.

    bidiIri := ((scheme “://” domain) | domain2) (“/” path)? (“?” query)? (“#” fragment)?
    domain := UTS46Chars+ (IDNSep UTS46Chars+)* IDNSep?
    domain2 := domain IDNSep TLD IDNSep?
    path := (char - “?” - “#”)*
    query := (char - “#”)*
    fragment := char*
    IDNSep := [\u002E \uFF0E \u3002 \uFF61] // see http://unicode.org/reports/tr46/#Notation
    TLD := <list on http://www.iana.org/domains/root/db/>
    char := percentEncodedUTF8 | [[:L:][:N:][:M:][:S:][:Pd:][:Pc:][:Cf:] inclusionChar - exclusionChar]

inclusionChar :=

    U+0021 <http://unicode.org/cldr/utility/character.jsp?a=0021> ( ! ) EXCLAMATION MARK
    U+0022 <http://unicode.org/cldr/utility/character.jsp?a=0022> ( " ) QUOTATION MARK
    U+0023 <http://unicode.org/cldr/utility/character.jsp?a=0023> ( # ) NUMBER SIGN
    U+0025 <http://unicode.org/cldr/utility/character.jsp?a=0025> ( % ) PERCENT SIGN
    U+0026 <http://unicode.org/cldr/utility/character.jsp?a=0026> ( & ) AMPERSAND
    U+0027 <http://unicode.org/cldr/utility/character.jsp?a=0027> ( ' ) APOSTROPHE
    U+002A <http://unicode.org/cldr/utility/character.jsp?a=002A> ( * ) ASTERISK
    U+002C <http://unicode.org/cldr/utility/character.jsp?a=002C> ( , ) COMMA
    U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E> ( . ) FULL STOP
    U+002F <http://unicode.org/cldr/utility/character.jsp?a=002F> ( / ) SOLIDUS
    U+003A <http://unicode.org/cldr/utility/character.jsp?a=003A> ( : ) COLON
    U+003B <http://unicode.org/cldr/utility/character.jsp?a=003B> ( ; ) SEMICOLON
    U+003F <http://unicode.org/cldr/utility/character.jsp?a=003F> ( ? ) QUESTION MARK
    U+0040 <http://unicode.org/cldr/utility/character.jsp?a=0040> ( @ ) COMMERCIAL AT
    U+005C <http://unicode.org/cldr/utility/character.jsp?a=005C> ( \ ) REVERSE SOLIDUS
    U+00A1 <http://unicode.org/cldr/utility/character.jsp?a=00A1> ( ¡ ) INVERTED EXCLAMATION MARK
    U+00B7 <http://unicode.org/cldr/utility/character.jsp?a=00B7> ( · ) MIDDLE DOT
    U+00BF <http://unicode.org/cldr/utility/character.jsp?a=00BF> ( ¿ ) INVERTED QUESTION MARK

exclusionChar :=

    U+003C <http://unicode.org/cldr/utility/character.jsp?a=003C> ( < ) LESS-THAN SIGN
    U+003E <http://unicode.org/cldr/utility/character.jsp?a=003E> ( > ) GREATER-THAN SIGN

In addition, a final inclusionChar is excluded from the bidiIri when parsing (it is possible to capture this with the BNF, but it makes the formulation much less readable).

For the full IRI syntax, see: http://rfc-ref.org/RFC-TEXTS/3987/chapter2.html

- TBD: add userinfo, port, IP, etc.
- TBD: include some statistics on the character usage in URLs.
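To make the char rule and the trailing-inclusionChar exception concrete, here is a minimal Python sketch. It is not part of the draft: the names is_bidi_iri_char and scan_bidi_iri are hypothetical, and the sketch only checks characters one at a time. It does not validate the scheme, the domain (UTS46Chars), or the TLD list, and it treats percent-encoded sequences simply as a run of allowed characters ("%" plus hex digits).

```python
import unicodedata

# inclusionChar and exclusionChar exactly as listed in the draft (ASCII/Latin-1).
INCLUSION_CHARS = set("!\"#%&'*,./:;?@\\\u00A1\u00B7\u00BF")
EXCLUSION_CHARS = set("<>")

# General categories named in the draft's char rule:
# [:L:] [:N:] [:M:] [:S:] are whole major classes; Pd, Pc, Cf are exact categories.
ALLOWED_MAJOR = {"L", "N", "M", "S"}
ALLOWED_EXACT = {"Pd", "Pc", "Cf"}


def is_bidi_iri_char(ch: str) -> bool:
    """Hypothetical helper: True if ch may appear inside a bidiIri."""
    if ch in EXCLUSION_CHARS:
        return False
    if ch in INCLUSION_CHARS:
        return True
    category = unicodedata.category(ch)
    return category[0] in ALLOWED_MAJOR or category in ALLOWED_EXACT


def scan_bidi_iri(text: str, start: int = 0) -> str:
    """Hypothetical helper: return the longest run of allowed characters
    beginning at start, minus one final inclusionChar if the run ends in one."""
    end = start
    while end < len(text) and is_bidi_iri_char(text[end]):
        end += 1
    # The draft excludes a final inclusionChar (for example a sentence-ending ".").
    if end > start and text[end - 1] in INCLUSION_CHARS:
        end -= 1
    return text[start:end]


if __name__ == "__main__":
    sample = "(see http://example.com/a?b=c.)"
    print(scan_bidi_iri(sample, 5))
```

A caller would still need to decide where a candidate starts (a scheme, or a domain ending in a known TLD); the sketch assumes the start offset is already known.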
Termination

Thus the bidi IRI basically terminates with:

1. Unassigned, surrogates, private-use, control codes
2. Whitespace
3. Open, close or most ‘other’ punctuation.

For ASCII and Latin1, the items in #3 are:

    U+003C <http://unicode.org/cldr/utility/character.jsp?a=003C> ( < ) LESS-THAN SIGN
    U+003E <http://unicode.org/cldr/utility/character.jsp?a=003E> ( > ) GREATER-THAN SIGN
    U+0028 <http://unicode.org/cldr/utility/character.jsp?a=0028> ( ( ) LEFT PARENTHESIS
    U+0029 <http://unicode.org/cldr/utility/character.jsp?a=0029> ( ) ) RIGHT PARENTHESIS
    U+005B <http://unicode.org/cldr/utility/character.jsp?a=005B> ( [ ) LEFT SQUARE BRACKET
    U+005D <http://unicode.org/cldr/utility/character.jsp?a=005D> ( ] ) RIGHT SQUARE BRACKET
    U+007B <http://unicode.org/cldr/utility/character.jsp?a=007B> ( { ) LEFT CURLY BRACKET
    U+007D <http://unicode.org/cldr/utility/character.jsp?a=007D> ( } ) RIGHT CURLY BRACKET
    U+00AB <http://unicode.org/cldr/utility/character.jsp?a=00AB> ( « ) LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    U+00BB <http://unicode.org/cldr/utility/character.jsp?a=00BB> ( » ) RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

Open issues:

- Should [:Cf:] (or some of them) also require percent encoding?
- The inclusionChars are from ASCII/Latin1. Do we want to tweak any of the ASCII/Latin1 exceptions to move them to terminating characters?

Applying the BIDI algorithm

Here’s how we’d apply the bidi algorithm extension (this is just a strawman, and needs more discussion).

Given a bidiIri:

A separator is defined as any instance of the quoted strings in the bidiIri line above, or:

- IDNSep in a domain
- “/” in a path
- “=” or “&” in a query

A field is text between separators, or at the front/end.

If a bidiIri is recognized, then it is handled by the UBA as if each field is surrounded by LRM.
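As a rough illustration of the separator and field definitions above, here is a small Python sketch. It is only one reading of the strawman, not the draft's method: tokenize and wrap_fields are hypothetical names, userinfo and port are not handled (they are TBD in the draft), and the sketch inserts literal LRM (U+200E) characters purely to show where the "as if surrounded by LRM" boundaries would fall. A real implementation would presumably apply this inside the bidi algorithm rather than modify the text.

```python
import re

LRM = "\u200E"                      # LEFT-TO-RIGHT MARK
IDN_SEP = ".\uFF0E\u3002\uFF61"     # IDNSep from the draft's grammar


def tokenize(iri: str) -> list:
    """Hypothetical helper: split an already-recognized bidiIri into
    ("field", text) and ("sep", text) tokens, in order."""
    tokens = []

    def add_fields(text: str, seps: str) -> None:
        # Split on any single-character separator in seps, keeping the separators.
        if not seps:
            pieces = [text]
        else:
            pieces = re.split("([" + re.escape(seps) + "])", text)
        for i, piece in enumerate(pieces):
            if piece:  # skip empty fields, e.g. between adjacent separators
                tokens.append(("sep" if i % 2 else "field", piece))

    # Peel off fragment, then query, following the order in the bidiIri rule.
    head, hash_sep, fragment = iri.partition("#")
    head, query_sep, query = head.partition("?")
    scheme, scheme_sep, rest = head.partition("://")
    if scheme_sep:
        tokens.append(("field", scheme))
        tokens.append(("sep", scheme_sep))
    else:
        rest = head  # no scheme: the domain2 form, e.g. google.com
    domain, path_sep, path = rest.partition("/")

    add_fields(domain, IDN_SEP)
    if path_sep:
        tokens.append(("sep", path_sep))
        add_fields(path, "/")
    if query_sep:
        tokens.append(("sep", query_sep))
        add_fields(query, "=&")
    if hash_sep:
        tokens.append(("sep", hash_sep))
        add_fields(fragment, "")
    return tokens


def wrap_fields(iri: str) -> str:
    """Hypothetical helper: surround every field with LRM, leave separators bare."""
    return "".join(
        LRM + text + LRM if kind == "field" else text
        for kind, text in tokenize(iri)
    )


if __name__ == "__main__":
    for kind, text in tokenize("http://example.com/a/b?x=1&y=2#top"):
        print(kind, repr(text))
```

Keeping separators as their own tokens means the original string can always be reassembled by concatenating the token texts, which makes the field boundaries easy to inspect when trying the idea on mixed Arabic or Hebrew examples.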
Received on Wednesday, 27 April 2011 14:36:41 UTC