- From: Adil Allawi <adil@diwan.com>
- Date: Tue, 07 Jun 2011 14:14:09 +0100
- To: public-iri@w3.org
- Message-ID: <4DEE2421.1090706@diwan.com>
Following is a summary of the discussion so far.
I tried to write it as text but got lost in the details so I created a
mind map. You can see it as an image here:
http://ironymark.diwan.com/2011/06/the-trials-of-bidi-iris/ and the text
of the map is below. Please tell me if you see any omissions or want to
add any points:
* Bidi IRI
o Consistency
+ usage across different applications
+ copy the contents of an address bar into an email
o Migration
+ There will be a long migration period, so make sure
that the negative effects are mitigated as much as
possible.
o Usability
+ simple, comprehensible way to recognize IRIs in plaintext
+ easy and unambiguous human translation from a
displayed IRI (napkin, bus side) to the corresponding
logical string
o Security
WWW.HACKERS.COM/com.bank.www
<http://WWW.HACKERS.COM/com.bank.www> would be displayed
as www.bank.com/MOC.SREKCAH.WWW
<http://www.bank.com/MOC.SREKCAH.WWW>, the same as
www.bank.com/COM.HACKERS.WWW
<http://www.bank.com/COM.HACKERS.WWW>.
Furthermore,
http://WWW.HACKERS.COM?path/boring/and/long/very/a/com.bank.www//:http
<http://WWW.HACKERS.COM/?path/boring/and/long/very/a/com.bank.www//:http> would
be displayed as
http://www.bank.com/a/very/long/and/boring/path?MOC.SREKCAH.WWW//:http,
the same as
http://www.bank.com/a/very/long/and/boring/path?COM.HACKERS.WWW//:http.
o *Proposal 1: UBA Extension -*/cause the “fields” of an IRI to flow
in a consistent direction./
+ A /separator/ is defined as any instance of the quoted
strings in the bidi_IRI BNF:
right after a scheme: “://”
in a domain: IDNSep
right after a domain: “/” , “?”, “#”
in a path: “/”
right after a path: “?”, “#”
in a query: “=” or “&”
right after a query: “#”
+ A /field/ is defined as any text between separators,
or at the front or end.
+ Ordering options:
1. Each bidi_IRI is displayed with fields from left
to right. Thus the following will always appear
with the same display, whether in a RTL or LTR
environment.
http://ab.cd.com/mn/op
http://ab.cd.*FE.HG.*com/*JI/LK/*mn/op
http://*FE.HG/JI/LK*
2. the ordering of fields could be subject to the
environment (whether the current embedding level
is RTL or LTR). In that case, the display would
be something like:
LTR:
http://ab.cd.com/mn/op
http://ab.cd.*FE.HG.*com/*JI/LK/*mn/op
http://FE.HG/JI/LK
RTL:
op/mn/com.cd.ab//:http
op/mn*/LK/JI*/com*.HG.FE*.cd.ab//:http
*LK/JI/HG.FE*//:http
3. the ordering would depend not on the environment alone,
but instead on whether there are any RTL
characters in the IRI.
http://ab.cd.com/mn/op
op/mn*/LK/JI*/com*.HG.FE*.cd.ab//:http
*LK/JI/HG.FE*//:http
+ Method:
# the entire bidi_IRI is embedded in <LRE>...<PDF>
# each field is surrounded by LRMs or RLMs
depending on the main direction.
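The separator and field definitions of Proposal 1, together with the method above, can be sketched roughly as follows. This is only an illustration and not something proposed in the discussion: the names split_fields and wrap_for_display are made up, the IDNSep characters are taken from the BNF in this summary, a scheme is assumed to be present, and reading "main direction" as a simple "ltr"/"rtl" switch between LRM and RLM is my interpretation.

```python
import re

LRE, PDF = "\u202A", "\u202C"  # LEFT-TO-RIGHT EMBEDDING, POP DIRECTIONAL FORMATTING
LRM, RLM = "\u200E", "\u200F"  # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK

# IDNSep from the BNF: U+002E, U+FF0E, U+3002, U+FF61
IDN_SEP_CLASS = "([.\uFF0E\u3002\uFF61])"

def split_fields(iri):
    """Return [field, separator, field, ...] for an IRI that has a scheme.

    Even positions hold fields, odd positions hold separators.
    Hypothetical helper; no validation of the characters is attempted.
    """
    scheme, sep, rest = iri.partition("://")
    if not sep:
        raise ValueError("this sketch expects a scheme followed by '://'")
    # Cut the remainder into domain, path, query and fragment.
    m = re.fullmatch(r"([^/?#]*)(?:/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?",
                     rest, re.DOTALL)
    domain, path, query, fragment = m.groups()
    tokens = [scheme, "://"]
    tokens += re.split(IDN_SEP_CLASS, domain)          # in a domain: IDNSep
    if path is not None:
        tokens += ["/"] + re.split("(/)", path)        # in a path: "/"
    if query is not None:
        tokens += ["?"] + re.split("([=&])", query)    # in a query: "=" or "&"
    if fragment is not None:
        tokens += ["#", fragment]                      # no separators inside
    return tokens

def wrap_for_display(iri, main_direction="ltr"):
    """Embed the whole IRI in LRE...PDF and surround each field with marks."""
    mark = LRM if main_direction == "ltr" else RLM
    pieces = []
    for index, token in enumerate(split_fields(iri)):
        is_field = index % 2 == 0
        pieces.append(mark + token + mark if is_field else token)
    return LRE + "".join(pieces) + PDF
```

For example, split_fields("http://ab.cd.com/mn/op") yields the fields http, ab, cd, com, mn and op with the separators between them. Whether the marks alone are enough to produce the displays listed under the ordering options is exactly what is still under discussion.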
o Definition of a Bidi-IRI
+ characters
*bidiIri* := ((scheme “://” domain) | domain2)
(“/” path)? (“?” query)? (“#” fragment)?
*domain* := UTS46Chars+ (IDNSep UTS46Chars+)*
IDNSep?
*domain2* := domain IDNSep TLD IDNSep?
*path* := (char - “?” - “#”)*
*query* := (char - “#”)*
*fragment* := char*
*IDNSep* := [\u002E \uFF0E \u3002\uFF61] // see
http://unicode.org/reports/tr46/#Notation
<http://unicode.org/reports/tr46/#Notation>
*TLD* := <list on
http://www.iana.org/domains/root/db/>
*char* := percentEncodedUTF8
|
[[:L:][:N:][:M:][:S:][:Pd:][:Pc:][:Cf:]
inclusionChar - exclusionChar]
*inclusionChar* :=
U+0021
<http://unicode.org/cldr/utility/character.jsp?a=0021> (
! ) EXCLAMATION MARK
U+0022
<http://unicode.org/cldr/utility/character.jsp?a=0022> (
" ) QUOTATION MARK
U+0023
<http://unicode.org/cldr/utility/character.jsp?a=0023> (
# ) NUMBER SIGN
U+0025
<http://unicode.org/cldr/utility/character.jsp?a=0025> (
% ) PERCENT SIGN
U+0026
<http://unicode.org/cldr/utility/character.jsp?a=0026> (
& ) AMPERSAND
U+0027
<http://unicode.org/cldr/utility/character.jsp?a=0027> (
' ) APOSTROPHE
U+002A
<http://unicode.org/cldr/utility/character.jsp?a=002A> (
* ) ASTERISK
U+002C
<http://unicode.org/cldr/utility/character.jsp?a=002C> (
, ) COMMA
U+002E
<http://unicode.org/cldr/utility/character.jsp?a=002E> (
. ) FULL STOP
U+002F
<http://unicode.org/cldr/utility/character.jsp?a=002F> (
/ ) SOLIDUS
U+003A
<http://unicode.org/cldr/utility/character.jsp?a=003A> (
: ) COLON
U+003B
<http://unicode.org/cldr/utility/character.jsp?a=003B> (
; ) SEMICOLON
U+003F
<http://unicode.org/cldr/utility/character.jsp?a=003F> (
? ) QUESTION MARK
U+0040
<http://unicode.org/cldr/utility/character.jsp?a=0040> (
@ ) COMMERCIAL AT
U+005C
<http://unicode.org/cldr/utility/character.jsp?a=005C> (
\ ) REVERSE SOLIDUS
U+00A1
<http://unicode.org/cldr/utility/character.jsp?a=00A1> (
¡ ) INVERTED EXCLAMATION MARK
U+00B7
<http://unicode.org/cldr/utility/character.jsp?a=00B7> (
· ) MIDDLE DOT
U+00BF
<http://unicode.org/cldr/utility/character.jsp?a=00BF> (
¿ ) INVERTED QUESTION MARK
*exclusionChar* :=
U+003C
<http://unicode.org/cldr/utility/character.jsp?a=003C> (
< ) LESS-THAN SIGN
U+003E
<http://unicode.org/cldr/utility/character.jsp?a=003E> (
> ) GREATER-THAN SIGN
+ termination
1. Unassigned, surrogates, private-use, control codes
Whitespace
Open, close or most ‘other’ punctuation, plus
special cases < and >.
U+003C
<http://unicode.org/cldr/utility/character.jsp?a=003C>
( < ) LESS-THAN SIGN
U+003E
<http://unicode.org/cldr/utility/character.jsp?a=003E>
( > ) GREATER-THAN SIGN
U+0028
<http://unicode.org/cldr/utility/character.jsp?a=0028>
( ( ) LEFT PARENTHESIS
U+0029
<http://unicode.org/cldr/utility/character.jsp?a=0029>
( ) ) RIGHT PARENTHESIS
U+005B
<http://unicode.org/cldr/utility/character.jsp?a=005B>
( [ ) LEFT SQUARE BRACKET
U+005D
<http://unicode.org/cldr/utility/character.jsp?a=005D>
( ] ) RIGHT SQUARE BRACKET
U+007B
<http://unicode.org/cldr/utility/character.jsp?a=007B>
( { ) LEFT CURLY BRACKET
U+007D
<http://unicode.org/cldr/utility/character.jsp?a=007D>
( } ) RIGHT CURLY BRACKET
U+00AB
<http://unicode.org/cldr/utility/character.jsp?a=00AB>
( « ) LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00BB
<http://unicode.org/cldr/utility/character.jsp?a=00BB>
( » ) RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
+ Continuation
1. Letters, Marks, Numbers
Dash and connector punctuation
Symbols (except terminating symbols)
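The character class and the termination/continuation rules above can be approximated with Unicode general categories. The following is a minimal sketch, assuming Python's unicodedata module as the source of categories; the names is_iri_char and iri_end are hypothetical, and percent-encoded sequences are not validated (they pass only because "%" and the hex digits are allowed characters).

```python
import unicodedata

# inclusionChar and exclusionChar as listed in the definition above
INCLUSION = set("!\"#%&'*,./:;?@\\\u00A1\u00B7\u00BF")
EXCLUSION = set("<>")

def is_iri_char(ch):
    """True if ch continues an IRI, False if it terminates it."""
    if ch in EXCLUSION:
        return False
    if ch in INCLUSION:
        return True
    category = unicodedata.category(ch)
    # Letters, Numbers, Marks, Symbols, plus dash/connector punctuation
    # and format characters, as in the *char* production.
    return category[0] in "LNMS" or category in ("Pd", "Pc", "Cf")

def iri_end(text, start=0):
    """Return the index just past the last IRI character at or after start."""
    index = start
    while index < len(text) and is_iri_char(text[index]):
        index += 1
    return index

sample = "see (http://example.com/a?b=c) now"
begin = sample.index("http")
print(sample[begin:iri_end(sample, begin)])  # http://example.com/a?b=c
```

Note that a full stop directly after an IRI at the end of a sentence is not a terminating character under these rules, so a real recognizer would need an extra rule for trailing punctuation.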
o issues
+ TLD
# Is a TLD fixed or are custom patterns allowed?
+ Scheme
# require a scheme for Bidi-IRI recognizer
# enforce a direction based on the scheme
+ UBA Extension
# The same URL (IRI) will be displayed differently
according to the embedding level. This is confusing.
# Pure Latin-character URLs will be displayed in a
new and strange way when the embedding level is
odd. For instance, "http://docs.google.com"
will be displayed as "com.google.docs//:http".
* feedback that this is actually preferred
* preferences seem to be dependent partially
on the user’s culture and partially on
other life experiences
+ define "mostly Latin" and "mostly Arabic or Hebrew".
# first strong in the domain name?
o Proposals
+ Enforce Direction on the basis of the Domain language
# all of domain?
# part of domain?
+ * always order the labels/fields either from left to
right or right to left.
* pick the initial direction from the user environment
(eg: English gets left to right fields, Arabic gets
right to left fields).
* allow the user to override the direction in their
preferences.
+ *locale-based ordering*
# Eg: visiting an en-US web page may have a
different behavior than an ar-EG web page
# get all-RTL IRIs displayed RTL overall, not in a
constant back-and-forth at every separator. This
should be based on the presence of RTL characters in the
domain name
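To make the competing proposals for choosing the field direction easier to compare, here is a rough sketch of three of them side by side: take the direction from the environment, from the presence of RTL characters in the domain, or from the first strong character in the domain. The function field_order and its policy names are invented for this illustration; the bidi classes come from Python's unicodedata module.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes

def first_strong_direction(text):
    """Return "ltr" or "rtl" for the first strong character, else None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in RTL_CLASSES:
            return "rtl"
    return None

def has_rtl(text):
    """True if any character in text has a strong RTL bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

def field_order(domain, environment="ltr", policy="has-rtl"):
    """Pick the direction in which the fields of an IRI are laid out.

    policy is one of (hypothetical names):
      "environment"  : follow the embedding level / user environment
      "has-rtl"      : RTL if the domain contains any RTL character
      "first-strong" : follow the first strong character in the domain,
                       falling back to the environment if there is none
    """
    if policy == "environment":
        return environment
    if policy == "has-rtl":
        return "rtl" if has_rtl(domain) else "ltr"
    if policy == "first-strong":
        return first_strong_direction(domain) or environment
    raise ValueError("unknown policy: " + policy)
```

Under the "has-rtl" policy, docs.google.com keeps left-to-right fields even in an RTL environment, which avoids the "com.google.docs//:http" display listed under the issues. A user preference override, as suggested above, would sit on top of whichever policy is chosen.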
On 06/06/2011 13:53, Larry Masinter wrote:
>
> Could someone summarize the requirements for BIDI representation and
> display, and the design choices we’re facing and how they match up
> against the requirements?
>
Received on Tuesday, 7 June 2011 13:14:36 UTC