Re: Proposed resolution of HRRI/IRI discussion from Addison Phillips on 2007-09-18 (public-xml-core-wg@w3.org from September 2007)

From: Addison Phillips <addison@yahoo-inc.com>
Date: Tue, 18 Sep 2007 08:55:06 -0700
To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
CC: public-i18n-core@w3.org, public-xml-core-wg@w3.org
Message-ID: <46EFF4DA.8040608@yahoo-inc.com>

Hi Henry,

First, let me express my regret for not responding to this message 
earlier. While the working group addressed this issue a couple of weeks 
ago, I have been caught up with travel and some personal events that 
have interfered with my response. I apologize for the delay, as I know 
your WG is keen to resolve this issue.

The Internationalization Core WG has discussed your proposal (below) 
regarding "LEIRIs". We discussed it in our teleconference of 4 
September, as recorded here:

  http://www.w3.org/2007/09/04-core-minutes.html#item04

We also contacted Martin Dürst regarding its incorporation into the IRI 
specification. He has agreed a) that this is a Good Thing and b) that he 
will incorporate it at his earliest convenience into the draft, which he 
is working to complete in a timely manner.

At present I do not have a specific schedule for the completion of the 
new version of IRI from Martin, but I will attempt to secure one in the 
near future. I understand that he has just returned from his summer 
vacation and is still coming back up to speed on this project.

If you have comments or questions about this resolution, I'd be happy to 
address them. Again, my apologies for not communicating more quickly 
with your WG. We all very much appreciate your patience.

Best Regards (for I18N Core),

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.


> 
> We would like to suggest that the best way to move forward with our
> effort to reconcile the differences between the way in which various
> specifications in the XML family allow a superset of IRIs, and the
> IRI spec. itself, would be to incorporate a new section in the
> revision of the IRI spec. that you are currently working on, which
> would name and define a single concept to be referenced from all
> those XML specs, along the following lines:
> 
> Name (negotiable): Legacy Extended IRIs (LEIRIs)
> 
> Definition (based on [1], with subsequent additions):
> 
>  A Legacy Extended International Resource Identifier (LEIRI) is a
>  sequence of Unicode characters that can be converted into an IRI by
>  the application of a few simple encoding rules.
> 
>  To convert a Legacy Extended International Resource Identifier to
>  an IRI reference, the following characters MUST be percent encoded
>  by applying steps 2.1 to 2.3 of Section 3.1:
> 
>   * space #x20
>   * the delimiters "<" #x3C, ">" #x3E, and '"' #x22
>   * the unwise characters "{" #x7B, "}" #x7D, "|" #x7C, "\" #x5C,
>     "^" #x5E, and "`" #x60
>   * The 'unreasonable' characters:
> 
>                #x0  - #x1F |         /* C0 controls */
>                #x7F - #x9F |         /* DEL and C1 controls */
>                #x200E | #x200F | #x202A - #x202E | /* Bidi formatting characters */
>                #xE000 - #xF8FF |     /* private use */
>                #xFDD0 - #xFDEF |     /* non-characters */
>                #x1FFFE - #x1FFFF |   /* non-characters */
>                [similar lines for each plane from 2 -- F]
>                #x10FFFE - #x10FFFF | /* non-characters */
>                #xE0000 - #xE0FFF |   /* tags - I don't understand these */
>                #xF0000 - #xFFFFD |   /* private use */
>                #x100000 - #x10FFFD   /* private use */
> 
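
A minimal Python sketch of the conversion described in the quoted
definition, assuming that "steps 2.1 to 2.3 of Section 3.1" mean: encode
the character as UTF-8, write each octet as %HH, and substitute the
result for the original character. The names LEIRI_ESCAPE_RANGES,
must_escape and leiri_to_iri are hypothetical, and the uppercase hex
digits are a choice made for this sketch.

```python
# Sketch only: these names are illustrative, not taken from any spec or library.

def _noncharacters_above_bmp():
    # U+nFFFE and U+nFFFF for planes 1 through 16 (the "similar lines" in the list).
    return [((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF)
            for plane in range(1, 0x11)]

# Inclusive code point ranges that the proposal says MUST be percent encoded.
LEIRI_ESCAPE_RANGES = [
    (0x00, 0x1F),                                # C0 controls
    (0x20, 0x20),                                # space
    (0x22, 0x22), (0x3C, 0x3C), (0x3E, 0x3E),    # delimiters " < >
    (0x5C, 0x5C), (0x5E, 0x5E), (0x60, 0x60),    # unwise \ ^ `
    (0x7B, 0x7D),                                # unwise { | }
    (0x7F, 0x9F),                                # DEL and C1 controls
    (0x200E, 0x200F), (0x202A, 0x202E),          # bidi formatting characters
    (0xE000, 0xF8FF),                            # private use
    (0xFDD0, 0xFDEF),                            # non-characters
    (0xE0000, 0xE0FFF),                          # tags
    (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),    # private use
] + _noncharacters_above_bmp()


def must_escape(ch):
    """Return True if the character falls in one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LEIRI_ESCAPE_RANGES)


def leiri_to_iri(leiri):
    """Percent encode only the listed characters; leave everything else alone."""
    out = []
    for ch in leiri:
        if must_escape(ch):
            # Encode as UTF-8, then write each octet as %HH.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(leiri_to_iri("http://example.org/a b<c>"))
```

Characters outside the listed ranges, including non-ASCII letters and
existing percent escapes, pass through unchanged in this sketch.
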
> Health Warning: We would be happy to see some text added to warn
>  against creating new LEIRIs using most or indeed almost all of the
>  characters allowed by this, perhaps expanding on what is already
>  present in [1]: "[A]uthors of [LEIRI]s are advised to percent
>  encode space characters themselves, rather than rely on the
>  processor to do so, because spaces are often used to separate
>  [LEIRI]s in a sequence."
> 
> Security considerations, to be added to section 8:
> 
>  Additional risks resulting from the additional characters allowed
>  in LEIRIs include:
> 
>   - Some characters may not be permitted by the context.  For
>   example, NUL characters are not allowed in XML documents.
> 
>   - The use of control characters and bidirectional formatting
>   characters may allow malicious users to manipulate the displayed
>   version of an LEIRI.
> 
>   - Control characters and non-characters, or LEIRIs containing them,
>   may be filtered out by receivers.
> 
>   - Private use characters are not interoperable and may have
>   unpredictable effects.
> 
>   - Whitespace characters may be subject to normalization in certain
>   contexts.  For example, line endings in XML are normalized to LF;
>   tabs in XML attributes are converted to spaces; and sequences of
>   spaces are collapsed in tokenized XML attributes.
> 
>   - Some characters may be treated as delimiters in some contexts.
>   For example, spaces are often used to separate resource
>   identifiers in a sequence, and angle brackets are often used to
>   delimit resource identifiers in text.
> 
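
The risks listed above could be flagged mechanically before an LEIRI is
displayed or stored. A rough Python sketch follows; the function name
leiri_risks and the risk labels are made up for illustration. It relies
on the standard unicodedata module and treats U+FFFE/U+FFFF in every
plane as non-characters, which is slightly broader than the list in the
proposal.

```python
import unicodedata


def leiri_risks(leiri):
    """Return a sorted list of (hypothetical) risk labels found in the string."""
    risks = set()
    for ch in leiri:
        cp = ord(ch)
        category = unicodedata.category(ch)
        if category == "Cc":                      # C0, DEL and C1 controls
            risks.add("control")
        if cp in (0x200E, 0x200F) or 0x202A <= cp <= 0x202E:
            risks.add("bidi-formatting")
        if category == "Co":                      # private use areas
            risks.add("private-use")
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            risks.add("noncharacter")
        if ch in " \t\r\n":                       # subject to XML normalization
            risks.add("whitespace")
        if ch in '<>"':                           # often used to delimit identifiers
            risks.add("delimiter")
    return sorted(risks)


print(leiri_risks("http://example.org/a b"))
```

A caller might refuse to render an LEIRI whose result includes
"bidi-formatting" or "control", and merely warn on "whitespace".
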
>  Because Legacy Extended International Resource Identifiers are often
>  converted to IRIs or URIs and subsequently used to provide a compact
>  set of instructions for access to network resources, care must be
>  taken to properly interpret the data within a Legacy Extended
>  International Resource Identifier, to prevent that data from causing
>  unintended access, and to avoid including data that should not be
>  revealed in plain text. [this para. probably overlaps somewhat with
>  material already present in section 8; it's here just as a starting
>  point]
> 
> ---------
> 
> We would expect to go ahead and publish several specs. which are
> waiting for a resolution of this issue, e.g. XML Base 2e and XLink
> 1.1, once there is a stable and agreed-final Internet Draft of a new
> edition of 3987 including agreed prose along the lines given above,
> leaving the insertion of the final RFC number to subsequent errata.
> 
> Please let us know what you think.
> 
> Henry S. Thompson
> Richard Tobin
> on behalf of the XML Core Working Group
> 
> [1] http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-01c.html
> -- 
>  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
>                      Half-time member of W3C Team
>     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
>             Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
>                    URL: http://www.ltg.ed.ac.uk/~ht/
Received on Tuesday, 18 September 2007 15:56:08 UTC