FW: Proposed resolution of HRRI/IRI discussion

Chaps, Martin,

This is a formal proposal from the XML Core WG to move forward the HRRI
question by incorporating text into the IRI spec revision.  Note, in
particular, that they are willing to point to the "IRI spec or its
successor" pending publication of the next version, so that things can move
forward on the XML front in the meantime. 

I think this is a major step forward, but we will need to be seen as
proactive in moving things forward to maintain their confidence.

There are likely to be some small details to iron out, but I think we need
to keep those separate from discussion/decisions on the major questions.
These are likely to be (a) do we agree to incorporate some text in the IRI
spec to address LEIRIs? and (b) can we reach agreeable wording on the issue
of spaces?  (On the latter point, I think we all see eye to eye, and the XML
folks are trying, as best they can, to shore up a problem that is already
out there.)

I suggest we discuss this at the next telecon.  It may be helpful if Martin
were able to attend?

RI



============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)
 
http://www.w3.org/People/Ishida/
http://www.w3.org/International/
http://people.w3.org/rishida/blog/
http://www.flickr.com/photos/ishida/
 

-----Original Message-----
From: public-i18n-core-request@w3.org
[mailto:public-i18n-core-request@w3.org] On Behalf Of Henry S. Thompson
Sent: 29 August 2007 17:36
To: public-i18n-core@w3.org
Cc: public-xml-core-wg@w3.org
Subject: Proposed resolution of HRRI/IRI discussion


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

We would like to suggest that the best way to move forward with our effort
to reconcile the differences between the way in which various specifications
in the XML family allow a superset of IRIs, and the IRI spec. itself, would
be to incorporate a new section in the revision of the IRI spec. that you
are currently working on, which would name and define a single concept to be
referenced from all those XML specs, along the following lines:

Name (negotiable): Legacy Extended IRIs (LEIRIs)

Definition (based on [1], with subsequent additions):

 A Legacy Extended International Resource Identifier (LEIRI) is a sequence
 of Unicode characters that can be converted into an IRI by the application
 of a few simple encoding rules.

 To convert a Legacy Extended International Resource Identifier to an IRI
 reference, the following characters MUST be percent encoded by applying
 steps 2.1 to 2.3 of Section 3.1:

  * space #x20
  * the delimiters "<" #x3C, ">" #x3E, and '"' #x22
  * the unwise characters "{" #x7B, "}" #x7D, "|" #x7C, "\" #x5C,
    "^" #x5E, and "`" #x60
  * The 'unreasonable' characters:

               #x0  - #x1F |         /* C0 controls */
               #x7F - #x9F |         /* DEL and C1 controls */
               #x200E | #x200F | #x202A-E /* Bidi formatting characters */
               #xE000 - #xF8FF |     /* private use */
               #xFDD0 - #xFDEF |     /* non-characters */
               #x1FFFE - #x1FFFF |   /* non-characters */
               [similar lines for every plane from 2 -- F]
               #x10FFFE - #x10FFFF | /* non-characters */
               #xE0000 - #xE0FFF |   /* tags - I don't understand these */
               #xF0000 - #xFFFFD |   /* private use */
               #x100000 - #x10FFFD   /* private use */
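
[Editorial illustration, not part of the proposal.] The conversion described
above can be sketched in Python as follows. This is a minimal sketch under
stated assumptions: the function and variable names are hypothetical, the
ranges are transcribed from the list above, the elided "similar lines" are
read as the last two code points of each plane from 1 to 10 (hex), and steps
2.1 to 2.3 of Section 3.1 are taken to mean "encode the character as UTF-8,
then write each octet as %HH in uppercase hex". Existing "%" escapes are left
untouched.

```python
# Hypothetical helper names; ranges transcribed from the list in the proposal.
_ASCII_TO_ENCODE = set(' <>"{}|\\^`')  # space, delimiters, unwise characters

_RANGES = [
    (0x0, 0x1F),          # C0 controls
    (0x7F, 0x9F),         # DEL and C1 controls
    (0x200E, 0x200F),     # bidi formatting characters
    (0x202A, 0x202E),     # bidi formatting characters
    (0xE000, 0xF8FF),     # private use
    (0xFDD0, 0xFDEF),     # non-characters
    (0xE0000, 0xE0FFF),   # tags
    (0xF0000, 0xFFFFD),   # private use
    (0x100000, 0x10FFFD), # private use
]


def _must_encode(ch):
    """Return True if the proposal lists ch as needing percent-encoding."""
    cp = ord(ch)
    if ch in _ASCII_TO_ENCODE:
        return True
    # Assumption: non-characters U+nFFFE and U+nFFFF for planes 1 to 16.
    if cp >= 0x1FFFE and (cp & 0xFFFF) >= 0xFFFE:
        return True
    return any(lo <= cp <= hi for lo, hi in _RANGES)


def leiri_to_iri(leiri):
    """Percent-encode only the listed characters, via their UTF-8 octets."""
    out = []
    for ch in leiri:
        if _must_encode(ch):
            octets = ch.encode("utf-8")
            out.append("".join("%{:02X}".format(b) for b in octets))
        else:
            out.append(ch)
    return "".join(out)
```

For example, under these assumptions leiri_to_iri("http://example.org/a b")
gives "http://example.org/a%20b", while non-ASCII characters outside the
listed ranges (such as "é") pass through unchanged, since they are already
legal in an IRI.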

Health Warning: We would be happy to see some text added to warn against
creating new LEIRIs using most or indeed almost all of the characters
allowed by this, perhaps expanding on what is already present in [1]:
"[A]uthors of [LEIRI]s are advised to percent encode space characters
themselves, rather than rely on the processor to do so, because spaces are
often used to separate [LEIRI]s in a sequence."

Security considerations, to be added to section 8:

 Additional risks resulting from the additional characters allowed in
 LEIRIs include:

  - Some characters may not be permitted by the context.  For
  example, NUL characters are not allowed in XML documents.

  - The use of control characters and bidirectional formatting
  characters may allow malicious users to manipulate the displayed
  version of an LEIRI.

  - Control characters and non-characters, or LEIRIs containing them,
  may be filtered out by receivers.

  - Private use characters are not interoperable and may have
  unpredictable effects.

  - Whitespace characters may be subject to normalization in certain
  contexts.  For example, line endings in XML are normalized to LF;
  tabs in XML attributes are converted to spaces; and sequences of
  spaces are collapsed in tokenized XML attributes.

  - Some characters may be treated as delimiters in some contexts.
  For example, spaces are often used to separate resource
  identifiers in a sequence, and angle brackets are often used to
  delimit resource identifiers in text.
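
[Editorial illustration, not part of the proposal.] Two of the risks listed
above, whitespace normalization and spaces acting as delimiters, can be
observed with Python's standard xml.etree.ElementTree module. This is only a
sketch: the element and attribute names ("links", "refs", "href") and the
example.org identifiers are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The href value contains a literal tab; the refs value is meant to hold two
# identifiers, the first of which contains an unencoded space.
doc = '<links refs="http://example.org/a b http://example.org/c" href="x\ty"/>'
elem = ET.fromstring(doc)

# Attribute-value normalization by the XML parser turns the tab into a space.
tab_normalized = elem.get("href")

# Splitting on whitespace cannot tell the embedded space from a separator,
# so two intended identifiers come out as three tokens.
tokens = elem.get("refs").split()
```

Percent-encoding the space as %20 before the value is written into the
document avoids the second problem, which is the point of the health warning
quoted earlier.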

 Because Legacy Extended International Resource Identifiers are often
 converted to IRIs or URIs and subsequently used to provide a compact set of
 instructions for access to network resources, care must be taken to
 properly interpret the data within a Legacy Extended International Resource
 Identifier, to prevent that data from causing unintended access, and to
 avoid including data that should not be revealed in plain text. [this para.
 probably overlaps somewhat with material already present in section 8; it's
 here just as a starting point]

- ---------

We would expect to go ahead and publish several specs. which are waiting for
a resolution of this issue, e.g. XML Base 2e and XLink 1.1, once there is a
stable and agreed-final Internet Draft of a new edition of 3987 including
agreed prose along the lines given above, leaving the insertion of the final
RFC number to subsequent errata.

Please let us know what you think.

Henry S. Thompson
Richard Tobin
on behalf of the XML Core Working Group

[1] http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-01c.html
- --
 Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
                     Half-time member of W3C Team
    2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
            Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                   URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFG1aCDkjnJixAXWBoRAiKpAJ9BWe0SrU9QkCPG5phngZlYiFuH7wCfZ/+0
USkAyTaC0htugqXPYWEp7rA=
=B9sV
-----END PGP SIGNATURE-----

Received on Friday, 31 August 2007 12:56:00 UTC