Proposed resolution of HRRI/IRI discussion from Henry S. Thompson on 2007-08-29 (public-xml-core-wg@w3.org from August 2007)

From: Henry S. Thompson <ht@inf.ed.ac.uk>
Date: Wed, 29 Aug 2007 17:36:19 +0100
To: public-i18n-core@w3.org
Cc: public-xml-core-wg@w3.org
Message-ID: <f5bhcmi1enw.fsf@hildegard.inf.ed.ac.uk>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

We would like to suggest that the best way to move forward with our
effort to reconcile the differences between the way in which various
specifications in the XML family allow a superset of IRIs, and the
IRI spec. itself, would be to incorporate a new section in the
revision of the IRI spec. that you are currently working on, which
would name and define a single concept to be referenced from all
those XML specs, along the following lines:

Name (negotiable): Legacy Extended IRIs (LEIRIs)

Definition (based on [1], with subsequent additions):

 A Legacy Extended International Resource Identifier (LEIRI) is a
 sequence of Unicode characters that can be converted into an IRI by
 the application of a few simple encoding rules.

 To convert a Legacy Extended International Resource Identifier to
 an IRI reference, the following characters MUST be percent encoded
 by applying steps 2.1 to 2.3 of Section 3.1:

  * space #x20
  * the delimiters "<" #x3C, ">" #x3E, and '"' #x22
  * the unwise characters "{" #x7B, "}" #x7D, "|" #x7C, "\" #x5C,
    "^" #x5E, and "`" #x60
  * The 'unreasonable' characters:

               #x0  - #x1F |         /* C0 controls */
               #x7F - #x9F |         /* DEL and C1 controls */
               #x200E | #x200F | #x202A - #x202E | /* Bidi formatting characters */
               #xE000 - #xF8FF |     /* private use */
               #xFDD0 - #xFDEF |     /* non-characters */
               #x1FFFE - #x1FFFF |   /* non-characters */
               [similar lines for every plane from 2 to F]
               #x10FFFE - #x10FFFF | /* non-characters */
               #xE0000 - #xE0FFF |   /* tags - I don't understand these */
               #xF0000 - #xFFFFD |   /* private use */
               #x100000 - #x10FFFD   /* private use */
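
 As a rough illustration only (not proposed spec text), the conversion
 above might be sketched as follows.  The names ESCAPE_RANGES,
 needs_escape and leiri_to_iri are hypothetical; the ranges are
 transcribed from the list above, and steps 2.1 to 2.3 of Section 3.1
 are taken to mean: encode the character as UTF-8, write each octet as
 %HH, and substitute the result for the original character.

```python
# Sketch under stated assumptions; names here are illustrative, not from any spec.

def _build_escape_ranges():
    ranges = [
        (0x20, 0x20),                               # space
        (0x22, 0x22), (0x3C, 0x3C), (0x3E, 0x3E),   # delimiters " < >
        (0x5C, 0x5C), (0x5E, 0x5E), (0x60, 0x60),   # unwise \ ^ `
        (0x7B, 0x7D),                               # unwise { | }
        (0x0, 0x1F),                                # C0 controls
        (0x7F, 0x9F),                               # DEL and C1 controls
        (0x200E, 0x200F), (0x202A, 0x202E),         # bidi formatting characters
        (0xE000, 0xF8FF),                           # private use
        (0xFDD0, 0xFDEF),                           # non-characters
        (0xE0000, 0xE0FFF),                         # tags
        (0xF0000, 0xFFFFD),                         # private use
        (0x100000, 0x10FFFD),                       # private use
    ]
    # Non-characters at the end of planes 1 through 16 (#x1FFFE .. #x10FFFF).
    for plane in range(1, 0x11):
        base = plane << 16
        ranges.append((base | 0xFFFE, base | 0xFFFF))
    return ranges

ESCAPE_RANGES = _build_escape_ranges()

def needs_escape(code_point):
    """True if the code point is in one of the listed ranges."""
    return any(lo <= code_point <= hi for lo, hi in ESCAPE_RANGES)

def leiri_to_iri(leiri):
    """Percent-encode only the listed characters; leave everything else as is."""
    out = []
    for ch in leiri:
        if needs_escape(ord(ch)):
            # UTF-8 octets, each written as %HH.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

 Note that "%" is not in the list, so in this sketch any existing
 percent escapes pass through unchanged, as do ordinary non-ASCII
 characters that are already legal in an IRI.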

Health Warning: We would be happy to see some text added to warn
 against creating new LEIRIs using most or indeed almost all of the
 characters allowed by this, perhaps expanding on what is already
 present in [1]: "[A]uthors of [LEIRI]s are advised to percent
 encode space characters themselves, rather than rely on the
 processor to do so, because spaces are often used to separate
 [LEIRI]s in a sequence."
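
 To make the point about spaces concrete, here is a small sketch,
 assuming a context in which identifiers are carried as a
 space-separated list.  The function names and the example.org
 identifiers are made up for illustration.

```python
# Sketch only: shows why an unencoded space is ambiguous in a
# space-separated list of identifiers. All names are hypothetical.

def join_identifiers(identifiers, encode_spaces):
    """Join identifiers with single spaces, optionally encoding spaces first."""
    if encode_spaces:
        identifiers = [i.replace(" ", "%20") for i in identifiers]
    return " ".join(identifiers)

def split_identifiers(value):
    """Split a space-separated list back into its tokens."""
    return value.split()

ids = ["http://example.org/my doc", "http://example.org/other"]

raw = split_identifiers(join_identifiers(ids, encode_spaces=False))
safe = split_identifiers(join_identifiers(ids, encode_spaces=True))
```

 In the first case the receiver sees three tokens rather than two; in
 the second the original pair survives the round trip.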

Security considerations, to be added to section 8:

 Additional risks resulting from the additional characters allowed
 in LEIRIs include:

  - Some characters may not be permitted by the context.  For
  example, NUL characters are not allowed in XML documents.

  - The use of control characters and bidirectional formatting
  characters may allow malicious users to manipulate the displayed
  version of an LEIRI.

  - Control characters and non-characters, or LEIRIs containing them,
  may be filtered out by receivers.

  - Private use characters are not interoperable and may have
  unpredictable effects.

  - Whitespace characters may be subject to normalization in certain
  contexts.  For example, line endings in XML are normalized to LF;
  tabs in XML attributes are converted to spaces; and sequences of
  spaces are collapsed in tokenized XML attributes.

  - Some characters may be treated as delimiters in some contexts.
  For example, spaces are often used to separate resource
  identifiers in a sequence, and angle brackets are often used to
  delimit resource identifiers in text.

 Because Legacy Extended International Resource Identifiers are often
 converted to IRIs or URIs and subsequently used to provide a compact
 set of instructions for access to network resources, care must be
 taken to properly interpret the data within a Legacy Extended
 International Resource Identifier, to prevent that data from causing
 unintended access, and to avoid including data that should not be
 revealed in plain text. [This para. probably overlaps somewhat with
 material already present in section 8; it's here just as a starting
 point.]

- ---------

We would expect to go ahead and publish several specs. which are
waiting for a resolution of this issue, e.g. XML Base 2e and XLink
1.1, once there is a stable and agreed-final Internet Draft of a new
edition of 3987 including agreed prose along the lines given above,
leaving the insertion of the final RFC number to subsequent errata.

Please let us know what you think.

Henry S. Thompson
Richard Tobin
on behalf of the XML Core Working Group

[1] http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-01c.html
- -- 
 Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
                     Half-time member of W3C Team
    2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
            Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                   URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFG1aCDkjnJixAXWBoRAiKpAJ9BWe0SrU9QkCPG5phngZlYiFuH7wCfZ/+0
USkAyTaC0htugqXPYWEp7rA=
=B9sV
-----END PGP SIGNATURE-----
Received on Wednesday, 29 August 2007 16:36:51 UTC