HRRIs - lists of characters

As Martin quite rightly points out, many of the characters allowed in
HRRIs but not in IRIs are really poor choices, and though we have to
allow them for compatibility with the existing specs, we should
certainly discourage their use.  We also have to add various
characters pointed out by Martin to the list of characters that must
be escaped.

We should perhaps also list explicitly the non-characters such as
surrogates that cannot occur in HRRIs.  These are of course not
allowed by XML, but we don't want to make the definition of HRRI
depend on the definition of XML.

I suggest we add three productions, "reasonable", "unreasonable",
and "disallowed", listing these characters.

What follows should replace the current list of characters to be %-encoded.

  reasonable = #x20 | "<" | ">" | #x22 | "{" | "}" | "|" | "\" | "^" | "`"

  unreasonable = #x0  - #x1F |         /* C0 controls */
                 #x7F - #x9F |         /* DEL and C1 controls */
                 #xE000 - #xF8FF |     /* private use */
                 #xFDD0 - #xFDEF |     /* non-characters */
                 #x1FFFE - #x1FFFF |   /* non-characters */
                 ...
                 #x10FFFE - #x10FFFF | /* non-characters */
                 #xE0000 - #xE0FFF |   /* tags - I don't understand these */
                 #xF0000 - #xFFFFD |   /* private use */
                 #x100000 - #x10FFFD   /* private use */

  disallowed = #xD800 - #xDFFF |       /* surrogates */
               #xFFFE | #xFFFF

The disallowed characters must not occur in HRRIs.  The reasonable and
unreasonable characters may occur, though they may be unavailable for
other reasons - for example, #x0 is not allowed in XML.  Use of the
unreasonable characters is discouraged, and may have security
implications.
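
As a rough illustration of the three productions, here is a minimal
Python sketch of a classifier.  The name classify and the "other"
label for characters in none of the lists are hypothetical, and the
sketch assumes the "..." in the unreasonable production stands for the
last two code points of each supplementary plane (1 through 16).

```python
# Characters from the "reasonable" production.
REASONABLE = frozenset(' <>"{}|\\^`')

# Explicit ranges from the "unreasonable" production.
UNREASONABLE_RANGES = (
    (0x0, 0x1F),           # C0 controls
    (0x7F, 0x9F),          # DEL and C1 controls
    (0xE000, 0xF8FF),      # private use
    (0xFDD0, 0xFDEF),      # non-characters
    (0xE0000, 0xE0FFF),    # tags
    (0xF0000, 0xFFFFD),    # private use
    (0x100000, 0x10FFFD),  # private use
)


def classify(ch):
    """Return which production a single character falls under."""
    cp = ord(ch)
    # "disallowed": surrogates, #xFFFE and #xFFFF.
    if 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF):
        return "disallowed"
    if ch in REASONABLE:
        return "reasonable"
    if any(lo <= cp <= hi for lo, hi in UNREASONABLE_RANGES):
        return "unreasonable"
    # Assumption: the elided lines cover #xnFFFE - #xnFFFF for every
    # supplementary plane n.
    if cp >= 0x10000 and (cp & 0xFFFE) == 0xFFFE:
        return "unreasonable"
    return "other"
```

A checker built on this would reject any string containing a
"disallowed" character and perhaps warn on "unreasonable" ones.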

To convert an HRRI to an IRI reference, the reasonable and
unreasonable characters must be %-encoded, except for private use
characters appearing in the query part.
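
The conversion could be sketched in Python roughly as follows.  This
is only an illustration: the name hrri_to_iri is hypothetical, the
sketch assumes %-encoding is applied to the UTF-8 octets of each
character, that the query part runs from the first "?" to the first
"#", and that the "..." in the unreasonable production covers the last
two code points of each supplementary plane.

```python
REASONABLE = frozenset(' <>"{}|\\^`')
PRIVATE_USE = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))
OTHER_UNREASONABLE = (
    (0x0, 0x1F),         # C0 controls
    (0x7F, 0x9F),        # DEL and C1 controls
    (0xFDD0, 0xFDEF),    # non-characters
    (0xE0000, 0xE0FFF),  # tags
)


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def hrri_to_iri(hrri):
    """%-encode reasonable and unreasonable characters in an HRRI."""
    # Assumption: the query part lies after the first "?" and before
    # the first "#"; a "?" inside the fragment does not start a query.
    hash_pos = hrri.find("#")
    end = len(hrri) if hash_pos < 0 else hash_pos
    q_pos = hrri.find("?", 0, end)

    out = []
    for i, ch in enumerate(hrri):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF):
            raise ValueError("disallowed character U+%04X" % cp)
        in_query = q_pos >= 0 and q_pos < i < end
        plane_nonchar = cp >= 0x10000 and (cp & 0xFFFE) == 0xFFFE
        must_escape = (
            ch in REASONABLE
            or _in_ranges(cp, OTHER_UNREASONABLE)
            or plane_nonchar
            # Private use characters are left alone in the query part.
            or (_in_ranges(cp, PRIVATE_USE) and not in_query)
        )
        if must_escape:
            # Assumption: escape each UTF-8 octet as %HH.
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

For example, hrri_to_iri("http://example.org/a b") gives
"http://example.org/a%20b", while a private use character such as
#xE000 is kept in the query part and escaped elsewhere.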

-- Richard

Received on Tuesday, 26 June 2007 09:32:04 UTC