Re: [charmodReview-17] replacing all URIs with IRIs

On 24/05/2002 20:58:57 Aaron Swartz wrote:
> I would like to draw the TAG's attention to this requirement in charmod:
> """
> W3C specifications that define protocol or format elements (e.g. HTTP
> headers, XML attributes, etc.) which are to be interpreted as URI
> references (or specific subsets of URI references, such as absolute URI
> references, URIs, etc.) SHOULD use Internationalized Resource
> Identifiers (IRI) [I-D IRI] (or an appropriate subset thereof).
> """
>
> RDF, for example, has recently moved to replace URIs with IRIs (or
> something like them). I find this seriously problematic since it will

Are you aware that a number of W3C specifications already support IRIs,
though not under that name?  Examples are XML, XML Schema, XPointer and
XLink.  For example, the XML specification states[1]:

|  System identifiers (and other XML strings meant to be used as URI
|  references) may contain characters that, according to [IETF RFC 2396]
|  and [IETF RFC 2732], must be escaped before a URI can be used to
|  retrieve the referenced resource. The characters to be escaped are the
|  control characters #x0 to #x1F and #x7F (most of which cannot appear in
|  XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the
|  unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and
|  '`' #x60, as well as all characters above #x7F. Since escaping is not
|  always a fully reversible process, it must be performed only when
|  absolutely necessary and as late as possible in a processing chain.

The last sentence above is very important.  Note both the "only when
absolutely necessary" and the "as late as possible in a processing
chain".

The XML specification continues:

|  In particular, neither the process of converting a relative URI to an
|  absolute one nor the process of passing a URI reference to a process or
|  software component responsible for dereferencing it should trigger
|  escaping. When escaping does occur, it must be performed as follows:
|  1. Each disallowed character to be escaped is represented in UTF-8
|     [IETF RFC 2279] as one or more bytes.
|  2. The resulting bytes are escaped with the URI escaping mechanism
|     (that is, converted to %HH, where HH is the hexadecimal notation of
|     the byte value).
|  3. The original character is replaced by the resulting character sequence.
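
To make the three steps concrete, here is a minimal sketch in Python.
The function names and the example.org address are my own illustration,
not anything defined by the XML specification, and the set of characters
to escape is simply the list quoted above:

```python
# ASCII characters the quoted XML passage lists as needing escaping,
# in addition to control characters and everything above #x7F.
DISALLOWED_ASCII = set(' <>"{}|\\^`')


def needs_escape(ch: str) -> bool:
    """True for #x0-#x1F, #x7F and above, and the listed ASCII characters."""
    cp = ord(ch)
    return cp <= 0x1F or cp >= 0x7F or ch in DISALLOWED_ASCII


def escape_uri_reference(ref: str) -> str:
    """Apply the three escaping steps to each disallowed character."""
    out = []
    for ch in ref:
        if needs_escape(ch):
            # Step 1: represent the character in UTF-8 as one or more bytes.
            # Step 2: convert each byte to %HH.
            # Step 3: replace the character with the resulting sequence.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical example address, used only for illustration.
print(escape_uri_reference("http://example.org/r\u00e9s um\u00e9"))
```

Note that "%" itself is not in the list, so a string that already
contains %HH escapes passes through unchanged.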

> break many utilities which have made the assumption that RDF identifiers

Which utilities?

> are ASCII strings with no spaces, etc.

I suppose spaces could safely be converted to %20, as they are

> I can understand presenting strings this way for user-display and
> user-entry but storing them this way and making them the official
> encoding seems to be going too far. I would think that simply using
> UTF-8 %-encoding would be fine for these purposes.

How do you propose to display these strings in a meaningful manner?
%HH encoding is not invertible, except in the case of ASCII characters.
This is because the character encoding is not, in general, known.
RFC 2396 says[2]:

|  In the simplest case, the original character sequence contains only
|  characters that are defined in US-ASCII, and the two levels of
|  mapping are simple and easily invertible: each 'original character'
|  is represented as the octet for the US-ASCII code for it, which is,
|  in turn, represented as either the US-ASCII character, or else the
|  "%" escape sequence for that octet.
|  For original character sequences that contain non-ASCII characters,
|  however, the situation is more difficult. Internet protocols that
|  transmit octet sequences intended to represent character sequences
|  are expected to provide some way of identifying the charset used, if
|  there might be more than one [RFC2277].  However, there is currently
|  no provision within the generic URI syntax to accomplish this
|  identification. An individual URI scheme may require a single
|  charset, define a default charset, or provide a way to indicate the
|  charset used.
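
A small Python sketch of the problem, assuming nothing beyond the
standard library (the string "caf%C3%A9" is my own example).  The same
escaped octets yield different characters depending on which charset the
reader guesses:

```python
from urllib.parse import unquote_to_bytes

# A %HH-escaped string whose charset is not stated anywhere in the URI.
escaped = "caf%C3%A9"

# Undoing the %HH layer recovers octets only, not characters.
raw = unquote_to_bytes(escaped)

# The octets are valid in more than one charset, with different results.
as_utf8 = raw.decode("utf-8")         # one character after "caf"
as_latin1 = raw.decode("iso-8859-1")  # two characters after "caf"

print(repr(as_utf8), repr(as_latin1))
```

Without out-of-band knowledge of the charset, a display tool cannot tell
which of the two readings was intended.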

> What does the TAG think about changing the standard Web identifier from
> URIs to IRIs, essentially allowing arbitrary Unicode characters into the
> body of these identifiers. An example from the RDF test cases shows an
> HTTP URI with embedded accented characters in Unicode.
> I'm considering appealing this decision, but I wanted to hear the TAG's
> position first,
> Thanks,
> --
> Aaron Swartz


Misha Wolf
I18N WG Chair

