Re: [charmodReview-17] replacing all URIs with IRIs from Misha.Wolf@reuters.com on 2002-05-24 (www-tag@w3.org from May 2002)

From: <Misha.Wolf@reuters.com>
Date: Fri, 24 May 2002 23:04:43 +0100
To: Aaron Swartz <me@aaronsw.com>
Cc: www-tag@w3.org
Message-ID: <T5b1123dd2bc407b7078d8@reuters.com>

On 24/05/2002 20:58:57 Aaron Swartz wrote:
> I would like to draw the TAG's attention to this requirement in charmod:
>
> """
> W3C specifications that define protocol or format elements (e.g. HTTP
> headers, XML attributes, etc.) which are to be interpreted as URI
> references (or specific subsets of URI references, such as absolute URI
> references, URIs, etc.) SHOULD use Internationalized Resource
> Identifiers (IRI) [I-D IRI] (or an appropriate subset thereof).
> """
>   - http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-URIs
>
> RDF, for example, has recently moved to replace URIs with IRIs (or
> something like them). I find this seriously problematic since it will

Are you aware that a number of W3C specifications already support IRIs,
though not under that name?  Examples are XML, XML Schema, XPointer and
XLink.  For example, the XML specification states[1]:

|  System identifiers (and other XML strings meant to be used as URI
|  references) may contain characters that, according to [IETF RFC 2396]
|  and [IETF RFC 2732], must be escaped before a URI can be used to
|  retrieve the referenced resource. The characters to be escaped are the
|  control characters #x0 to #x1F and #x7F (most of which cannot appear in
|  XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the
|  unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and
|  '`' #x60, as well as all characters above #x7F. Since escaping is not
|  always a fully reversible process, it must be performed only when
|  absolutely necessary and as late as possible in a processing chain.

The last sentence above is very important.  Note both the "only when
absolutely necessary" and the "as late as possible in a processing
chain".

The XML specification continues:

|  In particular, neither the process of converting a relative URI to an
|  absolute one nor the process of passing a URI reference to a process or
|  software component responsible for dereferencing it should trigger
|  escaping. When escaping does occur, it must be performed as follows:
|
|  1. Each disallowed character to be escaped is represented in UTF-8
|     [IETF RFC 2279] as one or more bytes.
|
|  2. The resulting bytes are escaped with the URI escaping mechanism
|     (that is, converted to %HH, where HH is the hexadecimal notation of
|     the byte value).
|
|  3. The original character is replaced by the resulting character sequence.
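
The three steps quoted above can be sketched roughly as follows. This is
only an illustration under my own assumptions: the function name
escape_uri_reference and the DISALLOWED_ASCII set are hypothetical,
built from the character list in the quoted text, and are not taken from
any specification or library.

```python
# Hypothetical sketch of the escaping steps in the quoted XML erratum.
# ASCII characters the quoted text lists as needing escape:
# control characters, space, delimiters and "unwise" characters.
DISALLOWED_ASCII = (
    set(range(0x00, 0x20))
    | {0x7F, 0x20}
    | {ord(c) for c in '<>"{}|\\^`'}
)


def escape_uri_reference(text: str) -> str:
    """Escape only the disallowed characters, leaving the rest untouched."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0x7F or cp in DISALLOWED_ASCII:
            # Step 1: represent the character in UTF-8 as one or more bytes.
            # Step 2: convert each byte to %HH.
            # Step 3: the result replaces the original character.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


escaped = escape_uri_reference("http://example.org/caf\u00e9 menu")
```

Note that '%' is not in the list of characters to be escaped, so a string
that already contains "%20" passes through unchanged. That is one reason
the result cannot always be mapped back to the original string.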

> break many utilities which have made the assumption that RDF identifiers

Which utilities?

> are ASCII strings with no spaces, etc.

I suppose spaces could safely be converted to %20, as that conversion
is invertible.

> I can understand presenting strings this way for user-display and
> user-entry but storing them this way and making them the official
> encoding seems to be going too far. I would think that simply using
> UTF-8 %-encoding would be fine for these purposes.

How do you propose to display these strings in a meaningful manner?
%HH encoding is not invertible, except in the case of ASCII characters.
This is because the character encoding that was used to produce the
escaped octets is not, in general, known.
RFC 2396 says[2]:

|  In the simplest case, the original character sequence contains only
|  characters that are defined in US-ASCII, and the two levels of
|  mapping are simple and easily invertible: each 'original character'
|  is represented as the octet for the US-ASCII code for it, which is,
|  in turn, represented as either the US-ASCII character, or else the
|  "%" escape sequence for that octet.
|
|  For original character sequences that contain non-ASCII characters,
|  however, the situation is more difficult. Internet protocols that
|  transmit octet sequences intended to represent character sequences
|  are expected to provide some way of identifying the charset used, if
|  there might be more than one [RFC2277].  However, there is currently
|  no provision within the generic URI syntax to accomplish this
|  identification. An individual URI scheme may require a single
|  charset, define a default charset, or provide a way to indicate the
|  charset used.
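
To illustrate the point, here is a small sketch using Python's standard
urllib.parse module. The helper name candidate_decodings and the sample
string "caf%E9" are my own hypothetical choices. The same escaped octet
yields different characters depending on which charset the reader
assumes, whereas the ASCII space round-trips safely.

```python
from urllib.parse import quote, unquote, unquote_to_bytes


def candidate_decodings(escaped, charsets):
    """Decode the %HH octets under each assumed charset (hypothetical helper)."""
    raw = unquote_to_bytes(escaped)
    results = {}
    for charset in charsets:
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            # The octets are not valid in this charset at all.
            results[charset] = None
    return results


# The URI itself does not say which charset produced the octet 0xE9.
readings = candidate_decodings("caf%E9", ["iso-8859-1", "iso-8859-5", "utf-8"])

# An ASCII character such as space survives the round trip.
space_round_trip = unquote(quote("a b"))
```

Here the ISO-8859-1 and ISO-8859-5 readings are different strings, and
the UTF-8 reading fails outright, so a display tool cannot recover the
original text from the escaped form alone.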

> What does the TAG think about changing the standard Web identifier from
> URIs to IRIs, essentially allowing arbitrary Unicode characters into the
> body of these identifiers? An example from the RDF test cases shows an
> HTTP URI with embedded accented characters in Unicode.
>
> I'm considering appealing this decision, but I wanted to hear the TAG's
> position first,
>
> Thanks,
> --
> Aaron Swartz [http://www.aaronsw.com/]

[1] http://www.w3.org/XML/xml-V10-2e-errata#E26
[2] http://www.ietf.org/rfc/rfc2396.txt

Misha Wolf
I18N WG Chair

Received on Friday, 24 May 2002 18:06:28 UTC