IURIs, URIs, CHARMOD from Jeremy Carroll on 2001-09-24 (www-i18n-comments@w3.org from September 2001)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Mon, 24 Sep 2001 10:51:47 +0100
To: www-i18n-comments@w3.org
Message-ID: <3BAF0233.D44FD91A@hplb.hpl.hp.com>
I have been reading charmod to try and understand what the RDF core
working 
group should do vis-a-vis internationalization.

I am multiply posting this to uri@w3.org and www-i18n-comments@w3.org
(It is unclear to me which is the more appropriate forum).

I had some comments on the URI encoding.

These are specifically motivated by the problem of international URIs in
XML Namespace declarations, and I think, give a workable proposal for
how
to resolve the need to escape the forbidden US-ASCII characters in some
contexts but not in others, as well as providing a backwardly compatible
migration path from URI to IRI.

I first give my sense of the problem domain, than my proposal along with 
two algorithms, then some comment.

All refs to [RFC 2396] include the extension given by [RFC 2732] 

The CHARMOD ref is the last call working draft from early this year.
http://www.w3.org/TR/2001/WD-charmod-20010126

==============

Problem statement.

(A) When internationalizing URIs, in some contexts it is preferred to 
    continue to exclude the excluded charcaters of RFC 2396 section
2.4.3,
    in other contexts (e.g. XML attributes), other escaping mechanism
    make this unnecessary.

(B) In many contexts there is a significant legacy problem, e.g. XML
    Namespace declarations. Here the defining specifications require the
    use of URIs from RFC 2396. In practice, applications copy strings
    without processing them, and hence work just as well with any other
    (internationalized) URI specification. But if a document contains
    an arbitrary string how should it be encoded. Should a URI-escaping 
    algorithm be applied or not.

Insight:
   Actually URI escaping is not problematic if we never escape '%'.
   In this case we can apply URI-escaping to an already escaped, or
   to a partially escaped, URI and get the right answer. Thus the 
   only requirement on URI authors is that they escape literal '%' 
   as "%25"; they may escape any other characters (although it is
   unwise to escape the US-ASCII unreserved characters).

==================

Proposal:
   An IRI is any string in any encoding such that:
   + every '%' is followed by two hexadecimal characters '0' - '9' 
     and 'a' - 'f' and 'A' - 'F'
   + applying the IRI encoding algorithm below creates an RFC 2396 
     URI.
   
      NOTE: representing a literal '%' character in an IRI is 
      usually done using the string '%25'. All other characters
      can usually be represented as themselves. 


IRI encoding algorithm (slightly modified from 
draft-masinter-url-i18n-05.txt)
 
1) Represent the IURI characters as a sequence of ISO 10646
   characters.
2) If the original encoding was not UCS-based, normalize the character 
   sequence according to Normalization Form C as defined in [UNI15] 
   and [IETFNorm].
3) For each character that is syntactically not allowed by the
   generic URI syntax (all non-ASCII characters, plus the excluded
   characters in [RFC 2396, Section 2.4.3] except "#" and "%"
   (and "[" and "]"), apply the following:
   3.1) Convert the character to a sequence of one or more bytes
        using UTF-8 [RFC 2279].
   3.2) Escape each of the bytes in the sequence with the URI
        escaping mechanism [RFC 2396, Section 2.4.1] (i.e. convert
        each byte to %HH, where HH is the hexadecimal notation of
        the byte value using upper case 'A' - 'F' and not lower case
        'a' - 'f').
   3.3) Replace the original non-allowed character by the resulting
        character sequence.
4) For each '%':
   - if it is followed by two characters from the set:
     { '0' - '9', 'a' - 'f', 'A' - 'F' }
     replace by the upper case variants of the two characters.
   - otherwise flag an error.


IRI decoding algorithm (for use for human display - not machine 
processing)

1) Leave each sequence "%25" unchanged.
2) For every other '%' replace by the byte value indicated by
   the following two hexadecimal digits.
3) Leave every other character unchanged.

The output MAY be a UTF-8 encoded string.

=============================

Observations
------------

By clarifying the special role of '%' it is clear that the escaping
algorithm (which I believe is the usual one) is idempotent. That is,
we do not need to know that it has not been applied before, because
escaping an already escaped IRI has no effect.

It is useful (but not necessary) to be able to reverse such an encoding,
the IRI decoding algorithm does this. Notice the special treatment of
"%31" the encoding of '%'.

A server that needs to fetch a URL (for example) needs to use the normal
RFC2396 decoding algorithm in which '%31' is decoded as '% and the
character
encoding is defined by the server and is not in general UTF-8.

The 2nd step in the encoding algorithm clarifies the *early*
normalization
of UCS-based strings, in accordance with CHARMOD.

URIs that are encoded in a non-UTF-8 encoding can be used as IRI's under
this proposal; however the IRI decoding algorithm will work incorrectly.

Partially encoded IRIs (e.g. such as those encoding the unwise
characters of
RFC 2396) are IRIs under this proposal. Encoding the partially encoded
IRI
has the desired effect of creating a fully encoded IRI, that is
identical
to that reached by fully encoding the completed unencoded IRI.

Any URI (potentially with '%HH' sequences) is an IRI under this
proposal, 
and IRI-encoding it leaves it unchanged (except capitilizing any "%hh" 
pairs).

The clarification of step 3.2 in the algorithm to use upper case hex
digits
is intended to help in contexts such as RDF in which the binary
comparison
of (encoded) URIs is intended as the test for URI equality.

For URIs intended for servers using non-UTF8 encoding, the IRIs may
encode
% in a way other than %25, and may require encoding of many more
characters.
All such encodings will be compatible with RFC2396, and can then be 
IRI-encoded as in this proposal without changing them.



Jeremy Carroll
HP Labs
RDF Core Working Group (this message has NOT been discussed by this WG).
Received on Monday, 24 September 2001 05:47:28 UTC