Re: I18N (was: Closing rdfms-difference-between-ID-and-about) from Jeremy Carroll on 2001-10-17 (w3c-rdfcore-wg@w3.org from October 2001)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Wed, 17 Oct 2001 18:07:31 +0100
To: w3c-rdfcore-wg@w3.org
Message-ID: <3BCDBAD3.D95E1CD6@hplb.hpl.hp.com>
Jeremy:
> >I owe the group some work on internationalization, but currently am of
> >the opinion that we should allow internationalized URIs wherever we
> >allow uri-references; these being resolved into US-ASCII URIs a la RFC
> >2396 (as modified by RFC 2732) at the same time as relative URIs are
> >resolved, using the standard algorithm.

Graham:
> 
> I've been in discussions with I18N group about URIs in CC/PP (which is an
> application of RDF).
> 
> Basically, their position (as I understand it) is that URIs in an XML
> document should be regarded as an "original character sequence" rather than
> a "URI character sequence" (see RFC 2396, section 2.1).  Then, when a URI
> is dereferenced, or otherwise required in "URI character sequence" form,
> the appropriate transformation to an octet sequence is performed (dependent
> on the code point set used for the XML document), and then URI escaping
> (%hh) is applied to yield a "URI character sequence".  If the XML document
> uses Unicode characters, then the required octet encoding would be UTF-8,
> which provides an unambiguous interpretation for URIs in XML.  If other
> character sets are being used, then the interpretation is subject to
> application interpretation, but I presume that use of non-Unicode codepoint
> sets is generally discouraged for new data.

Internally, XML documents are in Unicode: even if their serialization is
in some other charset, the text has been converted to Unicode before we
get to worrying about URIs and IURIs. In practice, I understood the
position to be that IURIs work with UTF-8 as the encoding. If you have an
IURI which is not UTF-8 encoded then you still have to do the %HH
encoding by hand. (This happens in particular with URLs.)
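
To illustrate why the encoding matters, here is a small Python sketch
(my own example, not taken from any spec) that escapes the single
character U+00E9 twice, once via UTF-8 and once via Latin-1. It uses
urllib.parse.quote from the Python standard library; the variable names
are just for illustration.

```python
from urllib.parse import quote

ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# What the UTF-8 based IURI convention would produce.
utf8_form = quote(ch, safe="", encoding="utf-8")

# What a server expecting Latin-1 octets would need instead;
# this form has to be written out by hand in the document.
latin1_form = quote(ch, safe="", encoding="latin-1")

print(utf8_form, latin1_form)  # %C3%A9 %E9
```

The two escaped forms differ, so a URL whose server expects the second
form cannot be written as a raw non-ASCII character under the UTF-8
convention.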

> 
> There is some language in the XML linking spec
> (http://www.w3.org/TR/xlink/#link-locators) that I am planning to adapt for
> the CC/PP spec:

I am cautious about this text ...

> [[[
> The value of the href attribute must
> be a URI reference as defined in [IETF RFC 2396], or must result in
> a URI reference after the escaping procedure described below is applied. The
> procedure is applied when passing the URI reference to a URI resolver.
> 
> Some characters are disallowed in URI references, even if they are allowed
> in XML; the disallowed characters include all non-ASCII characters, plus the
> excluded characters listed in Section 2.4 of [IETF RFC 2396], except
> for the number sign (#) and percent sign (%) and the square bracket characters
> re-allowed in [IETF RFC 2732]. Disallowed characters must
> be escaped as follows:
> 
> - Each disallowed character is converted to UTF-8 [IETF RFC 2279]
> as one or more bytes.
> 
> - Any bytes corresponding to a disallowed character are escaped with
> the URI escaping mechanism (that is, converted to %HH,
> where HH is the hexadecimal notation of the byte value).
> 
> - The original character is replaced by the resulting character sequence.
> ]]]
> 


While that is essentially *the* algorithm, I suggest highlighting the
following points.

*Any* string for which a corresponding URI ref is needed is subject to
the URI reference escaping procedure.

All the disallowed characters are escaped except for the number sign (#)
and percent sign (%) and the square bracket characters re-allowed in
[IETF RFC 2732].

*Every* % must be followed by two hexadecimal digits which are
*normalized to upper case* in the escaping procedure. A string with a %
that is not followed by two hexadecimal digits is not a valid URI and
cannot be converted into one.

> - Each disallowed character is converted to UTF-8 [IETF RFC 2279]
> as one or more bytes.
> 
> - Any bytes corresponding to a disallowed character are escaped with
> the URI escaping mechanism (that is, converted to %HH,
> where HH is the hexadecimal notation of the byte value;
*the uppercase* hexadecimal digits are used).
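
As a rough illustration only, the amended procedure might be sketched in
Python along the following lines. The function name escape_uri_ref and
the constant EXCLUDED_ASCII are my own (hypothetical) names; the set of
excluded ASCII characters is my reading of RFC 2396 section 2.4.3 with
'#', '%', '[' and ']' taken back out.

```python
import string

# ASCII characters excluded by RFC 2396 section 2.4.3 (control, space,
# delims, unwise), minus '#', '%', '[' and ']' which stay untouched.
EXCLUDED_ASCII = (
    set('<>"{}|\\^` ') | {chr(c) for c in range(0x20)} | {"\x7f"}
)


def escape_uri_ref(text: str) -> str:
    """Escape disallowed characters; normalize existing %hh to upper case."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "%":
            # Every % must be followed by exactly two hexadecimal digits.
            pair = text[i + 1:i + 3]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError(f"'%' at offset {i} is not followed by two hex digits")
            out.append("%" + pair.upper())
            i += 3
        elif ord(ch) > 0x7F or ch in EXCLUDED_ASCII:
            # Convert to UTF-8 and escape each byte with uppercase hex.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(escape_uri_ref("http://example.org/caf\u00e9"))
print(escape_uri_ref("x%e9#frag"))
```

The first call escapes the non-ASCII character via UTF-8; the second
leaves the number sign alone and only upper-cases the existing escape.
A string such as "100%" raises ValueError in this sketch, since it
cannot be converted into a URI reference.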


The differences I highlighted are:
+ clarity that all URI input is passed through the escaping algorithm
[it leaves US-ASCII URIs unchanged, so it is harmless];
+ clarity that uppercase hexadecimal digits are used; this makes it easier
for RDF, where we want to be able to have (US-ASCII) URIs in the model
and to compare them byte for byte;
+ clarity that the responsibility for uppercase hex lies with the RDF
processor, not the input document;
+ clarity about what happens with %.
The algorithm is idempotent (i.e. you can apply it twice or three times
and it's the same as applying it once). This means that if a document
author, for whatever reason, chooses to use traditional URIs, everything
works fine and will interoperate with another author using IURIs.
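
A quick way to check the idempotence and interoperability claims is the
sketch below. It is an approximation built on Python's
urllib.parse.quote rather than a full implementation: the names SAFE and
escape are mine, and unlike the procedure described above it does not
reject a % that is not followed by two hexadecimal digits.

```python
import re
from urllib.parse import quote

# Assumption: every ASCII punctuation character that the procedure leaves
# alone is listed here, so quote() escapes only the disallowed ones.
SAFE = "!#$%&'()*+,/:;=?@[]~"


def escape(text: str) -> str:
    # quote() converts disallowed characters to UTF-8 and emits uppercase %HH.
    escaped = quote(text, safe=SAFE)
    # Existing escapes pass through quote() untouched; upper-case them here.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), escaped)


iuri = "http://example.org/caf\u00e9 au lait#r\u00e9sum\u00e9"
traditional = "http://example.org/caf%c3%a9%20au%20lait#r%c3%a9sum%c3%a9"

once = escape(iuri)
twice = escape(once)

print(once)
print(once == twice)                # idempotent
print(escape(traditional) == once)  # both authors end up with the same URI
```

Both comparisons come out true: the IURI form and the hand-escaped
traditional form end up as the same US-ASCII string, which can then be
compared directly in the model.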



===

Furthermore, I think this goes in the RDF/XML syntax WD; as far as
the model goes, a URI is an RFC 2396/2732 URI. The syntax WD should
specify early application of this algorithm, for instance before
aboutEach processing.

Jeremy
Received on Wednesday, 17 October 2001 13:02:59 UTC