I18N (was: Closing rdfms-difference-between-ID-and-about) from Graham Klyne on 2001-10-17 (w3c-rdfcore-wg@w3.org from October 2001)

From: Graham Klyne <Graham.Klyne@MIMEsweeper.com>
Date: Wed, 17 Oct 2001 10:21:12 +0100
To: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <5.1.0.14.2.20011017100127.043fd0c0@joy.songbird.com>

At 09:35 AM 10/17/01 +0100, Jeremy Carroll wrote:
[...]
>Dave's text is currently neutral vis-a-vis internationalized URI's.
>Aaron's text takes a URI-ref to be precisely a US-ASCII URI ala RFC
>2396.
>
>I owe the group some work on internationalization, but currently am of
>the opinion that we should allow internationalized URIs wherever we
>allow uri-references; these being resolved into US-ASCII URIs ala RFC
>2396 (as modified by RFC 2732) at the same time as relative URI's are
>resolved, using the standard algorithm.

I've been in discussions with the I18N group about URIs in CC/PP (which is an 
application of RDF).

Basically, their position (as I understand it) is that URIs in an XML 
document should be regarded as an "original character sequence" rather than 
a "URI character sequence" (see RFC 2396, section 2.1).  Then, when a URI 
is dereferenced, or otherwise required in "URI character sequence" form, 
the appropriate transformation to an octet sequence is performed (dependent 
on the code point set used for the XML document), and then URI escaping 
(%hh) is applied to yield a "URI character sequence".  If the XML document 
uses Unicode characters, then the required octet encoding would be UTF-8, 
which provides an unambiguous interpretation for URIs in XML.  If other 
character sets are being used, then the result is application-dependent, 
but I presume that use of non-Unicode code point sets is generally 
discouraged for new data.
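
To illustrate why the choice of octet encoding matters, here is a minimal 
Python sketch of the two-step transformation described above (characters to 
octets, then %hh escaping).  The function name, the example.org URI and the 
set of characters left unescaped are my own assumptions for illustration, 
not anything taken from a spec:

```python
from urllib.parse import quote

# Characters left alone in this sketch: the RFC 2396 reserved set plus
# '#', '%' and the square brackets.  This choice is an assumption made
# for illustration only.
LEFT_UNESCAPED = ":/?#[]@!$&'()*+,;=%"

def to_uri_character_sequence(original, encoding="utf-8"):
    """Turn an 'original character sequence' into a 'URI character sequence'."""
    # Step 1: characters -> octets, using the chosen encoding.
    octets = original.encode(encoding)
    # Step 2: octets -> %hh escapes for anything outside the unescaped set.
    return quote(octets, safe=LEFT_UNESCAPED)

original = "http://example.org/caf\u00e9"
print(to_uri_character_sequence(original))             # UTF-8 octets
print(to_uri_character_sequence(original, "latin-1"))  # ISO 8859-1 octets
```

The same original character sequence yields two different URI character 
sequences depending on the encoding chosen, which is the ambiguity that 
fixing on UTF-8 avoids.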

There is some language in the XML linking spec 
(http://www.w3.org/TR/xlink/#link-locators) that I am planning to adapt for 
the CC/PP spec:
[[[
The value of the href attribute must
be a URI reference as defined in [IETF RFC 2396], or must result in
a URI reference after the escaping procedure described below is applied. The
procedure is applied when passing the URI reference to a URI resolver.

Some characters are disallowed in URI references, even if they are allowed
in XML; the disallowed characters include all non-ASCII characters, plus the
excluded characters listed in Section 2.4 of [IETF RFC 2396], except
for the number sign (#) and percent sign (%) and the square bracket characters
re-allowed in [IETF RFC 2732]. Disallowed characters must
be escaped as follows:

- Each disallowed character is converted to UTF-8 [IETF RFC 2279]
as one or more bytes.

- Any bytes corresponding to a disallowed character are escaped with
the URI escaping mechanism (that is, converted to %HH,
where HH is the hexadecimal notation of the byte value).

- The original character is replaced by the resulting character sequence.
]]]
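
As a rough sketch of the three steps quoted above (not a definitive 
implementation), the escaping could look like this in Python.  The function 
names and the example URI are hypothetical, and the list of disallowed ASCII 
characters is my reading of the excluded set in RFC 2396 with '#', '%', '[' 
and ']' removed:

```python
# ASCII characters treated as disallowed in this sketch: space and the
# "delims" and "unwise" characters of RFC 2396, minus '#', '%', '[' and ']'.
# Control characters and non-ASCII characters are handled by code point below.
EXCLUDED_ASCII = set(' <>"{}|\\^`')

def is_disallowed(ch):
    code = ord(ch)
    # Controls (0x00-0x1F), DEL (0x7F) and everything non-ASCII (> 0x7F).
    return code < 0x20 or code >= 0x7F or ch in EXCLUDED_ASCII

def escape_uri_reference(value):
    """Apply the quoted escaping procedure to an attribute value."""
    out = []
    for ch in value:
        if is_disallowed(ch):
            # Convert the character to UTF-8 and escape each byte as %HH.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            # Allowed characters, including existing %HH escapes, pass through.
            out.append(ch)
    return "".join(out)

print(escape_uri_reference("http://example.org/r\u00e9sum\u00e9 1#frag"))
```

Because '%' is left alone, a value that is already a valid URI reference 
passes through unchanged, so applying the procedure twice gives the same 
result as applying it once.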

I contend that this approach is reasonable, but it is not currently 
documented in any W3C Recommendation in a way that suggests it applies to 
any URI in an XML document.  Notwithstanding, I predict that I18N will 
strongly request that we adopt this approach (at least for rdf:about and 
rdf:resource).

[Later:  I note that the XML Schema anyURI datatype 
(http://www.w3.org/TR/xmlschema-2/#anyURI) refers to the XML Linking 
language quoted above.  If we said that the attribute values of rdf:about 
and rdf:resource were 'anyURI' per XML Schema datatypes, I think the rest 
would follow.]

#g

------------------------------------------------------------
Graham Klyne                    MIMEsweeper Group
Strategic Research              <http://www.mimesweeper.com>
<Graham.Klyne@MIMEsweeper.com>
------------------------------------------------------------

Received on Wednesday, 17 October 2001 10:41:27 UTC