Re: I18N (was: Closing rdfms-difference-between-ID-and-about) from Martin Duerst on 2001-10-18 (w3c-rdfcore-wg@w3.org from October 2001)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 18 Oct 2001 15:32:35 +0900
To: Graham Klyne <Graham.Klyne@MIMEsweeper.com>
Cc: w3c-rdfcore-wg@w3.org, w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20011018124908.03505370@localhost>
At 10:32 01/10/17 +0100, Graham Klyne wrote:

>At 09:35 AM 10/17/01 +0100, Jeremy Carroll wrote:
>[...]
>>Dave's text is currently neutral vis-a-vis internationalized URI's.
>>Aaron's text takes a URI-ref to be precisely a US-ASCII URI ala RFC
>>2396.
>>
>>I owe the group some work on internationalization, but currently am of
>>the opinion that we should allow internationalized URIs wherever we
>>allow uri-references; these being resolved into US-ASCII URIs ala RFC
>>2396 (as modified by RFC 2732) at the same time as relative URI's are
>>resolved, using the standard algorithm.
>
>I've been in discussions with I18N group about URIs in CC/PP (which is an 
>application of RDF).
>
>Basically, their position (as I understand it) is that URIs in an XML 
>document should be regarded as an "original character sequence" rather 
>than a "URI character sequence" (see RFC 2396, section 2.1).

Well, yes, except that the 'original character sequence' as
in RFC 2396, section 2.1, is not exactly the same as what
we are discussing here. [To give you the details, which you
probably don't want :-), for a file with a filename encoded in
iso-8859-1 and served with a server that doesn't do any
transcoding, the characters in the iso-8859-1 encoding would
be the 'original character sequence' according to RFC 2396,
but it would have to be escaped to be put in XML, RDF,
CC/PP,...]
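
A minimal sketch of the bracketed example, in Python. The file name is
made up for illustration, and `urllib.parse.quote` stands in for
whatever escaping step a real tool would use. It shows only that the
same characters give different %HH escapes depending on which byte
encoding is taken as the starting point:

```python
from urllib.parse import quote

# Hypothetical file name containing two non-ASCII characters (e-acute).
filename = "r\u00e9sum\u00e9.html"

# Escapes based on the bytes a non-transcoding iso-8859-1 server would hold.
legacy_escaped = quote(filename, encoding="iso-8859-1")

# Escapes based on the UTF-8 bytes of the same characters.
utf8_escaped = quote(filename, encoding="utf-8")

print(legacy_escaped)
print(utf8_escaped)
```

The two results differ, which is why the legacy form has to appear
already escaped when it is written into XML, RDF or CC/PP.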


>Then, when a URI is dereferenced, or otherwise required in "URI character 
>sequence" form, the appropriate transformation to an octet sequence is 
>performed (dependent on the code point set used for the XML document), and 
>then URI escaping (%hh) is applied to yield a "URI character 
>sequence".  If the XML document uses Unicode characters, then the required 
>octet encoding would be UTF-8, which provides an unambiguous 
>interpretation for URIs in XML.  If other code-point sets are being used, 
>then the interpretation is subject to application interpretation, but I 
>presume that use of non-Unicode codepoint sets is generally discouraged 
>for new data.

Some corrections:

- Each and every XML document is (defined as) a sequence of Unicode
   characters. There are no exceptions at all. For details, please
   also see the Infoset (http://www.w3.org/TR/xml-infoset/).

- XML documents are quite often represented as a stream of bytes.
   Various character encodings are used to represent XML documents
   in this case. The encodings are what's indicated with the
   'encoding' pseudo-attribute on the '<?xml' pseudo-PI at the
   start of an XML document. From the point of view of the XML Rec,
   these encodings are not much different from, e.g., applying
   some additional compression or encryption, except that there
   is the 'encoding' pseudo-attribute for 'bootstrapping'.

- Because XML documents are sequences of Unicode characters,
   the conversion from IRIs (Internationalized Resource Identifiers)
   in an XML document to a "URI character sequence" (where the
   escaping is based on byte values) always has to use UTF-8.
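
The three points above can be sketched in Python as follows. The IRI,
the helper name `to_uri_chars` and the choice of "safe" characters are
assumptions for illustration only. The sketch serializes the same
characters in two different encodings, recovers the same Unicode
sequence from both, and then escapes via UTF-8 in both cases:

```python
from urllib.parse import quote

# Hypothetical IRI as it would appear in an XML attribute value.
iri = "http://example.org/caf\u00e9#menu"

# Two byte-stream serializations of the same characters.
as_latin1 = iri.encode("iso-8859-1")
as_utf16 = iri.encode("utf-16")

# Decoding either stream gives back the same sequence of Unicode characters.
chars_a = as_latin1.decode("iso-8859-1")
chars_b = as_utf16.decode("utf-16")


def to_uri_chars(chars: str) -> str:
    """Escape characters outside the assumed safe set as %HH of their UTF-8 bytes."""
    return quote(chars, safe=":/?#[]@!$&'()*+,;=%", encoding="utf-8")


uri_a = to_uri_chars(chars_a)
uri_b = to_uri_chars(chars_b)
print(uri_a)
```

Both serializations lead to the same escaped result, because the
escaping starts from the characters, not from the bytes of the
document's own encoding.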


>There is some language in the XML linking spec 
>(http://www.w3.org/TR/xlink/#link-locators) that I am planning to adapt 
>for the CC/PP spec:

Yes, this is the clearest language we currently have.


>[[[
>The value of the href attribute must
>be a URI reference as defined in [IETF RFC 2396], or must result in
>a URI reference after the escaping procedure described below is applied. The
>procedure is applied when passing the URI reference to a URI resolver.
>
>Some characters are disallowed in URI references, even if they are allowed
>in XML; the disallowed characters include all non-ASCII characters, plus the
>excluded characters listed in Section 2.4 of [IETF RFC 2396], except
>for the number sign (#) and percent sign (%) and the square bracket characters
>re-allowed in [IETF RFC 2732]. Disallowed characters must
>be escaped as follows:
>
>- Each disallowed character is converted to UTF-8 [IETF RFC 2279]
>as one or more bytes.
>
>- Any bytes corresponding to a disallowed character are escaped with
>the URI escaping mechanism (that is, converted to %HH,
>where HH is the hexadecimal notation of the byte value).
>
>- The original character is replaced by the resulting character sequence.
>]]]
>
>I contend that this approach is reasonable, but not currently documented 
>in any W3C
>Recommendation in such a way that suggests that it applies to any URI in 
>an XML document.

The Character Model says that every W3C spec has to do it.
It's not something that can be mandated e.g. in the XML Rec,
because XML processors don't know about most URIs in an XML doc.
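
For concreteness, the escaping procedure in the quoted XLink text can
be sketched in Python roughly as below. The function name `escape_iri`
and the example values are hypothetical. The set of disallowed ASCII
characters is my reading of RFC 2396, section 2.4.3 (control
characters, space, delims and unwise), minus "#", "%" and the square
brackets:

```python
# ASCII characters excluded by RFC 2396 section 2.4.3, with "#", "%",
# "[" and "]" taken out again as the XLink text describes.
EXCLUDED_ASCII = (
    set('<>"{}|\\^`')
    | {" "}
    | {chr(c) for c in range(0x20)}
    | {chr(0x7F)}
)


def escape_iri(value: str) -> str:
    """Replace each disallowed character by the %HH escapes of its UTF-8 bytes."""
    out = []
    for ch in value:
        if ord(ch) > 0x7F or ch in EXCLUDED_ASCII:
            # Convert the character to UTF-8, then escape every byte.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(escape_iri("http://example.org/r\u00e9sum\u00e9 v2#top"))
```

Existing %HH escapes are left alone, since "%" is not in the disallowed
set, so a value that is already a URI reference passes through
unchanged.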


>Notwithstanding, I predict that this is how I18N will strongly request we 
>adopt this approach (at least for rdf:about and rdf:resource).

Yes, you are right, with the corrections as given above.
Are rdf:about and rdf:resource the only places where
RDF uses URIs?


>[Later: I note that the anyURI XML schema datatype spec references the XML 
>Linking language quoted above -- Martin, when bashing people's heads about 
>this, why not point people at this spec, which is pretty central to 
>XML?  Now that I've discovered this, I'm inclined to simply cite the XML 
>schema anyURI datatype for CC/PP, and add a non-normative NOTE 
>highlighting the implications: any comments?]

Thanks for the hint. I thought I had done so, but I didn't.
I think the main reason was that we were mainly into wording
details, and XML Schema just refers to XLink for that. Anyway,
it sounds like a very good idea.


Regards,   Martin.
Received on Thursday, 18 October 2001 02:32:46 UTC