RE: Outstanding Issues - rdf-charmod-uris from Jeremy Carroll on 2002-02-20 (w3c-rdfcore-wg@w3.org from February 2002)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Wed, 20 Feb 2002 13:46:49 -0000
To: "Brian McBride" <bwm@hplb.hpl.hp.com>, "RDF Core" <w3c-rdfcore-wg@w3.org>
Cc: <w3c-i18n-ig@w3.org>
Message-ID: <JAEBJCLMIFLKLOJGMELDAEBBCDAA.jjc@hplb.hpl.hp.com>

> rdf-charmod-uris: Does the treatment of uris conform to charmod ?

> We need an owner to check this

Again I would prefer not to own this.
However, we did discuss this a bit in the thread:

http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Oct/0330.html

My understanding of where we got to was:

In RDF URIrefs are used as unique labels.

The uniqueness is important, and so an important aspect is being able to
compare two URIs from different sources and say whether they are the same
or not.

M&S specifies a URIref as per RFC 2396. I understand that as being (a subset
of) US-ASCII only, and hence not charmod conformant.

A required change is to permit the full range of Unicode characters in URIs
wherever a URIref is permitted in the RDF/XML grammar.

Such an international URI is subject to a standard algorithm (given in
charmod) to convert it into a US-ASCII URI.
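As a rough sketch of that conversion (my reading of the charmod algorithm:
each non-US-ASCII character is encoded as UTF-8 and each resulting byte is
written as %HH), something like the following would do. The function name
to_ascii_uri and the example.org URI are made up for illustration.

```python
def to_ascii_uri(uri: str) -> str:
    """Convert an international URI to a US-ASCII URI (sketch only).

    Assumption: every character outside US-ASCII is encoded as UTF-8
    and each byte is escaped as %HH with upper-case hex digits.
    US-ASCII characters, including existing % escapes, pass through.
    """
    out = []
    for ch in uri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # One character may become two or more escaped bytes.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


# U+00E9 is C3 A9 in UTF-8.
ascii_form = to_ascii_uri("http://example.org/caf\u00e9")
```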

We need to ensure that:
- a URI given using international characters and the same URI given using
US-ASCII compare equal.


This can be done by one of the following three techniques:
[A] normalizing URIs on input to US-ASCII
[B] normalizing URIs on input to international form
[C] using the URI normalization algorithm as part of the URI compare
algorithm.

[C] looks inefficient and inelegant.
[B] doesn't work, because a US-ASCII URI with % escapes in it does not
specify the charset used for the encoding, whereas the algorithm the other
way assumes UTF-8.
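To illustrate the problem with [B]: the same escaped US-ASCII URI decodes
to different international strings depending on which charset the reader
guesses. This sketch uses Python's urllib.parse.unquote; the URI is a
made-up example.

```python
from urllib.parse import unquote

escaped = "http://example.org/caf%C3%A9"

# Decoding the escaped bytes as UTF-8 gives "caf" + U+00E9.
as_utf8 = unquote(escaped, encoding="utf-8")

# Decoding the same bytes as ISO-8859-1 gives "caf" + U+00C3 U+00A9.
as_latin1 = unquote(escaped, encoding="latin-1")

# Nothing in the escaped form says which reading was intended.
ambiguous = as_utf8 != as_latin1
```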

Thus I believe [A] is the answer.
This normalization should be done in an unambiguous way, and so I favour
specifying that the hexadecimal escape sequences, e.g. "%A3", should not use
a-f but use A-F instead. This allows a binary comparison.
It also means that:
- %hh must be normalized to %HH, where hh is a pair of hexadecimal digits
written with lower-case letters.
- % is not allowed in the international form of a URI except for introducing
hexadecimal escape sequences. A URI that really does contain a % must then
be encoded by the original document author as "%25".
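The two rules above could be sketched as follows. This is only an
illustration of the escape-case rule and the stray-% rule, not a full URI
normalizer; the name normalize_escapes and the example URIs are made up.

```python
import re

# A % followed by exactly two hex digits is a valid escape sequence.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")
# Any other % is a stray one that should have been written as %25.
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def normalize_escapes(uri: str) -> str:
    """Upper-case the hex digits of every %hh escape (sketch only)."""
    if _STRAY_PERCENT.search(uri):
        raise ValueError("'%' must introduce a two-digit hex escape")
    return _ESCAPE.sub(lambda m: "%" + m.group(1).upper(), uri)


# After normalization a plain string (binary) compare is enough.
a = normalize_escapes("http://example.org/caf%c3%a9")
b = normalize_escapes("http://example.org/caf%C3%A9")
same = a == b
```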

I can produce some test cases illustrating this, in which RDF/XML documents
with mixed usage of international and US-ASCII URIs work successfully.

I suggest that N-Triples be restricted to US-ASCII URIs only. N-Triples is
not intended as an end-user document format, so internationalization
concerns do not apply.

Jeremy

Received on Wednesday, 20 February 2002 08:47:05 UTC