RE: Clarification of charmod-uri from Jeremy Carroll on 2002-04-29 (w3c-rdfcore-wg@w3.org from April 2002)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Mon, 29 Apr 2002 21:42:06 +0100
To: "Aaron Swartz" <me@aaronsw.com>, "RDF Core" <w3c-rdfcore-wg@w3.org>
Message-ID: <JAEBJCLMIFLKLOJGMELDAEOACDAA.jjc@hplb.hpl.hp.com>
> -----Original Message-----
> From: Aaron Swartz [mailto:me@aaronsw.com]
> Sent: 28 April 2002 18:08
> To: RDF Core; Jeremy Carroll
> Subject: Clarification of charmod-uri
>
>
> I am not (yet) planning to protest the decision we made about unicode
> strings in URIs, but I would like some clarification.
>
> 1) Is there some reason why these Unicode characters cannot be %encoded? I
> thought someone said something to this effect on the telecon, but I didn't
> catch it. If not, what's the rationale for insisting on a
> backwards-incompatible change, when the (comparatively) backwards-compatible
> %encoding works just as well?

One reason concerns normal form C issues, and how %-encoded URIs get
displayed.

In a system that %-encodes URIs for storage and reasoning it is highly
desirable that they get unencoded for display. As we have seen, there are
multiple ways under Unicode of representing characters such as é. If we
retain these characters within Unicode, it is possible to specify, and
realistically expect implementations of, the normal form C constraint, i.e.
that the Unicode must be in normal form C. This constraint becomes
significantly more difficult to check (i.e. less something that can
realistically be expected of a Unicode library) if the check is "does this
%-encoded URI correspond to a UTF-8 encoding of a Unicode string that is
not in NFC?".

I see a significant risk of URI fraud in an international context if there
is no normal form C constraint. I see this constraint as looking highly
unrealistic if it is imposed on %-escaped URIs.
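
A minimal Python sketch of the difference between the two checks, assuming
the standard unicodedata and urllib.parse modules. The helper names and the
example URIs are illustrative only, not taken from any spec or test case:

```python
import unicodedata
from urllib.parse import unquote


def is_nfc(text: str) -> bool:
    """True if the Unicode string is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def escaped_uri_is_nfc(uri: str) -> bool:
    """Hypothetical helper: undo the %-escapes as UTF-8, then check NFC."""
    return is_nfc(unquote(uri, encoding="utf-8", errors="strict"))


# The same visible name written two ways in Unicode.
composed = "http://example.org/#Andr\u00e9"     # e-acute as a single code point
decomposed = "http://example.org/#Andre\u0301"  # e followed by combining acute

# On a plain Unicode string the check is a single library call.
composed_ok = is_nfc(composed)        # in NFC
decomposed_ok = is_nfc(decomposed)    # not in NFC

# On a %-escaped URI the escapes must first be decoded as UTF-8.
escaped_composed_ok = escaped_uri_is_nfc("http://example.org/#Andr%C3%A9")
escaped_decomposed_ok = escaped_uri_is_nfc("http://example.org/#Andre%CC%81")
```

The second helper needs an extra decoding step, and has to decide what to do
with escapes that are not valid UTF-8, before the normalization check can be
applied at all.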

A further issue is to do with interoperability of lower and upper case %
encoding.

The standard treatment of URIrefs is to do as little processing as
possible. So xml namespaces differ if the uri-ref differs in spelling, not
intent. In particular:

http://example.org/#Andr%c3%a9

and

http://example.org/#Andr%C3%A9

are different as far as XML Namespaces goes.

If we assert that these are both identical to

http://example.org/#André

we need to account for how they are the same under RDF.
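
As a rough Python illustration, assuming the third URI is meant to carry é
(U+00E9) directly and using a hypothetical unescaped() helper built on the
standard urllib.parse module:

```python
from urllib.parse import unquote

lower = "http://example.org/#Andr%c3%a9"
upper = "http://example.org/#Andr%C3%A9"
plain = "http://example.org/#Andr\u00e9"  # assumed: e-acute written directly

# Character-by-character comparison, as XML Namespaces does it:
# no two of the three spellings are equal.
all_distinct = len({lower, upper, plain}) == 3


def unescaped(uri: str) -> str:
    """Hypothetical helper: decode %-escapes as UTF-8 before comparing."""
    return unquote(uri, encoding="utf-8")


# Only after an explicit unescaping step do the three coincide.
same_after_unescaping = unescaped(lower) == unescaped(upper) == plain
```

Any claim that the three are the same resource therefore has to name some
such processing step; plain string comparison keeps them apart.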

---

A less significant reason is to make clear that preserving the original
input characters is mandatory (these are the most useful way to display the
URI on output).


>
> 2) Am I correct in saying that this means that RDF will no longer be using
> URI-refs to identify Resources? Is this consistent with our charter?


Misha argues that RDF M&S has already had its meaning "clarified" in this
way by erratum 26 against XML second edition. (I would confess to having a
sense of a non-backwardly compatible clarification, rather like the
unqualified attributes issue!).

See:
http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Mar/0012.html




>
> 3) Does this mean we will allow other characters like %20 (space) into our
> URIs that have traditionally been %encoded? If not, how do we decide what's
> allowed and what isn't?

The approved test case
http://www.w3.org/2000/10/rdf-tests/rdfcore/rdf-charmod-uris/test002.rdf

shows that % escaped URIs are still legal.
Do you think a %20 case would also be helpful?

test003 and test004 (see the manifest) divorce the meaning of the %
escaped URI from the non-% escaped URI, at least for the RDF MT. I have
suggested that this should be seen like an alternative spelling of what is
operationally the same URI (e.g. http: vs HTTP: ).


>
> 4) Which spec is going to describe these new identifiers? The IRI spec[1]
> seems to have fifteen rather complex but relevant pages on them. Can we
> afford the extra time it may take to integrate these and review them?

I am against referring to the IRI spec (it is a draft and does not yet have
a consensus around it).

The best texts I have seen are:
 http://www.w3.org/TR/xlink/#link-locators
and
 http://www.w3.org/XML/xml-V10-2e-errata#E26

Either of these is short enough to be copied and edited.

>
> All the best,
> --
> [ "Aaron Swartz" ; <mailto:me@aaronsw.com> ; <http://www.aaronsw.com/> ]

I feel some sense of failure at having arrived at such a singular lack of
consensus on this issue. I do agree with the sense at the telecon that it
was better to make the decision now, and see how much support or dissent it
generates in the wider community; but regret that we have not had a fuller
debate in telecon and e-mail. I would particularly like to hear from Jos
and Brian as to why they voted against.

For instance, while I have surfaced the uri fraud issue before, I don't
think I have discussed the lowercase/uppercase % encoding issue. Both of
these have helped form my opinion that the resolution we came to was a good
one.

I also note that I am influenced by a sense that it is inappropriate for a
historic limitation of the US phone system (that the eighth bit used to be
dirty) to limit the functionality available to web users around the
world. If this has been significant in our voting then perhaps that could
raise charter issues.

Jeremy
Received on Monday, 29 April 2002 16:42:55 UTC