Re: URI terminology demystified (I18N details) from Dan Connolly on 2001-09-20 (w3c-rdfcore-wg@w3.org from September 2001)

From: Dan Connolly <connolly@w3.org>
Date: Thu, 20 Sep 2001 09:11:46 -0500
To: Jeremy Carroll <jjc@hplb.hpl.hp.com>
CC: w3c-rdfcore-wg@w3.org
Message-ID: <3BA9F922.A2319138@w3.org>
Jeremy Carroll wrote:
> 
> Hmmm, I was just examining the XML specs concerning system identifiers
> ....
> 
> See:
> 
> http://www.w3.org/XML/xml-V10-2e-errata#E4
> 
> Your quote from the old RDF spec:
> 
> Dan Connolly wrote:
> >
> >   Note: Although non-ASCII characters in URIs are not allowed by [URI],
> >   [XML] specifies a convention to avoid unnecessary incompatibilities in
> >   extended URI syntax. Implementors of RDF are encouraged to avoid further
> >   incompatibility and use the XML convention for system identifiers.
> >   Namely, that a non-ASCII character in a URI be represented in UTF-8 as
> >   one or more bytes, and then these bytes be escaped with the URI escaping
> >   mechanism (i.e., by converting each byte to %HH, where HH is the
> >   hexadecimal notation of the byte value).
> >
> 
> This seems to be a misinterpretation of the XML spec, which the erratum
> clarifies.

Strictly speaking, it's not; system identifiers only occur
in things like <!ENTITY ...> declarations. The value of
an rdf:resource attribute isn't a system identifier (unless
we change RDF 1.0 to say that it is for some reason).


> We should, IMO, hence go along with the clarification, and the RDF/XML
> processor is responsible for escaping non-permitted characters in
> URI-refs.

It's not XML 1.0 that compels us to go with the
Unicode->URI escaping in resource/about/ID,
but the history of HTML 4.0 href, the text from RDF 1.0
excerpted above, the precedent of the XLink REC (xlink:href),
and the recent opinion of the I18N WG expressed
in the charmod spec.
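
A minimal sketch of the escaping convention described in the RDF 1.0
note quoted above, in Python. The function name is made up for
illustration, and the sketch covers only the non-ASCII case from that
note: it leaves every ASCII character alone, including ones that are
not legal in a URI (such as a space), so it is not a complete
URI-reference escaper.

```python
def escape_non_ascii(uri_ref: str) -> str:
    """Encode each non-ASCII character as UTF-8, then write each
    resulting byte as %HH. ASCII characters pass through unchanged.

    Hypothetical helper for illustration only.
    """
    out = []
    for ch in uri_ref:
        if ord(ch) < 128:
            # ASCII: copied as-is (existing %HH escapes are not touched).
            out.append(ch)
        else:
            # Non-ASCII: one %HH triplet per UTF-8 byte.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


# U+00E9 is UTF-8 bytes C3 A9, so it becomes %C3%A9.
escaped = escape_non_ascii("http://example.org/caf\u00e9#id")
print(escaped)
```

The result contains only US-ASCII characters, which is the form a
URI would take when written out in N-triples.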

> I also note that this is consistent with our test case:
> 
> http://www.w3.org/2000/10/rdf-tests/rdfcore/rdfms-difference-between-ID-and-about/test2.nt
> 
> http://www.w3.org/2000/10/rdf-tests/rdfcore/rdfms-difference-between-ID-and-about/test2.rdf
> 
> which has not been approved, seems to suggest the following
> 
> 1: IDs are subject to the same URI encoding rule.

Yup. (that is: values of rdf:ID attributes.)

> 2: N-triple URIs are in US-ASCII and must be already encoded.

Yes; to be crystal clear: All URIs are in US-ASCII.
URIs appear in N-triple syntax as-is, with no further encoding.

> These seem like good things.

Agreed.

> Dan - do you know about namespace declarations?
>     - are the URIs in Unicode (needing escaping) or US-ASCII?

I think namespace declarations must use URI references as-is;
i.e. you're not allowed to put non-URI characters in them.
This follows from
	(a) a literal reading of the namespaces REC,
	which says that the value of an xmlns attribute
	is a namespace name, and a namespace name *is* a URI reference
	(not something that can be decoded into a URI reference).
	Nobody has suggested changing/clarifying this
	aspect of the namespace spec, to my knowledge.

	(b) my own observation that the XML infrastructure
	treats namespace names as plain old strings, and
	never decodes or otherwise mangles them (other
	than normal XML attribute value literal interpretation).
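
Observation (b) can be checked with one parser as a rough
illustration. The sketch below uses Python's standard
xml.etree.ElementTree; the helper name and the example.org namespace
names are assumptions made up for the example. It shows only that
this one parser hands the namespace name back as the string it was
given, so an escaped and an unescaped spelling count as two
different namespaces.

```python
import xml.etree.ElementTree as ET


def namespace_of(xml_text: str) -> str:
    """Return the namespace name of the root element.

    ElementTree reports qualified names as '{namespace}local'.
    Hypothetical helper for illustration only.
    """
    tag = ET.fromstring(xml_text).tag
    return tag[1:].split("}")[0]


# Same "intended" namespace, once escaped and once with a raw U+00E9.
ns_escaped = namespace_of('<doc xmlns="http://example.org/caf%C3%A9"/>')
ns_raw = namespace_of('<doc xmlns="http://example.org/caf\u00e9"/>')

# The parser neither escapes nor unescapes, so the names differ.
print(ns_escaped == ns_raw)
```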

It's at least worth a health-warning to say "if you
put non-URI characters in your namespace names, LOOK OUT!
We know of no software that's going to help you!"

And it's worth a test case or two. Care to cook some up?

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Thursday, 20 September 2001 10:12:51 UTC