IRI-everywhere
I am the issue owner of http://www.w3.org/2000/03/rdf-tracking/#rdf-charmod-uris,
and wished to share some of the analysis that I did in applying the lessons
of Charmod to URIs in RDF.
The main function of a URI/IRI in RDF is as a name, just like in XML namespaces.
Being able to retrieve a URL is of less interest.
Three levels of IRI definition
Allow (nearly) any Unicode
The XLink href text, and the XML System Literal text (through an erratum) provide
a working definition of an IRI that allows any Unicode string that would %-escape
to a URI.
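As a rough illustration of this first level, the %-escaping step can be sketched in Python. This is only an approximation under stated assumptions, not the normative XLink or XML algorithm: the helper name `iri_to_uri` and the `_SAFE` character set are hypothetical, and the sketch assumes UTF-8 as the encoding used before escaping.

```python
from urllib.parse import quote

# Assumption: characters left untouched are the RFC 3986 reserved and
# unreserved characters, plus "%" so that escapes already present in the
# input are not escaped a second time.
_SAFE = ":/?#[]@!$&'()*+,;=-._~%"


def iri_to_uri(iri: str) -> str:
    """%-escape every character outside _SAFE as its UTF-8 bytes."""
    return quote(iri, safe=_SAFE, encoding="utf-8")


# The e-acute (U+00E9) is written as an escape so the source stays ASCII.
print(iri_to_uri("http://example.org/Andr\u00e9"))
# prints http://example.org/Andr%C3%A9
```

Under this definition, any Unicode string that survives the conversion counts as an identifier, which is why normalization differences become an issue.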
Disallow non-NFC
A theme in charmod is the importance of NFC, and early uniform normalization.
We analyzed this and found security issues when two XLink-style URIs differ
only in normalization (two different names look the same, and hence cannot
be visually distinguished; see
http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Mar/0027.html
The two strings:
"http://example.org/André"
"http://example.org/André"
are not identical but look the same.
)
Hence, in the IRI draft and in the current RDF WDs the identifiers must
be in NFC.
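The look-alike problem can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch; it assumes the two strings differ in that one uses the precomposed e-acute (U+00E9) and the other uses "e" followed by a combining acute accent (U+0301). The helper `is_nfc` is a hypothetical name introduced for illustration.

```python
import unicodedata

# Escapes are used so the normalization form survives copy and paste.
composed = "http://example.org/Andr\u00e9"     # e-acute as one code point
decomposed = "http://example.org/Andre\u0301"  # "e" + combining acute accent


def is_nfc(s: str) -> bool:
    """True if s is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", s) == s


print(composed == decomposed)                                # False
print(is_nfc(composed), is_nfc(decomposed))                  # True False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Requiring NFC means the second string is simply rejected as an identifier, so the two names can no longer coexist.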
In XML Namespaces 1.1, when used within XML 1.1, this restriction (that
IRIs must be in NFC) is implicit. While XML Namespaces 1.1 does not mention
normalization, since an XML 1.1 document must be fully normalized, all attribute
values are in NFC, including those for xmlns attributes.
Disallow certain BiDi
The IRI draft also considers problems to do with BiDi text. I don't pretend
to understand these. In addition, some Unicode character sequences that involve
right-to-left characters and other bidi markup, and that are legal under XLink
href, are not legal as IRIs according to the IRI draft. This part of the IRI
draft is, as far as I am aware, still cooking.
IRI equality
The definition of equality between identifiers is important for any system
like XML Namespaces or RDF which uses URIs/IRIs as names rather than as URLs.
This notion of equality differs from operational equivalence. (Note the HTTP
spec, when discussing equality of http URLs, is essentially discussing operational
equivalence, rather than abstract equality of names).
Identity - Character by Character
The simplest notion of equality is found in the XML Namespaces spec (both
1.0 and 1.1), i.e. character-by-character equality. This is good from an implementor's
point of view, and easy to explain to users.
Same Resource? Scheme-specific equality
Another intuitive sense of equality is that the two URIs/IRIs identify the
same resource. Attempting to capture this would involve both generic processing
(to do with case, %-escapes, IANA protocol numbers, etc.) and scheme-specific
processing. Even then, such an implementation would not fully capture this
intuitively appealing concept of equality.
Same URI - (ASCII) URI as value space for anyURI
When faced with IRIs as a migration from URIs it is tempting to define equality
of IRIs thus: two IRIs are equal if their URIs are equal. This essentially punts
the question, and typically involves URI character-by-character equality. It
also raises stupid questions about %7E and %7e ...
From an application point of view, defining equality in this way creates
work:
- the original characters (the IRI) must be retained, in order to redisplay
the identifier to the user in the manner they expect.
- either the %-escaping must be repeatedly performed or the value cached,
either of which adds significant overhead.
- when two different IRIs %-escape to the same URI (possible when one
IRI is already, perhaps partially, %-escaped) and the associated concepts
get merged, the application will probably retain one, and arbitrarily discard
the other, leading to potential user confusion.
- it may be necessary to normalize the case of hexadecimal characters in
%-escape sequences, which runs counter to the lack of case normalization in URI
comparison.
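The last two points can be made concrete with a short sketch. It assumes a simplified %-escaping helper (`iri_to_uri`, a hypothetical name, using UTF-8 and leaving existing escapes untouched), so it illustrates the problem rather than any specified algorithm.

```python
from urllib.parse import quote

# Assumption: RFC 3986 reserved and unreserved characters stay as they are,
# and "%" is kept so that existing escapes are not escaped again.
_SAFE = ":/?#[]@!$&'()*+,;=-._~%"


def iri_to_uri(iri: str) -> str:
    """Simplified %-escaping of an IRI, using UTF-8."""
    return quote(iri, safe=_SAFE, encoding="utf-8")


raw = "http://example.org/Andr\u00e9"      # unescaped IRI
escaped = "http://example.org/Andr%C3%A9"  # same name, already %-escaped

# Distinct as character strings, yet merged under URI-based equality.
print(raw == escaped)                          # False
print(iri_to_uri(raw) == iri_to_uri(escaped))  # True

upper = "http://example.org/%7Euser"
lower = "http://example.org/%7euser"

# Without extra hex-case normalization these stay distinct as URIs.
print(iri_to_uri(upper) == iri_to_uri(lower))  # False
```

With plain character-by-character comparison of the IRIs themselves, none of this conversion or caching is needed.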
Proposed Actions
I think there are good arguments, presented in Charmod, for requiring NFC.
I think there are good arguments that, when an IRI is being used as an identifier,
equality of IRIs should be character-by-character comparison.
I think the sooner an I18N-WG recommendation has normative text defining
IRIs the better.
I think this should not be made dependent on resolving the hard bidi
issues.
Current State of the Art = Disallow non-NFC
In the specs that are already at full recommendation we have IRI support
allowing any Unicode. In specs such as Namespaces 1.1 (within XML 1.1) we
have the proposal that IRIs should be in NFC. I think that the W3C should
have this as the current expectation on WGs - the specs should support IRIs
(without the BiDi) and should require full normalization.
Replace Charmod Section 8 with explicit text
I understand that charmod is currently held up, partly because of its dependence
on IRI-draft.
Thus I propose to modify charmod as follows, and move it to Candidate/Proposed
Rec.
- delete normative text in section 8 Character Encoding in URI References
- add text defining an IRI, similar to that in XML Namespaces 1.1, XLink,
or the RDF concepts WD, with the explicit constraint that IRIs are in NFC.
- add an erratum to the charmod erratum page with fragID #IRI and text -
"No IRI erratum at present."
- add a note to section 8 of charmod suggesting that the reader refer to
that erratum
- add a constraint in section 8 that documents and implementations SHOULD
NOT use right-to-left or bidi characters in IRIs
- add a note explaining that IRI-draft does address bidi, but this is still
in development
Once IRI-draft is a recommendation, then update the erratum against charmod
replacing section 8 with:
[[Note: this section is deleted. However, specs, implementations and documents
MUST conform with IRI]]
This will allow XML Namespaces 1.1 and other documents to normatively refer
to a definition of IRI. Moreover, once bidi in IRI is cooked, new specs will
be able to refer to IRI.
Jeremy Carroll.