RE: IRI-Everywhere from Julian Reschke on 2002-11-16 (www-tag@w3.org from November 2002)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sat, 16 Nov 2002 12:29:33 +0100
To: "Jeremy Carroll" <jjc@hplb.hpl.hp.com>, <www-tag@w3.org>
Message-ID: <JIEGINCHMLABHJBIGKBCIEPGFMAA.julian.reschke@gmx.de>
Some more thoughts:

"Hence, in the IRI draft and in the current RDF WDs the identifiers must be in NFC. In XML Namespaces 1.1, when used within XML 1.1, this restriction (that IRIs must be in NFC) is implicit. While XML Namespaces 1.1 does not mention normalization, since an XML 1.1 document must be fully normalized, all attribute values are in NFC, including those for xmlns attributes."

- this is not true for the current XML 1.1 draft (normalization is *optional*, [1])

- normalization is no guarantee that similar looking IRIs that are different char-by-char can occur ([2])

Julian


[1] <http://www.w3.org/TR/xml11/#sec2.13>
[2] <http://www.ietf.org/internet-drafts/draft-duerst-iri-02.txt>, section 5.1.c

--
<green/>bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760 
-----Original Message-----
From: www-tag-request@w3.org [mailto:www-tag-request@w3.org]On Behalf Of Jeremy Carroll
Sent: Tuesday, November 12, 2002 8:21 PM
To: www-tag@w3.org
Subject: IRI-Everywhere


IRI-everywhere
I am the issue owner of http://www.w3.org/2000/03/rdf-tracking/#rdf-charmod-uris, and wished to share some of the analysis that I did in applying the lessons of Charmod to URIs in RDF.
The main function of a URI/IRI in RDF is as a name, just like in XML namespaces.
Being able to retrieve a URL is of less interest.

Three levels of IRI definition
Allow (nearly) any Unicode
The XLink href text, and the XML System Literal text (though erratum) provide a working definition of an IRI that allows any Unicode string that would %-escape to a URI.

Disallow non-NFC
A theme in charmod is the important of NFC, and early uniform normalization.
We analyzed this and found security issues when two XLink style URIs differ only in normalization (two different names look the same, and hence cannot be visually distinguished: see:
http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Mar/0027.html
The two strings:
"/André" 
"/Andre´" 
are not identical but look the same, and IRIs including these strings hence are different, but visually indistinguishable names)
Hence, in the IRI draft and in the current RDF WDs the identifiers must be in NFC.
In XML Namespaces 1.1, when used within XML 1.1, this restriction (that IRIs must be in NFC) is implicit. While XML Namespaces 1.1 does not mention normalization, since an XML 1.1 document must be fully normalized, all attribute values are in NFC, including those for xmlns attributes.

Disallow certain BiDi
The IRI draft also considers problems to do with BiDi text. I don't pretend to understand these. Further Unicode character sequences, which involve right-to-left characters and other bidi markup, which are legal under XLink href, are not legal as IRIs according to the IRI draft. This part of the IRI draft, is, as far as I am aware, still cooking.

IRI equality
The definition of equality between identifiers is important for any system like XML Namespaces or RDF which uses URIs/IRIs as names rather than as URLs. This notion of equality differs from operational equivalence. (Note the HTTP spec, when discussing equality of http URLs, is essentially discussing operational equivalence, rather than abstract equality of names).

Identity - Character by Character
The simplest notion of equality is found in the XML Namespaces spec (both 1.0 and 1.1) i.e. character by character equality. This is good from an implementors point of view; and easy to explain to users.

Same Resource? Scheme specifc equality
Another intuitive sense of equality is that the two URIs/IRIs identify the same resource. Attempting to capture this would involve both generic processing to do with case, %-escapes, IANA protocol numbers, etc. and scheme specific processing. Even then, such an implementation would not fully capture this intuitively appealing concept of equality.

Same URI - (ASCII) URI as value space for anyURI
When faced with IRIs as a migration from URIs it is tempting to define equality of IRIs as two IRIs are equal if their URIs are equal. This essentially punts the question, and typical involves URI character-by-character equality. It also raises stupid questions about %7E and %7e ...
From an application point of view, defining equality in this way creates work:

the original characters (the IRI) must be retained, in order to redisplay the identifier to the user in the manner they expect. 
either the %-escaping must be repeatedly performed or the value cached, both of which adds significant overhead. 
When two different IRIs %-escape to the same URI (possible when one IRI is already, perhaps partially, %-escaped) and the associated concepts get merged, the application will probably retain one, and arbitratily discard t he other, leading to potential user confusion. 
It may be necessary to normalize case of hexadecimal characters in % escape sequences, this runs counter to the lack of case normalization in URI comparison.

 Conclusion and Proposed Actions
I think there are good arguments, presented in Charmod, for requiring NFC.
I think there are good arguments for, when an IRI is being used as an identifier, then equality of IRIs should be character-by-character comparison.
I think the sooner an I18N-WG recommendation has normative text defining IRIs the better.
I think this should not be made dependent on resolving the bidi issues that are hard. 

Current State-of-Art=Disallow non-NFC
In specs that are already at full recommendation we have IRI support allowing any unicode. In specs  nearing completion such as Namespaces 1.1 (within XML 1.1) we have the proposal that IRIs should be in NFC. I think that the W3C should have this as the current expectation on WGs - the specs should support IRIs (without the BiDi) and should require full normalization. 

Replace Charmod Section 8 with explicit text

I understand that charmod is currently held up, partly because of its dependence on IRI-draft.
Thus I propose that the following  modifications be made to  charmod, and  that it moves  to Candidate/Proposed Rec.
- delete normative text in section 8 Character Encoding in URI References
- add text defining an IRI, similar to that in XML Namespaces 1.1, XLink, or  the RDF concepts WD, with the explicit constraint that IRIs are in NFC.
- add an eratum to the charmod erratum page with fragID #IRI and text - "No IRI erratum at present."
- add a note to section 8 of charmod suggesting that the reader refer to that erratum
- add a constraint in section 8 that documents and implementations SHOULD not use right to left or bidi characters in IRIs 
- add a note explaining that IRI-draft does address bidi, but this is still in development

Once IRI-draft is a recommendation, then update the erratum against charmod replacing section 8 with:
[[Note: this section is deleted. However,  specs, implementations and documents MUST conform with IRI]]
This will allow XML Namespaces 1.1 and other documents to normatively refer to a definition of IRI. Moreover, once bidi in IRI is cooked, new specs will be able to refer to IRI.


Jeremy Carroll.
Received on Saturday, 16 November 2002 06:30:11 UTC