Comment on anyURI and its usage of IRIs

Dear all,

Based on the discussion during this week's architecture domain 
teleconference and the discussion in the i18n-core working group, 
I have drafted the following proposal, which I would like to submit to 
the XML Schema Working Group on behalf of i18n-core. Please send me any 
comments you have; I will send the mail to the XML Schema Working Group 
on Monday.

Regards, Felix.



Proposal for changes to the anyURI datatype, as defined in XML Schema 
(cf. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#anyURI):

The i18n-core WG proposes an update to the anyURI datatype, which is 
defined in the current version of XML Schema Part 2, cf. 
http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#anyURI . Currently, 
the mapping from anyURI values to URIs is defined in terms of the XLink 
specification, cf. 
http://www.w3.org/TR/2001/REC-xlink-20010627/#link-locators . We think 
that anyURI should instead refer to the specification of 
Internationalized Resource Identifiers (IRIs), cf. 
http://www.ietf.org/rfc/rfc3987 . The IRI specification has reached a 
stable status. It specifies how to extend the set of characters allowed 
in URIs from a subset of US-ASCII to the Universal Character Set 
(Unicode/ISO 10646). W3C has announced its support for the IRI 
specification, so we propose applying it to anyURI. Our proposal for 
anyURI consists of four points:

(1) anyURI should refer to sec. 3.1 of the IRI spec instead of XLink. 
This is important, for example, because of the normalization 
requirements described in the IRI specification: if text in a legacy 
encoding is not normalized before the mapping from anyURI to URIs, the 
result may differ from the normalized case. Sec. 3.1 of the IRI 
specification gives an example of such a legacy encoding, Vietnamese 
encoded as windows-1258. Normalization is only one of many important 
details discussed in the IRI specification.
(2) Any reference to URI should be updated from RFC 2396 to RFC 3987. 
For domain names, anyURI should refer to the IDN part of the ABNF in 
sec. 2.2 of the IRI spec. This will allow the use of internationalized 
domain names.
(3) The definition of anyURI may want to point to the following 
paragraph from section 3.1 of the IRI specification:
"Systems accepting IRIs MAY also deal with the printable characters in 
US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", 
"}", "|", "\", "^", and "`", in step 2 above.  If these characters are 
found but are not converted, then the conversion SHOULD fail.  Please 
note that the number sign ("#"), the percent sign ("%"), and the square 
bracket characters ("[", "]") are not part of the above list and MUST 
NOT be converted.  Protocols and formats that have used earlier 
definitions of IRIs including these characters MAY require 
percent-encoding of these characters as a preprocessing step to extract 
the actual IRI from a given field. This preprocessing MAY also be used 
by applications allowing the user to enter an IRI."
(4) An editorial issue: the reference from anyURI to section 8 of the 
old version of the "Character Model for the World Wide Web" 
specification should be changed to the new charmod-resid specification, 
cf. http://www.w3.org/TR/2004/CR-charmod-resid-20041122/ .
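
To make points (1) and (3) more concrete, here is a rough Python sketch 
of the kind of mapping described in sec. 3.1 of the IRI spec. It is only 
an illustration under simplifying assumptions, not the normative 
algorithm: the function name iri_to_uri and its parameters are made up 
for this mail, every non-ASCII character is percent-encoded via UTF-8 
(the spec restricts this to specific character ranges), and domain names 
are not treated specially.

```python
import unicodedata

# Printable US-ASCII characters that are not allowed in URIs and that
# systems MAY also convert (list quoted in point (3)).
EXTRA_ASCII = set('<>" {}|\\^`')


def percent_encode(ch: str) -> str:
    """Percent-encode one character via its UTF-8 octets."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


def iri_to_uri(iri: str, normalize: bool = False,
               encode_extra: bool = True) -> str:
    """Simplified IRI-to-URI mapping (hypothetical helper, see lead-in).

    normalize: apply NFC first, as required when the text came from a
        legacy (non-Unicode) encoding such as windows-1258.
    encode_extra: convert the optional ASCII characters from point (3);
        if False and one of them is found, the conversion fails.
    """
    if normalize:
        iri = unicodedata.normalize("NFC", iri)
    out = []
    for ch in iri:
        if ord(ch) > 0x7F:
            # Simplification: all non-ASCII characters are converted.
            out.append(percent_encode(ch))
        elif ch in EXTRA_ASCII:
            if not encode_extra:
                raise ValueError(f"character {ch!r} not allowed in a URI")
            out.append(percent_encode(ch))
        else:
            # "#", "%", "[" and "]" fall through here and stay as they are.
            out.append(ch)
    return "".join(out)
```

With this sketch, the windows-1258 style sequence U+00EA U+0323 in 
"Vi\u00ea\u0323t" maps to "Vi%E1%BB%87t" when normalize=True, but to 
"Vi%C3%AA%CC%A3t" without normalization, which is the kind of difference 
point (1) is about.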

Received on Wednesday, 30 March 2005 19:01:38 UTC