Re: IRIs

> On Mar 7, 2018, at 10:39 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
> The definition of anyURI doesn't allow IRIs, such as
>  https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en.
> 
> Just as we added an iemail address type to match modern email addresses, it seems to me that we ought to also add an anyIRI type that accepts IRIs like the above.

I am puzzled; what leads you to the conclusion that xsd:anyURI 
does not accept IRIs?

In XSD 1.0 [1], the value space is described as that of 
RFC 2396, as modified by RFC 2732, and the lexical space
is described (roughly) as the set of strings that, after
escaping, turn into URIs as defined by those specs.  The
escaping in question is the then-current algorithm for IRIs,
as published in the XLink spec. I believe that later revisions
of the concept of IRI changed the rules for whitespace, but
I don’t recall any other changes likely to be noticeable to
users of the datatype.  Certainly the intent of XSD 1.0 
was to accept IRIs in the lexical space of the type anyURI.

The spec says "This [the mapping from lexical space to value
space] means that a wide range of internationalized resource 
identifiers can be specified when an anyURI is called for".
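
To make the escaping step concrete, here is a minimal Python sketch.
It is not the normative algorithm: it assumes only that each non-ASCII
character is replaced by the %HH escapes of its UTF-8 bytes, and it
ignores the other disallowed characters (spaces, for example) that the
XLink rules also cover.  The function name escape_non_ascii is mine,
for illustration only.

```python
from urllib.parse import unquote


def escape_non_ascii(iri: str) -> str:
    """Percent-escape each non-ASCII character as its UTF-8 bytes.

    Simplified sketch: ASCII characters are passed through untouched,
    including ones (such as spaces) that a full implementation of the
    escaping rules would also have to escape.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            # One %HH triplet per UTF-8 byte of the character.
            parts.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(parts)


if __name__ == "__main__":
    iri = "https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en"
    uri = escape_non_ascii(iri)
    print(uri)
    # Unescaping the result recovers the original IRI.
    print(unquote(uri) == iri)
```

Under that assumption the IRI from the original message maps to an
all-ASCII string, which is the sense in which it belongs to the
lexical space of anyURI in 1.0.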

In XSD 1.1 [2], the spec is a little more explicit, since the
IRI concept was a little more clearly developed by that time:
"anyURI represents an Internationalized Resource Identifier 
Reference (IRI).  An anyURI value can be absolute or relative, 
and may have an optional fragment identifier (i.e., it may be 
an IRI Reference).  This type should be used when the value 
fulfills the role of an IRI, as defined in [RFC 3987] or its 
successor(s) in the IETF Standards Track."

During the development of XSD 1.1 the WG responded to
inconsistencies in the 1.0 implementations of the anyURI
type (and, perhaps, to fears that future revisions of the RFCs 
for URIs and IRIs would continue to change the set of legal 
values) by seeking to simplify and future-proof the rules used
for checking schema-validity of IRIs.  For reasons I do not think 
I can successfully reconstruct (at least, not without falling
into depression), it chose to do so by stating clearly that the 
grammar rules specified by the relevant RFCs are effectively 
only advisory, and that for purposes of schema validation,
any sequence of XML characters constitutes a value of the
type.
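
As an illustration, the 1.1 rule amounts to something like the
following Python sketch.  It assumes the Char production of XML 1.0
as the definition of "XML character" and ignores whitespace
normalization; the function names are hypothetical, not taken from
the spec or from any validator.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.0 (assumed here)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def accepted_as_anyuri_xsd11(literal: str) -> bool:
    """Sketch of the lax check: no URI or IRI grammar is applied at all."""
    return all(is_xml_char(ch) for ch in literal)


if __name__ == "__main__":
    for s in (
        "https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en",
        "not an IRI ::: {at all}",
    ):
        print(repr(s), accepted_as_anyuri_xsd11(s))
```

Both strings pass this check, the second one despite not matching
the IRI grammar.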

So in XSD 1.1 it is doubly untrue to say that IRIs are not
accepted as lexical representations of xsd:anyURI:  not only
is it clearly stated that IRIs are to be accepted, but strings
that do not match the current definition of IRIs will *also*
be accepted as schema-valid.  

XForms needs its own IRI type only if stricter validation of the
grammar of URIs and IRIs is needed.  

If in fact stricter validation is needed, the XForms group may 
wish to consider using the datatypes defined in “XSD datatypes 
for strict validation of IRIs and URIs” [3].

It would be very disappointing if the amount of work that went
into making xsd:anyURI accept IRIs turned out to be for
naught.

[1] https://www.w3.org/TR/xmlschema-2/#anyURI
[2] https://www.w3.org/TR/xmlschema11-2/#anyURI
[3] https://www.w3.org/XML/Group/2004/06/exacturi/xsd-rfc-3986-uri-3986-iri.html

N.B. I am unable to verify URI [3], since my access privileges
no longer seem sufficient to retrieve the document.  [3] was
prepared for publication as a WG note by the then XML Schema 
WG but never published, since the WG ran out of resources and 
time.  When the XML Core WG took over responsibility for 
XSD, they decided they didn’t have the necessary resources, either.
I would be glad if the work were finally published.

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************

Received on Wednesday, 7 March 2018 18:37:51 UTC