[whatwg] Conformance requirements for IRIs

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 17 Apr 2006 19:14:14 +0300
Message-ID: <FF8864C2-5542-460A-90C7-91BB02A2D152@iki.fi>
In WA 1.0 and WF 2.0 some values are required to be IRIs and some  
values are required to be IRI references. I'm confused about what  
exactly this means in terms of conformance checking. (WF 2.0 does say  
something about processing in a browser, though.)

First, I was amazed to learn that for pure non-infoset-augmenting
validation, the xsd:anyURI datatype does not mean anything useful beyond
token, and that it is not exactly an IRI reference.
http://www.imc.org/atom-syntax/mail-archive/msg17990.html
http://www.mail-archive.com/rng-users@yahoogroups.com/msg00350.html

Having read
http://www.w3.org/TR/xlink/#link-locators
I started to suspect that just about every string can indeed be
considered sort of an IRI reference that can be munged into an IRI
reference, so there's nothing to check.
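
A minimal sketch of that munging step, to show why almost nothing is
left to reject. It assumes one reading of the XLink 1.0 locator rules:
non-ASCII characters, control characters, space and the RFC 2396
"delims" and "unwise" characters are replaced by percent-escaped UTF-8
bytes, while "#", "%" and square brackets are left alone. The function
name and the exact character set are illustrative assumptions, not
text taken from any specification.

```python
# Assumed set of ASCII characters that get escaped (space, RFC 2396
# delims minus '#' and '%', unwise minus '[' and ']').
DISALLOWED_ASCII = set('<>"{}|\\^`') | {" "}


def escape_locator(value: str) -> str:
    """Percent-escape every disallowed character as its UTF-8 bytes."""
    out = []
    for ch in value:
        cp = ord(ch)
        # Controls (below 0x20), DEL and everything non-ASCII (above 0x7E),
        # plus the assumed ASCII set above, are escaped.
        if cp < 0x20 or cp > 0x7E or ch in DISALLOWED_ASCII:
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

Under these assumptions every input string maps to some output, so a
checker built only on this step would never report an error.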

Then I found
http://jena.sourceforge.net/tmp/javadoc/com/hp/hpl/jena/iri/IRIFactory.html
which provides a fascinating number of enforcement options. I could  
write a custom datatype wrapper for it, but I don't know which  
options to use.

I'd appreciate some guidance on which enforcement options to use.  
(E.g. should knowledge of the http scheme be used? Should security
issues be flagged as non-conforming? Should "SHOULD" violations be  
flagged as non-conforming? Etc.)



(This is the first time I venture into the world of IRIs. I have  
intuitively thought that they are trouble, so I have knowingly  
avoided minting non-URI IRIs myself.

I suspected that bad stuff happens with IRIs containing decomposed
character sequences. (These can be found in the URI form due to
HFS+-backed Apache setups.) Now that I've read the RFC, I think it is a
very bad idea to allow decomposed characters in IRIs, and equally bad
that the RFC does not require percent-encoding of character sequences
that are not invariant under NFC.

This may have relevance to how the WF 2.0 url input works. That is,  
it probably SHOULD (MUST?) NOT percent-decode URIs that would result  
in IRIs that are not invariant under NFC.)
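
The check suggested above can be sketched as follows. This is only an
illustration under simplifying assumptions: it decodes every percent
escape as UTF-8 (a real URI-to-IRI conversion would keep some escapes),
and the function names are hypothetical.

```python
import unicodedata
from urllib.parse import unquote


def is_nfc_invariant(text: str) -> bool:
    """True if NFC normalization leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


def safe_to_percent_decode(uri: str) -> bool:
    """True if decoding the escapes yields NFC-invariant text.

    Simplification: all escapes are decoded, and escapes that are not
    valid UTF-8 are treated as unsafe to decode.
    """
    try:
        decoded = unquote(uri, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return is_nfc_invariant(decoded)
```

For example, "http://example.org/%C3%A9" decodes to a precomposed
character and passes, while "http://example.org/e%CC%81" decodes to
"e" followed by a combining acute accent and does not.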

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Monday, 17 April 2006 09:14:14 UTC