parsing URI (references) according to RFC 3986 from Julian Reschke on 2011-06-18 (public-iri@w3.org from June 2011)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sat, 18 Jun 2011 13:56:39 +0200
To: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <4DFC9277.10801@gmx.de>

Hi,

some time ago I started working on a sample implementation of the RFC 
3986 algorithms for parsing and resolving references. The results are 
over here (incl. source files for people who want to play around with 
it, or add more tests):

	http://greenbytes.de/tech/tc/uris/

Note that the regular expression in 
<http://greenbytes.de/tech/webdav/rfc3986.html#rfc.section.B> matches 
any input string, not just valid URIs. Likewise, the resolution algorithm 
in <http://greenbytes.de/tech/webdav/rfc3986.html#rfc.section.5> does 
not depend on the components being valid.
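
To illustrate the point above, here is a minimal sketch in Python. It is not 
the code behind the test suite; it only follows RFC 3986, Appendix B (regular 
expression) and Sections 5.2.2 to 5.3 (strict resolution). The names Parts, 
split_reference, remove_dot_segments, merge and resolve are made up for this 
example.

```python
import re
from collections import namedtuple

# Regular expression from RFC 3986, Appendix B. Every group is optional and
# the path group may be empty, so it matches any input string.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)

# Hypothetical container; None means "component undefined".
Parts = namedtuple("Parts", "scheme authority path query fragment")


def split_reference(ref):
    """Split any string into the five components (no validation)."""
    m = URI_RE.match(ref)
    return Parts(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))


def remove_dot_segments(path):
    """RFC 3986, Section 5.2.4."""
    out = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if out:
                out.pop()
        elif path == "/..":
            path = "/"
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to output.
            idx = path.find("/", 1)
            if idx == -1:
                out.append(path)
                path = ""
            else:
                out.append(path[:idx])
                path = path[idx:]
    return "".join(out)


def merge(base, ref_path):
    """RFC 3986, Section 5.2.3."""
    if base.authority is not None and base.path == "":
        return "/" + ref_path
    return base.path[: base.path.rfind("/") + 1] + ref_path


def resolve(base, ref):
    """RFC 3986, Section 5.2.2 (strict), then recomposition (Section 5.3)."""
    b, r = split_reference(base), split_reference(ref)
    if r.scheme is not None:
        t = (r.scheme, r.authority, remove_dot_segments(r.path), r.query)
    elif r.authority is not None:
        t = (b.scheme, r.authority, remove_dot_segments(r.path), r.query)
    elif r.path == "":
        query = r.query if r.query is not None else b.query
        t = (b.scheme, b.authority, b.path, query)
    elif r.path.startswith("/"):
        t = (b.scheme, b.authority, remove_dot_segments(r.path), r.query)
    else:
        path = remove_dot_segments(merge(b, r.path))
        t = (b.scheme, b.authority, path, r.query)
    scheme, authority, path, query = t
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if r.fragment is not None:
        result += "#" + r.fragment
    return result
```

With the base URI from RFC 3986, Section 5.4, resolve("http://a/b/c/d;p?q", 
"../g") gives "http://a/b/g". Input that is not a URI at all, such as 
"not a uri <>", still splits: everything ends up in the path component and 
the other four components are undefined.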

I believe this can be a basis for the algorithms the HTML5 people are 
looking for. What's missing is:

- optional preprocessing (strip leading/trailing whitespace)

- optional postprocessing (fix up non-ASCII characters in query 
parameters when the reference does not originate from a UTF-8 encoded 
document; maybe scheme-specific cleanup).
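
As a rough sketch of what these two optional steps might look like (this is 
one possible reading, not something RFC 3986 defines): the function names, 
the set of whitespace characters, and the choice to percent-encode non-ASCII 
query characters using the document's encoding are all assumptions made for 
this example.

```python
# Assumed whitespace set for the preprocessing step.
WHITESPACE = " \t\n\f\r"


def preprocess(ref):
    """Strip leading and trailing whitespace before parsing."""
    return ref.strip(WHITESPACE)


def encode_query(query, document_encoding="utf-8"):
    """Percent-encode non-ASCII characters in a query string.

    The bytes are produced with the encoding of the originating document.
    ASCII characters are passed through untouched. Characters the encoding
    cannot represent raise UnicodeEncodeError in this sketch.
    """
    out = []
    for ch in query:
        if ord(ch) < 128:
            out.append(ch)
        else:
            encoded = ch.encode(document_encoding)
            out.append("".join("%%%02X" % byte for byte in encoded))
    return "".join(out)
```

For example, encode_query("q=é", "iso-8859-1") gives "q=%E9", while the same 
query from a UTF-8 document gives "q=%C3%A9".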

What's also missing is a way to uniquely identify a test case; the 
obvious answer is to assign a unique identifier to each of them -- does 
anybody have a better idea that requires less work?

Feedback welcome; in particular with respect to interesting additional 
tests (I don't have any non-URI tests yet).

Best regards, Julian

Received on Saturday, 18 June 2011 11:57:18 UTC