Re: a few URI/href issues captured with test cases from Julian Reschke on 2009-05-21 (www-tag@w3.org from May 2009)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Thu, 21 May 2009 19:05:52 +0200
To: Dan Connolly <connolly@w3.org>
CC: www-tag@w3.org
Message-ID: <4A1589F0.2000705@gmx.de>

Dan Connolly wrote:
> Larry, Henry, John,
> 
> I made some progress on ACTION-265
> 
> "Work with Larry, Henry to frame technical issues relating to the
> various overlapping specs. about URIs, IRIs and encoding on the wire"
>  --
>   http://www.w3.org/2001/tag/group/track/actions/265
> 
> In particular...
> 
>   http://www.w3.org/html/wg/href/elab.html
>   http://www.w3.org/html/wg/href/elab10.html
> 
> This is a successive elaboration of the issues with
> issues captured as test cases.
> 
> It's what I was talking about when I wrote...
> 
> (the best way to slow down is to make test cases. here's hoping I find
> time)
>  -- http://www.w3.org/2001/tag/2009/05/07-minutes#item05
> 
> 
> The issues covered are
> 
>  Space in Path
>  Colon in path
>  Non-ASCII characters in path
>  Non-ASCII characters in path and query/search
> 
> Larry, I showed you an earlier draft and you weren't too
> excited. I still find this is the way my brain needs
> to capture issues.
> 
> John, could you take a look and see if I'm making sense, at least?
> 
> I gather Henry is out this week...
> ...

This has been under discussion for something like nine months. I think 
the issues, as documented by Ian, Henri, and now Dan, are 
well understood (and thanks for posting examples and test cases).

I think when we discussed this last October, Larry and several others 
(including myself...) pointed out that the additional complexity as 
compared to IRIs (RFC 3987) can easily be layered *above* IRI, mapping 
HTML5 references to IRIs just by stating:

1) non-IRI characters found in the query part are encoded using the 
document's character encoding, then percent-escaped (*)

2) all other non-IRI characters (such as space) are encoded using UTF-8, 
then percent-escaped
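
To make rules 1) and 2) concrete, here is a rough Python sketch; it is an
illustration under stated assumptions, not text from any spec. The function
names are made up. It assumes that in the query part every character outside
the ASCII URI repertoire needs escaping (as in the footnote's "non-URI
characters"), that elsewhere only ASCII characters such as space are escaped
while non-ASCII characters stay as they are (they are legal in IRIs), and
that an existing "%" is left alone. It does not check the exact ucschar
ranges of RFC 3987.

```python
from urllib.parse import quote

# ASCII punctuation left as-is: RFC 3986 reserved + unreserved, plus "%"
# so that existing percent-escapes are not escaped a second time.
# (Letters and digits are never escaped by quote().)
URI_SAFE = "-._~:/?#[]@!$&'()*+,;=%"


def escape_query(query: str, doc_encoding: str) -> str:
    """Rule 1: encode with the document's encoding, then percent-escape."""
    return quote(query, safe=URI_SAFE, encoding=doc_encoding)


def escape_other(part: str) -> str:
    """Rule 2: encode with UTF-8, then percent-escape.

    Assumption: only ASCII characters are candidates here; non-ASCII
    characters are kept because IRIs allow them outside the query too.
    """
    return "".join(
        quote(ch, safe=URI_SAFE, encoding="utf-8") if ord(ch) < 128 else ch
        for ch in part
    )


def html5_ref_to_iri(ref: str, doc_encoding: str) -> str:
    """Hypothetical helper: map an HTML5 reference to an IRI."""
    # Split off the fragment first, then the query.
    rest, hash_sep, fragment = ref.partition("#")
    before_query, query_sep, query = rest.partition("?")
    return (
        escape_other(before_query)
        + query_sep
        + escape_query(query, doc_encoding)
        + hash_sep
        + escape_other(fragment)
    )


ref = "http://example.org/a b/\u00e9?q=\u00e9"
print(html5_ref_to_iri(ref, "iso-8859-1"))  # query becomes ?q=%E9
print(html5_ref_to_iri(ref, "utf-8"))       # query becomes ?q=%C3%A9
```

In both calls the space in the path becomes %20 and the non-ASCII character
in the path is untouched; only the query part depends on the document's
encoding.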

Or, if we use LEIRIs as foundation instead 
(<http://tools.ietf.org/html/draft-duerst-iri-bis-04#section-7>), we end 
up with a *single* rule:

1') non-IRI characters found in the query part are encoded using the 
document's character encoding, then percent-escaped (*)

Why does it need to be more complex than that?

BR, Julian

(*) Note that HTML5 considers links with non-URI characters in the 
query part valid only if the document's encoding is UTF-8/16 (as far as I 
recall).

Received on Thursday, 21 May 2009 17:06:39 UTC