Re: Updating the IRI spec to include "web addresses"

From: Julian Reschke <julian.reschke@gmx.de>
Date: Mon, 01 Jun 2009 16:14:04 +0200
Message-ID: <4A23E22C.2010502@gmx.de>
To: Larry Masinter <masinter@adobe.com>
CC: "Roy T. Fielding" <fielding@gbiv.com>, HTML WG <public-html@w3.org>, "public-iri@w3.org" <public-iri@w3.org>

Larry Masinter wrote:
> I've found it convenient to use "HRef" as a shorthand
> in the document.
> What I'm not sure of is whether I can get away with
> just *replacing* the IRI -> URI algorithm, or if
> I should leave both HRef -> URI and IRI -> URI.

I think the IRI -> URI algorithm should not change (except for the bit 
about normalization discussed previously).

What should be added is HRef -> IRI (which implies that in some cases, 
that mapping would need to map query parameters to plain ASCII).

LEIRIs then could become a special case of the thing described above.

> Right now, the HTML5/"Web Address" draft is written as
> "how to parse" and "how to resolve relative to absolute".
> I'm not sure if it's possible to recast it as
> HRef => URI, but it's certainly worth a try.

Repeating what I suggested on www-tag a few days ago:

This has been under discussion for something like nine months. I think
the issues, as documented by Ian, Henri and now by Dan are
well-understood (and thanks for posting examples and test cases).

I think when we discussed this last October, Larry and several others
(including myself...) pointed out that the additional complexity as
compared to IRIs (RFC3987) can easily be layered *above* IRI, mapping
HTML5-references to IRIs just by stating:

1) non-IRI characters found in the query part are encoded using the
document's character encoding, then percent-escaped (*)

2) all other non-IRI characters (such as space) are encoded using UTF-8,
then percent-escaped
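
The two rules above can be sketched roughly as follows. This is only an
illustration, not text from any draft: the function name href_to_iri, the
ASCII_ALLOWED set, and the shortcut of treating every code point from U+00A0
upward as an IRI character are assumptions made for the example (RFC 3987
defines the exact ranges). The sketch also assumes that, in the query part,
every non-ASCII character is escaped using the document's encoding, so that
the query ends up as plain ASCII.

```python
# ASCII characters that may appear literally: RFC 3986 unreserved and
# reserved characters, plus "%" so that existing escapes are left alone.
ASCII_ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    "-._~:/?#[]@!$&'()*+,;=%"
)


def _escape(ch: str, encoding: str) -> str:
    """Encode one character and percent-escape each resulting byte.

    Raises UnicodeEncodeError if the character cannot be represented in
    the given encoding; this sketch does not define a fallback.
    """
    return "".join(f"%{b:02X}" for b in ch.encode(encoding))


def _fix(part: str, in_query: bool, doc_encoding: str) -> str:
    out = []
    for ch in part:
        if ch in ASCII_ALLOWED:
            out.append(ch)
        elif in_query:
            # Rule 1: query part, use the document's character encoding.
            out.append(_escape(ch, doc_encoding))
        elif ord(ch) >= 0xA0:
            # Assumption: treated as an IRI character and kept as-is.
            out.append(ch)
        else:
            # Rule 2: everything else (such as space) goes through UTF-8.
            out.append(_escape(ch, "utf-8"))
    return "".join(out)


def href_to_iri(href: str, doc_encoding: str) -> str:
    """Hypothetical HRef -> IRI mapping following the two rules."""
    before_fragment, hash_mark, fragment = href.partition("#")
    before_query, question_mark, query = before_fragment.partition("?")
    return (
        _fix(before_query, False, doc_encoding)
        + question_mark
        + _fix(query, True, doc_encoding)
        + hash_mark
        + _fix(fragment, False, doc_encoding)
    )
```

For example, under these assumptions,
href_to_iri("http://example.org/a b/é?q=é x#f g", "iso-8859-1") keeps the "é"
in the path, escapes the spaces as %20, and turns the "é" in the query into
%E9; with "utf-8" as the document encoding the query would get %C3%A9 instead.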

Or, if we use LEIRIs as foundation instead
(<http://tools.ietf.org/html/draft-duerst-iri-bis-04#section-7>), we end
up with a *single* rule:

1') non-IRI characters found in the query part are encoded using the
document's character encoding, then percent-escaped (*)

Why does it need to be *more* complex than that?

BR, Julian
Received on Monday, 1 June 2009 14:14:49 UTC