Re: Change definition of URL to normatively reference IRI specification using a well-defined interface from Erik van der Poel on 2010-04-09 (public-html@w3.org from April 2010)

From: Erik van der Poel <erikv@google.com>
Date: Fri, 9 Apr 2010 06:53:29 -0700
To: Ian Hickson <ian@hixie.ch>
Cc: Martin J. Dürst <duerst@it.aoyama.ac.jp>, Ted Hardie <ted.ietf@gmail.com>, Maciej Stachowiak <mjs@apple.com>, Larry Masinter <LMM@acm.org>, Julian Reschke <julian.reschke@gmx.de>, Marc Blanchet <Marc.Blanchet@viagenie.ca>, Sam Ruby <rubys@intertwingly.net>, Paul Cotton <Paul.Cotton@microsoft.com>, Michel SUIGNARD <Michel@suignard.com>, public-html <public-html@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <q2wc07a32651004090653h92c1ffb7xd06b6d47051be8aa@mail.gmail.com>

I think we need to move the relative resolution from Issue 2 to Issue
1 because the major browsers return the resolved path in the DOM
pathname.
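
To illustrate why resolution and component splitting are hard to
separate, here is a rough sketch in Python. It is not any browser's
actual algorithm; the function name resolved_components and the
example.com URLs are made up for illustration, and urllib.parse only
stands in for whatever the spec ends up defining. It resolves an href
against a base first, then reads the components off the result:

```python
from urllib.parse import urljoin, urlsplit


def resolved_components(base: str, href: str) -> dict:
    """Resolve href against base, then split the absolute result into
    the parts a DOM would expose as hostname, pathname and search."""
    absolute = urljoin(base, href)  # relative reference resolution
    parts = urlsplit(absolute)      # component splitting
    return {
        "hostname": parts.hostname,
        "pathname": parts.path,
        # The DOM search attribute includes the leading "?" when a query exists.
        "search": "?" + parts.query if parts.query else "",
    }


# Hypothetical base URL and relative href, for illustration only.
result = resolved_components("http://example.com/a/b/c.html", "../d.html?x=1")
print(result)
```

The pathname here is the resolved path (/a/d.html), not the "../d.html"
that appeared in the href attribute.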

The browsers convert HTML documents into Unicode. Some of the browsers
then return that Unicode in DOM APIs, while others return the "wire"
format, depending on the URL component. The following results are from
a test case with <a href="...">.

IE8 returns Unicode in all of the major DOM APIs (hostname, pathname, search).

Firefox 3.6 and Opera 10 return Unicode in hostname, but they return
the wire format in pathname and search (%-encoded UTF-8 and %-encoded
original encoding, respectively).

Chrome 4 returns the wire format in hostname, pathname and search. The
wire format for hostname is Punycode.

If we decide that the spec should say that hostname, pathname and
search must return Unicode, then Issue 2 would be for specifying the
wire format (Punycode in host, %-encoded UTF-8 (or the document's
original encoding) in path, and %-encoded original encoding in query).
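
As a sketch of what those three wire formats look like, under some
assumptions: the function name wire_format, the sample host, path and
query, and the choice of ISO-8859-1 as the document encoding are all
invented for illustration. Python's built-in "idna" codec is used as
a stand-in for the Punycode conversion, and UTF-8 is assumed for the
path. This is not a description of what any particular browser does.

```python
from urllib.parse import quote


def wire_format(host: str, path: str, query: str, doc_encoding: str) -> dict:
    """Convert Unicode URL components to an ASCII-only wire form."""
    return {
        # Host: Punycode ("xn--" labels) via the standard library IDNA codec.
        "hostname": host.encode("idna").decode("ascii"),
        # Path: %-encoded UTF-8 (assumed here), keeping "/" as a separator.
        "pathname": quote(path, safe="/", encoding="utf-8"),
        # Query: %-encoded bytes of the document's original encoding.
        "search": quote(query, safe="=&", encoding=doc_encoding),
    }


# Hypothetical inputs: a non-ASCII host, path and query taken from a
# document that was served as ISO-8859-1.
wire = wire_format("bücher.example", "/é", "q=é", "iso-8859-1")
print(wire)
```

Note that the same character (é) comes out as %C3%A9 in the path but
%E9 in the query, which is the kind of difference the spec would need
to pin down.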

Erik

On Fri, Apr 9, 2010 at 2:10 AM, Ian Hickson <ian@hixie.ch> wrote:
> On Fri, 9 Apr 2010, "Martin J. Dürst" wrote:
>> >
>> > Issue 1:
>> > ========================================================================
>> > Update the IRI specification to define an algorithm with the following
>> > characteristics:
>>
>> In order to make it easier to understand this for people who are not deeply
>> involved in the HTML5 effort, I'd like to confirm that this is the algorithm
>> that HTML5 uses to split a URI/IRI into various components, each of which is
>> then accessible via a (JavaScript) DOM API function. So I guess the title of
>> our issue should be something like:
>> "Ensure that the IRI spec defines how to split an IRI into components in a way
>> that's referenceable by the HTML5 spec" or some such.
>
> Right.
>
>
>> > Exactly what this algorithm must do is a matter that will need careful
>> > research, reverse-engineering existing UAs.
>>
>> My understanding was that a lot of this research had already been done,
>> and that we would basically try to match whatever was in the HTML5 spec
>> before Dan Connolly and Michael Sperberg-McQueen extracted it into a
>> separate draft. Of course, we should always be open to new information
>> coming up, but your sentence above sounds much more like we have to
>> start anew. Can you clarify?
>
> Since the text was written, so many problems have been shown to exist with
> the existing text that frankly I think it would be significantly less work
> to just start over and reverse-engineer the algorithms from scratch than
> to try to first attempt to match what HTML5 used to say and then verify it
> for correctness.
>
> (Personally, if the working groups were to decide that HTML5 is where
> these algorithms should be, I'd probably just throw out the old text and
> start again from nothing, working closely with the relevant engineers at
> the various major browser vendors to check what they consider important
> and what they don't, trying to reconcile the various behaviours with each
> other, with legacy content requirements, and with the intent of the URI and
> IRI specs. I almost certainly wouldn't start from the old algorithms.)
>
>
>> > Issue 2:
>> > ========================================================================
>> > Update the IRI specification to define an algorithm with the following
>> > characteristics:
>>
>> Again to clarify here, if I understand correctly, the HTML5 spec needs
>> such an algorithm to resolve relative references with respect to a base
>> URI
>
> Right. This algorithm is used for resolving URLs relative to a base URL,
> and also to convert URLs into a more canonical (if not always valid) form.
>
>
>> (my wild guess is that B is the base, and A is the relative URI below,
>> can you confirm)?
>
> Right. Of course, A need not be relative; it could itself be an absolute
> URL, or it could be something unparseable.
>
> HTH,
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Friday, 9 April 2010 13:54:05 UTC