Re: Change definition of URL to normatively reference IRI specification using a well-defined interface

I asked a co-worker to test Safari 4 on Mac. It returns Unicode in
pathname, but it returns the wire format in hostname and search.

Also, Firefox, Chrome and Safari include the initial slash (/) in
pathname, while IE and Opera omit it.
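
To make the Unicode vs. wire-format distinction concrete, here is a rough
Python sketch. It is not what any browser actually runs; it only shows one
plausible mapping from the Unicode components to the wire format described
in the quoted message below (Punycode host, %-encoded UTF-8 path, %-encoded
original encoding in the query). The helper name to_wire, the
page_encoding parameter and the example URL are made up for illustration.

```python
from urllib.parse import quote, urlsplit


def to_wire(url: str, page_encoding: str = "utf-8") -> dict:
    """Hypothetical helper: map Unicode URL components to one possible wire format."""
    parts = urlsplit(url)
    # Host: IDNA/Punycode (Python's built-in "idna" codec implements IDNA 2003).
    hostname = parts.hostname.encode("idna").decode("ascii")
    # Path: %-encoded UTF-8, keeping the initial slash and path separators.
    pathname = quote(parts.path, safe="/", encoding="utf-8")
    # Query: %-encoded in the document's original encoding (an assumption here).
    search = "?" + quote(parts.query, safe="=&", encoding=page_encoding) if parts.query else ""
    return {"hostname": hostname, "pathname": pathname, "search": search}


# Example URL as it might appear in an ISO-8859-1 page (illustrative only).
unicode_url = "http://bücher.example/päth?q=ä"
wire = to_wire(unicode_url, page_encoding="iso-8859-1")
print(wire)
```

A browser that "returns Unicode" would hand back the components of
unicode_url unchanged, while one that "returns the wire format" would hand
back something like the values in wire.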

Erik

On Fri, Apr 9, 2010 at 6:53 AM, Erik van der Poel <erikv@google.com> wrote:
> I think we need to move the relative resolution from Issue 2 to Issue
> 1 because the major browsers return the resolved path in the DOM
> pathname.
>
> The browsers convert HTML documents into Unicode. Some of the browsers
> then return that Unicode in DOM APIs, while others return the "wire"
> format, depending on the URL component. The following results are from
> a test case with <a href="...">.
>
> IE8 returns Unicode in all of the major DOM APIs (hostname, pathname, search).
>
> Firefox 3.6 and Opera 10 return Unicode in hostname, but they return
> the wire format in pathname and search (%-encoded UTF-8 and %-encoded
> original encoding, respectively).
>
> Chrome 4 returns the wire format in hostname, pathname and search. The
> wire format for hostname is Punycode.
>
> If we decide that the spec should say that hostname, pathname and
> search must return Unicode, then Issue 2 would be for specifying the
> wire format (Punycode in host, %-encoded UTF-8 (or original) in path,
> and %-encoded original in query).
>
> Erik
>
> On Fri, Apr 9, 2010 at 2:10 AM, Ian Hickson <ian@hixie.ch> wrote:
>> On Fri, 9 Apr 2010, "Martin J. Dürst" wrote:
>>> >
>>> > Issue 1:
>>> > ========================================================================
>>> > Update the IRI specification to define an algorithm with the following
>>> > characteristics:
>>>
>>> In order to make it easier to understand this for people who are not deeply
>>> involved in the HTML5 effort, I'd like to confirm that this is the algorithm
>>> that HTML5 uses to split a URI/IRI into various components, each of which is
>>> then accessible via a (JavaScript) DOM API function. So I guess the title of
>>> our issue should be something like:
>>> "Ensure that the IRI spec defines how to split an IRI into components in a way
>>> that's referenceable by the HTML5 spec" or some such.
>>
>> Right.
>>
>>
>>> > Exactly what this algorithm must do is a matter that will need careful
>>> > research, reverse-engineering existing UAs.
>>>
>>> My understanding was that a lot of this research had already been done,
>>> and that we would basically try to match whatever was in the HTML5 spec
>>> before Dan Connolly and Michael Sperberg-McQueen extracted it into a
>>> separate draft. Of course, we should always be open to new information
>>> coming up, but your sentence above sounds much more like we have to
>>> start anew. Can you clarify?
>>
>> Since the text was written, so many problems have been shown to exist with
>> the existing text that frankly I think it would be significantly less work
>> to just start over and reverse-engineer the algorithms from scratch than
>> to try to first attempt to match what HTML5 used to say and then verify it
>> for correctness.
>>
>> (Personally, if the working groups were to decide that HTML5 is where
>> these algorithms should be, I'd probably just throw out the old text and
>> start again from nothing, working closely with the relevant engineers at
>> the various major browser vendors to check what they consider important
>> and what they don't, trying to reconcile the various behaviours with each
>> other, with legacy content requirements, and with the intent of the URI and
>> IRI specs. I almost certainly wouldn't start from the old algorithms.)
>>
>>
>>> > Issue 2:
>>> > ========================================================================
>>> > Update the IRI specification to define an algorithm with the following
>>> > characteristics:
>>>
>>> Again to clarify here, if I understand correctly, the HTML5 spec needs
>>> such an algorithm to resolve relative references with respect to a base
>>> URI
>>
>> Right. This algorithm is used for resolving URLs relative to a base URL,
>> and also to convert URLs into a more canonical (if not always valid) form.
>>
>>
>>> (my wild guess is that B is the base, and A is the relative URI below,
>>> can you confirm)?
>>
>> Right. Of course, A need not be relative; it could itself be an absolute
>> URL, or it could be something unparseable.
>>
>> HTH,
>> --
>> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
>> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
>> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>

Received on Friday, 9 April 2010 16:23:55 UTC