Re: Change definition of URL to normatively reference IRI specification using a well-defined interface

On Tue, 6 Apr 2010, Julian Reschke wrote:
> > 
> > DETAILS
> > Update the IRI specification to define two algorithms:
> > 
> >   * parsing an address (relative or absolute): algorithm to obtain a
> >     failure/success condition (not the same as whether the input is
> >     valid or not, just whether it can be parsed), and the following
> >     components, from parsing an arbitrary string:
> >      -<scheme>  component
> >      -<host>  component
> >      -<port>  component
> >      -<hostport>  component
> >      -<path>  component
> >      -<query>  component
> >      -<fragment>  component
> >      -<host-specific>  component
> 
> 1) I believe you want that algorithm to parse and return the individual
> components even for invalid IRIs, right? If so, this should be pointed out.

The parenthetical points this out. Either way, I assume Larry and Martin 
are aware of this requirement, since otherwise there'd be no point in this 
exercise (it's basically the only change needed to the IRI specs).
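
As a rough illustration of "parse even when invalid", here is a minimal
sketch in Python. It is not the proposed algorithm: it borrows the generic
splitting regex from RFC 3986 Appendix B, and the names parse_address and
Parsed, the "ok" rule (a non-numeric port counts as a parse failure), and
the treatment of <hostport> as host plus port exactly as written are all
assumptions made for the example. <host-specific> is not modelled.

```python
import re
from typing import NamedTuple, Optional

# Generic component regex from RFC 3986, Appendix B. Every group is
# optional, so it matches any string, valid or not.
_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)


class Parsed(NamedTuple):  # hypothetical result type
    ok: bool
    scheme: Optional[str]
    host: Optional[str]
    port: Optional[str]
    hostport: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def parse_address(s: str) -> Parsed:
    """Split s into components without judging whether it is valid."""
    m = _SPLIT.match(s)
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    host = port = hostport = None
    if authority is not None:
        hostport = authority.rpartition("@")[2]  # drop any userinfo
        head, sep, tail = hostport.rpartition(":")
        if sep and "]" not in tail:  # a colon outside an IPv6 literal
            host, port = head, tail
        else:
            host = hostport
    # Assumed failure rule: only an unusable port makes parsing fail.
    ok = port is None or port == "" or port.isdigit()
    return Parsed(ok, scheme, host, port, hostport, path, query, fragment)
```

With this sketch, parse_address("http://example.com##") succeeds and
reports a fragment of "#", even though the input is not a valid IRI.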


> 2) Why would IRIbis need to define <hostport>?

It is useful for defining HTML's APIs. The idea here is to extract the 
parsing rules from the HTML spec.


> 3) Similarly, why would IRIbis defined <host-specific>? This one doesn't 
> seem to be used at all.

It's used by the postMessage draft. (Missing this kind of thing is the 
danger of splitting the HTML5 spec. I highly recommend using the 
complete.html version of the spec when searching for impact of things like 
this on the Web platform. I'd like to merge other specs like XHR into that 
document too; the main reason that hasn't been done yet is that the W3C 
spec copyright license is incompatible with reuse of that nature.)


> >   * resolving an address A relative to a base address B with an encoding C:
> >     algorithm for parsing an arbitrary string A and resolving it relative
> >     to address B (which will have been resolved, but may be invalid), using
> >     a specified character encoding C, and returning either success or
> >     failure, and in the case of success, a string, with the following
> >     conditions:
> >      - the output of the algorithm must be idempotent even if the base
> >        argument is changed (i.e. once resolved, resolving it again with
> >        the same character encoding cannot change the result)
> 
> I don't believe "idempotent" is the right term here, if you do a second
> invocation with different arguments. Please elaborate, maybe give an example?

"http://example.com##" is absolute, because regardless of the "B" 
argument, the output is the same.
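
The property can be written down as a check. The sketch below uses
Python's urllib.parse.urljoin purely as a stand-in resolver (it follows
RFC 3986, not the algorithm being requested here, and it ignores the
encoding argument C); resolve and is_stable are hypothetical names.

```python
from urllib.parse import urljoin


def resolve(address: str, base: str) -> str:
    """Stand-in for 'resolve address A relative to base B'."""
    return urljoin(base, address)


def is_stable(address: str, base: str, other_base: str) -> bool:
    """True if re-resolving the result against another base is a no-op."""
    once = resolve(address, base)
    return resolve(once, other_base) == once
```

For example, is_stable("../x?y#z", "http://example.com/a/b/c",
"http://other.example/p/q") holds: the first resolution produces an
absolute address, and the second base no longer has any effect on it.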


> >      - resolving preserves errors, e.g. resolving "http://example.com##"
> >        returns "http://example.com/##" not "http://example.com/#%C3".
> > 
> > Update the HTML spec to use these algorithms and reference the IRI 
> > spec that defines them.
> 
> It would be cool to understand why this is a requirement (I'm ready to 
> believe it is in practice, I'd just like to see the reason...).

The goal is consistency with shipped UAs. Whatever is consistent with UAs 
is what we should do. I presume Larry and Martin will be doing extensive 
testing to be consistent with shipped UAs, and will respond to UA feedback 
to be consistent with whatever they're willing to implement.


One thing I was thinking about last night is that it might be useful to 
split the "resolve an address" algorithm into two, one to resolve an 
address to ASCII output, and one to resolve an address to full-Unicode 
output. We need the ASCII-only version so that we can extract the path for 
use with e.g. HTTP, which doesn't support Unicode paths natively. I 
haven't checked the specs I edit to see what else gets affected by this.
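
To make the ASCII-only case concrete, here is a small Python sketch of
extracting a path that is safe to put on the wire. It assumes non-ASCII
characters are percent-encoded as UTF-8 and that existing escapes are left
alone; ascii_path is a hypothetical name, and the standard library's
urlsplit and quote stand in for whatever the IRI spec ends up defining.

```python
from urllib.parse import quote, urlsplit

# Characters left untouched: "/", existing "%" escapes, and the other
# characters RFC 3986 allows unescaped in a path segment.
_PATH_SAFE = "/%:@!$&'()*+,;="


def ascii_path(address: str) -> str:
    """Return the path of an already-resolved address, ASCII only."""
    return quote(urlsplit(address).path, safe=_PATH_SAFE)
```

Here ascii_path("http://example.com/caf\u00e9/a%20b?x#y") gives
"/caf%C3%A9/a%20b", while the full-Unicode variant would keep the path as
"/caf\u00e9/a%20b".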

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 6 April 2010 17:49:04 UTC