Re: Clarifying the URL Standard goals

On Thu, Nov 8, 2012 at 11:11 AM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> Or is it pure coincidence that you
> have time for this work now that you are not employed? (Sorry if I'm reading
> too much into the situation.)

Yeah, coincidence. In fact, I started the work in the month before
leaving Opera, after hearing about yet another bug report in the URL
layer and realising we had not come far with URLs since 2008.


> On 2012/11/08 6:51, Anne van Kesteren wrote:
>> As far as the interests of the IETF seem to go, this is what
>> http://url.spec.whatwg.org/ attempts to do:
>>
>> * Define the string syntax for URLs (IRI-references, if you wish).
>
> My understanding is that this is a "procedural" definition, i.e. one finds
> out whether something is an URL by running through a certain number of steps
> (written in pseudocode in the spec).

I do not think it is procedural, but it does not use ABNF:
http://url.spec.whatwg.org/#writing It uses a style we have been using
in WHATWG/W3C land to define how to write things. For
parsing/processing we then use a procedural definition.


> So it would be good to do the following:
> - Make sure that URL syntax and IRI-reference syntax are aligned.
>   (we started working on that)
> - Verify that both descriptions are indeed the same
>   (this is somewhat more formal than "make sure", but doesn't
>    have to be done for every small change)
> - Make it explicit that these two are the same by cross-references

I think our main problem will be the URL Standard's concept of
"relative schemes". Only if the base URL has such a scheme can a
relative reference be resolved successfully. I'm not really sure how
to bring that any closer to what STD 66 expects to happen. E.g.

base: customscheme://test
input: ?test

results in failure rather than customscheme://test/?test.
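A minimal Python sketch of that restriction (the scheme set matches the spec's "relative schemes" at the time, but the helper and its exact behaviour are invented here for illustration; the spec's actual parser is considerably more involved):

```python
# Schemes the URL Standard treated as "relative" circa 2012.
RELATIVE_SCHEMES = {"ftp", "file", "gopher", "http", "https", "ws", "wss"}

def resolve_query(base, query):
    """Resolve a "?query" relative reference against base, or fail."""
    scheme, sep, rest = base.partition("://")
    if not sep or scheme not in RELATIVE_SCHEMES:
        return None  # fatal: base scheme is not a relative scheme
    # Normalize an empty path to "/" before attaching the query.
    path = rest if "/" in rest else rest + "/"
    return f"{scheme}://{path.split('?')[0]}?{query}"
```

Under STD 66, by contrast, resolution is scheme-independent, so the custom scheme would resolve like any other.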


> If you say "model", it looks more like "URI pieces" than just "URI".

I thought that was the way Roy described what is defined in STD 66.


> Also, if this is about the DOM, and you say "URI", does this mean that when
> accessing URL parts in the DOM, these (e.g. the path part) are all
> %-escaped? Or are Unicode characters preserved in the DOM? (the later would
> be much better for many usages)

They are not preserved.
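To illustrate what "not preserved" means in practice, using Python's urllib purely as a stand-in for the DOM accessors: non-ASCII in a component comes back as percent-escaped UTF-8 bytes, not as the original characters.

```python
from urllib.parse import quote

# A path containing non-ASCII is exposed percent-escaped (its UTF-8
# bytes), not as the original Unicode characters:
assert quote("/café") == "/caf%C3%A9"
```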


> You talk about errors here. Are these strict validity errors? My impression
> was that at least in an earlier version of your spec, there was a
> distinction between "not valid" and something I might call here "absolutely
> hopeless". The former might include spaces and e.g. "\". I'm not sure about
> the later, but maybe something like "www......com" would be an example.
>
> Do you still have this (essentially three-level) distinction, or did I get
> that wrong?

You got it right.

I replaced the "invalid flag" with a "fatal error flag" and plan to
introduce the concept of errors. Fatal errors are hopeless (e.g. the
input does not have a scheme and there's no base either); errors are
about not matching the string syntax.
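A hypothetical sketch of that two-level distinction (names and checks invented for illustration; this is not the spec's actual algorithm):

```python
class FatalError(Exception):
    """Hopeless input: e.g. no scheme and no base URL either."""

def parse(input, base=None):
    # Sketch of the fatal-error vs. error distinction described above.
    errors = []
    if ":" not in input and base is None:
        raise FatalError("no scheme and no base")  # fatal: give up
    if " " in input or "\\" in input:
        # Violates the string syntax, but parsing continues anyway.
        errors.append("invalid character")
    return input, errors
```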


> What do you mean by "a query with a lone "%" can be the output of the
> parser"? Does that mean that it is sent as such to the server? What do
> servers do with it? Would it hurt to escape it to %25? Or is that done in a
> later stage?

Servers appear not to pay much attention to it; http://www.w3.org/?%
is an example. Escaping it to %25 seems dangerous compatibility-wise.
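To illustrate with Python's urllib (a stand-in, not the URL Standard's parser): a lone "%" already round-trips as-is, while escaping it changes the raw bytes the server receives.

```python
from urllib.parse import unquote

# An isolated "%" is not a valid percent-escape; decoders leave it alone:
assert unquote("%") == "%"
# The escaped form decodes to the same character...
assert unquote("%25") == "%"
# ...but the server sees different raw query bytes: "?%" vs "?%25".
```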


>> Serializing seems non-controversial.
>
> One thing I'm worried is the dependency of the query part on the document
> encoding. In the (very) long term, the Web seems to converge on using UTF-8.
> You are a very vocal proponent of that direction, and I of course fully
> agree. But if we stay with the current spec, we may end up that we will have
> to write http://www.google.com?q=%E6%97%A5%E6%9C%AC forever rather than the
> much more readable (for those who care) http://www.google.com?q=日本.

Well, we can write the latter in documents encoded as either utf-8 or
utf-16 (but please don't encode your documents as utf-16), because we
know the www.google.com endpoint speaks utf-8. It will still be parsed
into percent-encoded bytes, which is somewhat useless and wasteful,
but not a huge problem if we provide API surface to get utf-8 out of
it again.
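For example, the round trip described above, sketched with Python's urllib as a stand-in for that API surface:

```python
from urllib.parse import quote, unquote

# "日本" serializes to its percent-encoded UTF-8 bytes...
assert quote("日本") == "%E6%97%A5%E6%9C%AC"
# ...and an API can recover the original UTF-8 text from them again:
assert unquote("%E6%97%A5%E6%9C%AC") == "日本"
```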

Whether we want to improve that situation further is kinda tricky. On
the one hand yes, because it would obviously be cleaner to just
transmit utf-8 rather than the ugly percent-encoded bytes; on the
other hand, a lot of existing infrastructure would have to change. So
much so that it is not entirely clear to me that the ROI is worth it.


-- 
http://annevankesteren.nl/

Received on Thursday, 8 November 2012 11:52:47 UTC