Re: Clarifying the URL Standard goals

From: Anne van Kesteren <annevk@annevk.nl>
Date: Thu, 8 Nov 2012 12:52:14 +0100
Message-ID: <CADnb78iX0b4+TDBU-dCnpsu6zw7F8Tiq+Cp3QbF0nWLtH9_aVw@mail.gmail.com>
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: uri@w3.org, public-iri@w3.org
On Thu, Nov 8, 2012 at 11:11 AM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> Or is it pure coincidence that you
> have time for this work now that you are not employed? (Sorry if I'm reading
> too much into the situation.)

Yeah, coincidence. In fact, I started work in the month before leaving
Opera after hearing about yet another bug report in the URL layer and
realising we had not come far with URLs since 2008.

> On 2012/11/08 6:51, Anne van Kesteren wrote:
>> As far as the interests of the IETF seem to go, this is what
>> http://url.spec.whatwg.org/ attempts to do:
>> * Define the string syntax for URLs (IRI-references, if you wish).
> My understanding is that this is a "procedural" definition, i.e. one finds
> out whether something is a URL by running through a certain number of steps
> (written in pseudocode in the spec).

I do not think it is procedural, but it does not use ABNF
(http://url.spec.whatwg.org/#writing). It uses a style we have been using
in WHATWG/W3C land to define how to write things. For
parsing/processing we then use a procedural definition.

> So it would be good to do the following:
> - Make sure that URL syntax and IRI-reference syntax are aligned.
>   (we started working on that)
> - Verify that both descriptions are indeed the same
>   (this is somewhat more formal than "make sure", but doesn't
>    have to be done for every small change)
> - Make it explicit that these two are the same by cross-references

I think our main problem will be the concept of "relative schemes" the
URL Standard has. Only if the base URL has such a scheme can a relative
reference be resolved successfully. I'm not really sure how to
bring that any closer to what STD 66 expects to happen. E.g.

base: customscheme://test
input: ?test

results in failure rather than customscheme://test/?test.
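
A minimal sketch of that behaviour, not the spec's algorithm. The function
name `resolve_query_only` and the contents of `RELATIVE_SCHEMES` are
assumptions made for illustration; only the failure for a base whose scheme
is not a relative scheme comes from the example above.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed set of "relative schemes" (illustrative, not quoted from the spec).
RELATIVE_SCHEMES = {"ftp", "file", "gopher", "http", "https", "ws", "wss"}


def resolve_query_only(base: str, ref: str) -> Optional[str]:
    """Resolve a query-only reference ("?...") against base.

    Returns None to signal failure when the base scheme is not a
    relative scheme.
    """
    if not ref.startswith("?"):
        raise ValueError("this sketch only handles query-only references")
    parts = urlsplit(base)
    if parts.scheme not in RELATIVE_SCHEMES:
        return None  # failure: the base cannot anchor a relative reference
    path = parts.path or "/"  # an empty path is treated as "/"
    return f"{parts.scheme}://{parts.netloc}{path}{ref}"


failed = resolve_query_only("customscheme://test", "?test")
resolved = resolve_query_only("http://test", "?test")
```

Here `failed` is None, while the same input against an http base resolves
normally. An STD 66 style resolver would instead keep the custom scheme and
authority and append the query.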

> If you say "model", it looks more like "URI pieces" than just "URI".

I thought that is the way Roy described what is defined in STD 66.

> Also, if this is about the DOM, and you say "URI", does this mean that when
> accessing URL parts in the DOM, these (e.g. the path part) are all
> %-escaped? Or are Unicode characters preserved in the DOM? (the latter would
> be much better for many usages)

They are not preserved.

> You talk about errors here. Are these strict validity errors? My impression
> was that at least in an earlier version of your spec, there was a
> distinction between "not valid" and something I might call here "absolutely
> hopeless". The former might include spaces and e.g. "\". I'm not sure about
> the latter, but maybe something like "www......com" would be an example.
> Do you still have this (essentially three-level) distinction, or did I get
> that wrong?

You got it right.

I replaced the "invalid flag" with a "fatal error flag" and plan to
introduce the concept of errors. Fatal errors are hopeless (e.g. input
does not have a scheme and there's no base either), errors are about
not matching the string syntax.
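
The three levels could be modelled roughly as follows. This is a sketch under
stated assumptions: the names `Outcome` and `classify` are hypothetical, and
the only checks implemented are the ones mentioned in this thread (a space or
"\" as an error, a missing scheme with no base as a fatal error). The real
parser checks far more.

```python
import re
from enum import Enum
from typing import Optional


class Outcome(Enum):
    VALID = "valid"
    ERROR = "error"        # parseable, but does not match the string syntax
    FATAL = "fatal error"  # hopeless: no URL can be produced


# A scheme is a letter followed by letters, digits, "+", "-" or ".", then ":".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def classify(input_url: str, base: Optional[str] = None) -> Outcome:
    """Classify an input using only the checks named in the thread."""
    if not _SCHEME.match(input_url) and base is None:
        return Outcome.FATAL  # no scheme and no base to fall back on
    if " " in input_url or "\\" in input_url:
        return Outcome.ERROR  # still parseable, but not valid syntax
    return Outcome.VALID


no_base = classify("/path")
with_space = classify("http://example.com/a b")
clean = classify("http://example.com/")
```

With these inputs, `no_base` is the fatal case, `with_space` is an error that
a parser can still recover from, and `clean` is valid.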

> What do you mean by "a query with a lone "%" can be the output of the
> parser"? Does that mean that it is sent as such to the server? What do
> servers do with it? Would it hurt to escape it to %25? Or is that done in a
> later stage?

Servers do not appear to pay much attention to it. http://www.w3.org/?%
is an example. Escaping it to %25 seems dangerous compatibility-wise.
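
To make the difference concrete, here is a small sketch using Python's
`urllib.parse`. It shows how one common library treats a lone "%", which is
an illustration only and says nothing about what any particular server does.

```python
from urllib.parse import quote, unquote

lone = "%"

# A "%" not followed by two hex digits is left untouched when decoding.
decoded = unquote(lone)

# Escaping turns the single byte into the three-byte sequence "%25".
escaped = quote(lone)

# Decoding the escaped form gives back the original "%".
round_trip = unquote(escaped)

print(repr(decoded), repr(escaped), repr(round_trip))
```

A decoder of this kind sees the same value either way, but a server that
reads the raw query string receives "%" in one case and "%25" in the other,
which is where the compatibility risk comes from.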

>> Serializing seems non-controversial.
> One thing I'm worried about is the dependency of the query part on the document
> encoding. In the (very) long term, the Web seems to converge on using UTF-8.
> You are a very vocal proponent of that direction, and I of course fully
> agree. But if we stay with the current spec, we may end up having to write
> http://www.google.com?q=%E6%97%A5%E6%9C%AC forever rather than the
> much more readable (for those who care) http://www.google.com?q=日本.

Well, we can write the latter in documents encoded as either utf-8 or
utf-16 (but please don't encode your documents as utf-16), because we
know the www.google.com endpoint speaks utf-8. It will still be parsed
into percent-encoded bytes which is somewhat useless and wasteful, but
not a huge problem if we provide API surface to get utf-8 out of it.
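
As a rough illustration of the percent-encoded bytes and of the encoding
dependency raised above, the following sketch uses Python's `urllib.parse`.
The choice of Shift_JIS as the alternative document encoding is an assumption
made for the example; it does not appear in the thread.

```python
from urllib.parse import quote, unquote

# The query value from the percent-encoded example above.
encoded = "%E6%97%A5%E6%9C%AC"

# Decoding the bytes as UTF-8 recovers the readable text.
text = unquote(encoded, encoding="utf-8")

# Encoding the text again as UTF-8 gives the same percent-encoded bytes.
utf8_query = quote(text, encoding="utf-8")

# With a legacy encoding the same text yields different bytes, which is why
# the query depends on the encoding in use.
sjis_query = quote(text, encoding="shift_jis")

print(text, utf8_query, sjis_query)
```

An API that returns the decoded text, as `unquote` does here, is the kind of
surface that makes the percent-encoded internal form less of a problem.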

Whether we want to improve that situation even more is kinda tricky.
On the one hand yes, because it would obviously be cleaner to just
transmit utf-8 rather than the ugly percent-encoded bytes, but on the
other hand a lot of existing infrastructure would have to change. So
much so that it is not entirely clear to me the ROI is worth it.

Received on Thursday, 8 November 2012 11:52:47 UTC