Re: Clarifying the URL Standard goals from Martin J. Dürst on 2012-11-08 (uri@w3.org from November 2012)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Thu, 08 Nov 2012 19:11:58 +0900
To: Anne van Kesteren <annevk@annevk.nl>
CC: uri@w3.org, public-iri@w3.org
Message-ID: <509B856E.7020304@it.aoyama.ac.jp>

Hello Anne,

Many thanks for your email.

[somewhat reordered]

 > (I am not sure if it is worth mentioning, but I am doing this work
 > largely by myself and thus far unpaid. Although browser vendors have
 > indicated they appreciate what I am doing,

Not only is it worth mentioning, I suspect it may also show a 
symptomatic problem in the area of URI/IRI/URLs: Everybody (including of 
course the browser vendors) thinks it's important, but nobody feels it's 
important enough to spend valuable employee time. Or is it pure 
coincidence that you have time for this work now that you are not 
employed? (Sorry if I'm reading too much into the situation.)

I would also like to take this occasion to thank you again for your
work. This is something I wish had been happening two or three years
ago, but we can't change that anymore.


 > I am not representing any
 > of them, and I definitely care about software other than browsers. My
 > experience is that what browsers do leaks throughout the ecosystem and
 > I rather have it documented and confined what is leaked than everyone
 > having to reverse engineer each other in a race to the bottom.)

Even if it's only for the browsers, documenting and confining makes a 
lot of sense.


On 2012/11/08 6:51, Anne van Kesteren wrote:
> I listened to the audio recording of the meeting and I feel my emails
> have been largely misunderstood. I thought I would try again.
>
> As far as the interests of the IETF seem to go, this is what
> http://url.spec.whatwg.org/ attempts to do:
>
> * Define the string syntax for URLs (IRI-references, if you wish).

My understanding is that this is a "procedural" definition, i.e. one
finds out whether something is a URL by running through a certain
number of steps (written in pseudocode in the spec).

I think this is something good to have. But some people want a more 
top-down description, for which the syntax in RFC 3986/3987/3987bis 
should be more suited.

So it would be good to do the following:
- Make sure that URL syntax and IRI-reference syntax are aligned.
   (we started working on that)
- Verify that both descriptions are indeed the same
   (this is somewhat more formal than "make sure", but doesn't
    have to be done for every small change)
- Make it explicit that these two are the same by cross-references


> * Define parsing the string syntax into a model (IRI-references ->
> URI, if you wish).

If you say "model", it looks more like "URI pieces" than just "URI".
Also, if this is about the DOM, and you say "URI", does this mean that
when accessing URL parts in the DOM, these (e.g. the path part) are all
%-escaped? Or are Unicode characters preserved in the DOM? (The latter
would be much better for many uses.)
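
To make the question concrete, here is a minimal sketch of the two
possible views of a path piece. It uses Python's urllib.parse purely as
an illustration; the URL is a made-up example, and nothing here is meant
to describe what the URL Standard or the DOM actually does.

```python
from urllib.parse import quote, unquote, urlsplit

# Hypothetical example URL, not taken from the spec under discussion.
url = "http://example.org/日本/index.html?q=1#top"

# Split the string into pieces: scheme, netloc, path, query, fragment.
parts = urlsplit(url)

# View 1: the path piece with Unicode characters preserved.
unicode_path = parts.path

# View 2: the same path piece %-escaped, using UTF-8 for the bytes.
escaped_path = quote(parts.path, safe="/")

print(parts.scheme, parts.netloc, parts.query, parts.fragment)
print(unicode_path)
print(escaped_path)
```

Either view can be derived from the other as long as UTF-8 is assumed,
so the question is mainly which one an API hands out by default.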

> * Define serializing the model back into a string syntax.
>
>
> The string syntax seems non-controversial.

Yes, we just have to agree on some details, and on what descriptions of 
the syntax we provide (and where).

> The parsing seems controversial, but the plan is to add an option
> there to only parse conforming string syntax and bail on the first
> error. (Going past the first error is useful for e.g. browsers /
> search engines / wget so they can interoperate with content, but also
> for URL validators that want to highlight more than one error in the
> URL.)

You talk about errors here. Are these strict validity errors? My
impression was that at least in an earlier version of your spec, there
was a distinction between "not valid" and something I might call here
"absolutely hopeless". The former might include spaces and e.g. "\". I'm
not sure about the latter, but maybe something like "www......com" would
be an example.

Do you still have this (essentially three-level) distinction, or did I 
get that wrong?
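
As a rough illustration of "bail on the first error" versus "go past
the first error", here is a minimal sketch. The function name and the
strict flag are hypothetical, and the check only looks at which
characters RFC 3986 allows at all (unreserved, reserved, and "%"); it is
not the parsing algorithm from the spec and does not check the grammar.

```python
import string

# Characters that may appear somewhere in a URI per RFC 3986:
# unreserved, reserved (gen-delims and sub-delims), and "%".
ALLOWED = set(string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%")


def find_violations(url, strict=False):
    """Return (index, char) pairs for characters outside ALLOWED.

    With strict=True, raise ValueError at the first such character
    instead of collecting all of them.
    """
    violations = []
    for index, char in enumerate(url):
        if char not in ALLOWED:
            if strict:
                raise ValueError(f"invalid character {char!r} at index {index}")
            violations.append((index, char))
    return violations


# Lenient mode reports every problem, as a validator might want.
print(find_violations("http://example.org/a b\\c"))
```

A space and a "\" are both reported in lenient mode, while strict mode
stops at the space.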


> The model is currently only documented as a function of the parsing
> algorithm. My plan was to align implementations on parsing first
> before documenting what model that implied. I already indicated how
> this model is incompatible with URI. E.g. a query with a lone "%" can
> be the output of the parser, but is definitely not an URI. (This is
> why "fixup" is the wrong way to look at this algorithm I think.)

What do you mean by "a query with a lone "%" can be the output of the 
parser"? Does that mean that it is sent as such to the server? What do 
servers do with it? Would it hurt to escape it to %25? Or is that done 
in a later stage?
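
To show what I mean by escaping it, here is a minimal sketch of such a
step. The function name is hypothetical, and whether a step like this
belongs in the parser, in the serializer, or nowhere at all is exactly
the open question; the code only illustrates the transformation.

```python
import re

# A "%" that is NOT followed by two hexadecimal digits.
LONE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_lone_percent(query):
    """Replace each lone "%" with "%25", leaving valid escapes alone."""
    return LONE_PERCENT.sub("%25", query)


print(escape_lone_percent("discount=100%"))
print(escape_lone_percent("q=%E6%97%A5"))
```

Existing escapes such as %E6 pass through unchanged, so applying the
function twice gives the same result as applying it once.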


> Serializing seems non-controversial.

One thing I'm worried about is the dependency of the query part on the
document encoding. In the (very) long term, the Web seems to converge on
using UTF-8. You are a very vocal proponent of that direction, and I of
course fully agree. But if we stay with the current spec, we may end up
having to write http://www.google.com?q=%E6%97%A5%E6%9C%AC
forever rather than the much more readable (for those who care)
http://www.google.com?q=日本.
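
For those less familiar with the issue, here is a small sketch of how
the same two characters end up as different query strings depending on
the encoding used. Shift_JIS is only my choice of a typical legacy
document encoding for the example; the spec itself is not quoted here.

```python
from urllib.parse import quote, unquote

text = "日本"

# Query value when the bytes come from UTF-8.
utf8_query = quote(text, encoding="utf-8")

# Query value when the bytes come from a legacy encoding (assumed here
# to be Shift_JIS, standing in for "the document encoding").
sjis_query = quote(text, encoding="shift_jis")

print(utf8_query)  # %E6%97%A5%E6%9C%AC
print(sjis_query)
```

A server that receives only the escaped bytes cannot tell which
encoding was meant unless it is agreed on beforehand.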


Regards,    Martin.
Received on Thursday, 8 November 2012 10:12:40 UTC