Re: Progress on URL spec

From: Adam Barth <ietf@adambarth.com>
Date: Sat, 4 Sep 2010 13:36:44 -0700
Message-ID: <AANLkTi=W1B8Vv9nGJ70g46QNmwGfhoK59R2J0VP6Xk4L@mail.gmail.com>
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: public-iri@w3.org, Peter Saint-Andre <stpeter@stpeter.im>

Bjoern, your email reads as angry.  Hope that's not the case...

On Sat, Sep 4, 2010 at 12:42 PM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
> * Adam Barth wrote:
>>The way browsers process URLs is largely constrained by compatibility
>>with existing web content.  You might find some of the things they do
>>gross and disgusting, but editorializing about the relative merits of
>>that behavior is not particularly helpful at this time.
>
> Editorializing your thoughts on this working group and other people
> editorializing is perhaps not the best approach if your goal is less
> editorializing -- as most people find it difficult to resist trolls.

I'm not trying to troll this working group.  I think it's important to
specify this stuff.  I just don't want to be drawn into a protracted
discussion of whether or not it's a good idea to specify these
particular definitions at these particular levels of detail.

>>If you believe the document is inaccurate, your feedback will be more
>>influential if you provide an example URL and an example browser which
>>you believe behaves differently than what the document describes.
>
> The document does not describe behavior that could be observed through
> black box testing, so what you ask is not possible.

The easiest thing to observe via black-box testing is the composition
of the parsing, resolving, and canonicalization algorithms.  This
document contains only the parsing algorithm, which might be difficult
to disentangle from the other two, at least without some intuition
for what the other two algorithms are doing.  Once we've specified all
three concepts, you'll have a more complete picture.

The easiest way to observe how browsers process URLs is to use a hyperlink:

<a href="..."></a>

You can see a number of tests of that form here:

http://trac.webkit.org/browser/trunk/LayoutTests/fast/url

In particular, this test shows how you can see which parts of the
string get parsed into which components:

http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js

Note that this API treats certain characters (e.g., ":") slightly
differently than the document I sent, but the approach I've chosen
seems like it will be more convenient for the other two algorithms.
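Those segment tests boil down to assigning a string to an element's
href and reading back the component accessors.  The same segmentation
can be sketched with the URL object available in modern engines (a
rough illustration of the idea, not the actual test code):

```javascript
// Rough illustration of the segments.js idea: parse a string and read
// back which characters landed in which component.  In a browser the
// same accessors exist on <a> elements (a.protocol, a.host, ...).
const u = new URL('http://user:pass@example.com:8080/path/to/file?query=1#frag');

console.log(u.protocol); // "http:"
console.log(u.hostname); // "example.com"
console.log(u.port);     // "8080"
console.log(u.pathname); // "/path/to/file"
console.log(u.search);   // "?query=1"
console.log(u.hash);     // "#frag"
```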

> You should define
> the testing methodology so reviewers would have a reference, and more
> importantly, what exactly the input to your algorithm is and how it is
> obtained.

I think we'll get a higher-quality result if different folks use
different methodologies so we're not blinded by errors in one
methodology.  However, since you asked, here's the methodology I'm
using.  First, I translated the unit tests for the GURL URL parsing
library to HTML documents that can be run in any browser.  The
translated tests can be found here:

http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/

I then collated the results for a number of browsers and tried to
understand how to present a coherent model that explains the
observable behavior.  Folks have suggested a number of other test
suites, which I'm working through in a similar way.

> For instance, the first step in your algorithm is:
>
>  Consume all leading and trailing control characters.
>
> That does not work for the values of attributes in HTML documents as
> they may contain strings that represent relative resource identifiers.
> So perhaps you are assuming absolute identifiers?

I've started by trying to separate the concerns of parsing absolute
URLs and resolving relative URLs.  We might come to find that such a
distinction is foolish, but it seems plausible at this time.  We
could probably move that requirement to canonicalization, but it
seemed easier to put it in parsing.

> The next steps are:
>
>  If the remaining string does not contain a ":" character:
>    -> The URL is invalid.
>    -> Abort these steps.
>
> Well that would make no sense if you assume an absolute identifier:
> they contain a colon by definition.

Parsing is defined for all strings.  There exist strings that do not
contain ":" characters.  Therefore, the definition of parsing needs to
explain what to do with them.  In this case, it claims the URL is
"invalid", although we haven't yet said what that means.

> This could be meant as a test for
> relative references, but then the next step is:
>
>  Consume characters up to, but not including, the first ":"
>  character. These characters are the /scheme/.
>
> This would leave, say, "#:" as absolute reference with a scheme of
> "#", as it contains a colon and "#" is the part before the first ":"
> (similarly, ":" would be one with the empty string as scheme).

We have not yet defined how to resolve relative URLs.  The parsing
definition, at least so far, is a definition of how to parse absolute
URLs.  If you were asked to regard the string "#:" as an absolute URL,
it seems like treating "#" as the scheme would be one reasonable
interpretation.  I haven't thought through canonicalization yet, but I
suspect testing will reveal that "#" is not a valid character for a
scheme.
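Put together, the quoted steps amount to something like the following
sketch (my own paraphrase of the draft's prose, not its normative
text):

```javascript
// Sketch of the three quoted steps: strip control characters, require
// a ":", take everything before the first ":" as the scheme.
function parseScheme(input) {
  // 1. Consume leading and trailing control characters (C0 controls here).
  const s = input.replace(/^[\u0000-\u001F]+|[\u0000-\u001F]+$/g, '');
  // 2. If no ":" remains, the URL is invalid.
  const colon = s.indexOf(':');
  if (colon === -1) return { invalid: true };
  // 3. Characters before the first ":" are the scheme.
  return { scheme: s.slice(0, colon), rest: s.slice(colon + 1) };
}

// As discussed, this happily reports "#" as the scheme of "#:" and the
// empty string as the scheme of ":"; canonicalization would have to
// reject those later.
```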

>>At this point, I'm not accepting editorial feedback on this document.
>>There's a mountain of editorial work to do, but I'd like to get the
>>nuts and bolts down first.  In particular, discussion of whether to
>>present the requirements in terms of an algorithm or a set of
>>declarative rules is not particularly helpful at this time.
>
> I can understand that you do not wish to receive feedback for saying
> "Replace backslashes by slashes, split into components as defined in
> RFC 3986 Appendix B, and if the authority contains more than one '@'
> treat all but the last ones as if they had been percent-encoded" in
> more than a hundred lines of prose algorithms that don't appear to be
> particularly correct.

I'm not sure what you mean by "don't appear to be particularly
correct."  Is there a specific input that you believe is not handled
correctly?  You've mentioned that parsing relative URLs doesn't give
sensible results.  I should have communicated more clearly that I
haven't dealt with relative URLs yet.

As for the parsing definition in RFC 3986 Appendix B, is this the
regular expression that you're referring to?

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

That regular expression matches every string (each group is optional),
so it can't reject anything, and for some simple inputs the
decomposition it produces doesn't match browser behavior.  For
example, for the following string it yields an empty authority and the
path "/example.com/", whereas browsers, in fact, treat "example.com"
as the host:

http:///example.com/
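Concretely, the expression does match that string (every group is
optional), but the decomposition differs from what browsers do.  A
quick check (illustrative only):

```javascript
// RFC 3986 Appendix B regex, with "/" escaped for a JS literal.
const re = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

const m = re.exec('http:///example.com/');
console.log(m[2]); // scheme:    "http"
console.log(m[4]); // authority: "" (empty)
console.log(m[5]); // path:      "/example.com/"

// A browser-style (WHATWG) parser instead treats "example.com" as the
// host and "/" as the path:
console.log(new URL('http:///example.com/').hostname); // "example.com"
console.log(new URL('http:///example.com/').pathname); // "/"
```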

Kind regards,
Adam
Received on Saturday, 4 September 2010 20:37:51 GMT