- From: Adam Barth <ietf@adambarth.com>
- Date: Sat, 4 Sep 2010 13:36:44 -0700
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: public-iri@w3.org, Peter Saint-Andre <stpeter@stpeter.im>
Bjoern, your email reads as angry.  Hope that's not the case...

On Sat, Sep 4, 2010 at 12:42 PM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
> * Adam Barth wrote:
>>The way browsers process URLs is largely constrained by compatibility
>>with existing web content.  You might find some of the things they do
>>gross and disgusting, but editorializing about the relative merits of
>>that behavior is not particularly helpful at this time.
>
> Editorializing your thoughts on this working group and other people
> editorializing is perhaps not the best approach if your goal is less
> editorializing -- as most people find it difficult to resist trolls.

I'm not trying to troll this working group.  I think it's important to
specify this stuff.  I just don't want to be drawn into a protracted
discussion of whether or not it's a good idea to specify these
particular definitions at these particular levels of detail.

>>If you believe the document is inaccurate, your feedback will be more
>>influential if you provide an example URL and an example browser which
>>you believe behaves differently than what the document describes.
>
> The document does not describe behavior that could be observed through
> black box testing, so what you ask is not possible.

The easiest thing to observe via black-box testing is the composition
of the parsing, resolving, and canonicalization algorithms.  This
document contains only the parsing algorithm, which might be difficult
to disentangle from the other two, at least without some intuition for
what the other two algorithms are doing.  Once we've specified all
three concepts, you'll have a more complete picture.
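As a rough illustration of the kind of component breakdown such black-box tests recover, here is a sketch using Python's urllib.parse as a stand-in for a browser's parser. The stand-in is an assumption for illustration only; browsers diverge from urlsplit in many corner cases, which is precisely what such tests probe.

```python
from urllib.parse import urlsplit

# Sketch only: urlsplit is a stand-in, not the browser algorithm under
# discussion; browsers diverge from it in many corner cases.
parts = urlsplit("http://user:pass@example.com:8080/path?query#frag")

print(parts.scheme)    # 'http'
print(parts.netloc)    # 'user:pass@example.com:8080'
print(parts.path)      # '/path'
print(parts.query)     # 'query'
print(parts.fragment)  # 'frag'
```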
The easiest way to observe how browsers process URLs is to use a
hyperlink:

<a href="..."></a>

You can see a number of tests of that form here:

http://trac.webkit.org/browser/trunk/LayoutTests/fast/url

In particular, this test shows how you can see which parts of the
string get parsed into which components:

http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js

Note that this API treats control characters (e.g., ":") slightly
differently than the document I sent, but the approach I've chosen
seems like it will be more convenient for the other two algorithms.

> You should define
> the testing methodology so reviewers would have a reference, and more
> importantly, what exactly the input to your algorithm is and how it is
> obtained.

I think we'll get a higher-quality result if different folks use
different methodologies so we're not blinded by errors in any one
methodology.  However, since you asked, here's the methodology I'm
using.  First, I translated the unit tests for the GURL URL parsing
library into HTML documents that can be run in any browser.  The
translated tests can be found here:

http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/

I then collated the results for a number of browsers and tried to
understand how to present a coherent model that explains the
observable behavior.  Folks have suggested a number of other test
suites, which I'm working through in a similar way.

> For instance, the first step in your algorithm is:
>
>   Consume all leading and trailing control characters.
>
> That does not work for the values of attributes in HTML documents as
> they may contain strings that represent relative resource identifiers.
> So perhaps you are assuming absolute identifiers?

I've started by trying to separate the concerns of parsing absolute
URLs and resolving relative URLs.  We might come to find that such a
distinction is foolish, but it seems plausible at this time.
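The draft's opening steps, as quoted in this exchange, can be transcribed almost literally. A Python sketch follows; the helper name, the C0-controls assumption, and the None-for-invalid convention are mine, not the draft's:

```python
# A near-literal transcription of the draft's first parsing steps, as
# quoted in this thread.  Assumes "control characters" means C0 controls.
CONTROL_CHARS = "".join(chr(c) for c in range(0x20))

def parse_scheme(url):
    # Consume all leading and trailing control characters.
    s = url.strip(CONTROL_CHARS)
    # If the remaining string does not contain a ":" character,
    # the URL is invalid; None stands in for "invalid" here.
    if ":" not in s:
        return None
    # Consume characters up to, but not including, the first ":".
    # These characters are the scheme.
    return s.split(":", 1)[0]

print(parse_scheme("http://example.com/"))  # 'http'
print(parse_scheme("#:"))        # '#' -- Bjoern's example in this thread
print(parse_scheme(":"))         # ''  -- empty scheme
print(parse_scheme("no-colon"))  # None (invalid)
```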
We'll probably move that requirement to canonicalization, but it
seemed easier to put it in parsing.

> The next steps are:
>
>   If the remaining string does not contain a ":" character:
>   -> The URL is invalid.
>   -> Abort these steps.
>
> Well that would make no sense if you assume an absolute identifier:
> they contain a colon by definition.

Parsing is defined for all strings.  There exist strings that do not
contain ":" characters.  Therefore, the definition of parsing needs to
explain what to do with them.  In this case, it claims the URL is
"invalid", although we haven't yet said what that means.

> This could be meant as a test for
> relative references, but then the next step is:
>
>   Consume characters up to, but not including, the first ":"
>   character.  These characters are the /scheme/.
>
> This would leave, say, "#:" as an absolute reference with a scheme of
> "#", as it contains a colon and "#" is the part before the first ":"
> (similarly, ":" would be one with the empty string as scheme).

We have not yet defined how to resolve relative URLs.  The parsing
definition, at least so far, is a definition of how to parse absolute
URLs.  If you were asked to regard the string "#:" as an absolute URL,
treating "#" as the scheme seems like one reasonable interpretation.
I haven't thought through canonicalization yet, but I suspect testing
will reveal that "#" is not a valid character for a scheme.

>>At this point, I'm not accepting editorial feedback on this document.
>>There's a mountain of editorial work to do, but I'd like to get the
>>nuts and bolts down first.  In particular, discussion of whether to
>>present the requirements in terms of an algorithm or a set of
>>declarative rules is not particularly helpful at this time.
>
> I can understand that you do not wish to receive feedback for saying
> "Replace backslashes by slashes, split into components as defined in
> RFC 3986 Appendix B, and if the authority contains more than one '@'
> treat all but the last ones as if they had been percent-encoded" in
> more than a hundred lines of prose algorithms that don't appear to be
> particularly correct.

I'm not sure what you mean by "don't appear to be particularly
correct."  Is there a specific input that you believe is not handled
correctly?  You've mentioned that parsing relative URLs doesn't give
sensible results.  I should have communicated more clearly that I
haven't dealt with relative URLs yet.

As for the parsing definition in RFC 3986 Appendix B, is this the
regular expression that you're referring to?

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

This doesn't appear to get even simple examples correct.  For example,
that regular expression captures an empty authority for the following
string, but browsers do, in fact, behave as if this string represents
a URL whose host is example.com:

http:///example.com/

Kind regards,
Adam
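The divergence on that example can be checked mechanically; a Python sketch (variable names are mine). Note that since every component in the Appendix B expression is optional, re.match does return a match object; the problem is that the authority group comes back empty, so example.com lands in the path rather than the host:

```python
import re

# The RFC 3986 Appendix B expression, as quoted above.
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

m = APPENDIX_B.match("http:///example.com/")
# Every component is optional, so the expression technically matches,
# but the authority (group 4) is empty and "example.com" falls into
# the path -- unlike browsers, which treat example.com as the host.
print(m.group(2))  # 'http'
print(m.group(4))  # ''
print(m.group(5))  # '/example.com/'
```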
Received on Saturday, 4 September 2010 20:37:51 UTC