Re: Progress on URL spec from Bjoern Hoehrmann on 2010-09-04 (public-iri@w3.org from September 2010)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sun, 05 Sep 2010 01:28:37 +0200
To: Adam Barth <ietf@adambarth.com>
Cc: public-iri@w3.org, Peter Saint-Andre <stpeter@stpeter.im>
Message-ID: <jcg5865msf7v3ph9ul27eoqkvbputa9bec@hive.bjoern.hoehrmann.de>

* Adam Barth wrote:
>The easiest thing to observe via black-box testing is the composition
>of the parsing, resolving, and canonicalization algorithms.  This
>document contains only the parsing algorithm, which might be difficult
>to disentangle from the other two, at least without some intuition
>for what the other two algorithms are doing.  Once we've specified all
>three concepts, you'll have a more complete picture.

I was not asking for my benefit; I was simply pointing out that without
a well-defined method by which people can check for themselves whether
your document should be changed by your standards, they are a lot more
likely to assume problems they see are intentional and not bring them
to your attention.

>I've started by trying to separate the concerns of parsing absolute
>URLs and resolving relative URLs.  We might come to find that such a
>distinction is foolish, but it seems plausible at this time.

I don't think there is anything plausible about defining how to parse
an absolute reference that contains no colon and thus isn't absolute,
much like it is not plausible to define that the scheme in "#:" is "#".

>As for the parsing definition in RFC 3986 Appendix B, is this the
>regular expression that you're referring to?
>
>      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
>
>This doesn't appear to get even simple examples correct.  For example,
>that regular expression doesn't produce a match for the following
>string, but browsers do, in fact, behave as if this string represents
>a particular URL:
>
>http:///example.com/

That's a perfectly valid reference per the generic syntax and it has a
scheme of 'http', undefined query and fragment parts, an empty authority
and a path of '/example.com/' as mandated by RFC 3986 and as the regular
expression matches [1]. Neither IE6 nor Opera will treat the string as
if the third slash had been omitted; if any browser does, that is a bug.
That's one reason for my remark about the correctness of your algorithm.

[1] As the specification notes, the expression matches all strings: the
    first two (outermost) captures are optional, the third matches any-
    thing but "?" and "#", then comes (optional) "?" followed by any-
    thing but "#" and then comes (optional) "#" followed by any string.

      % perl -E "say 'Path: ' . $5 if 'http:///example.com/' =~
          m~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~"
        Path: /example.com/
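
For readers without Perl at hand, the same check can be sketched in
Python. This is only an illustration under stated assumptions: the
pattern is the one from RFC 3986 Appendix B as quoted above, while the
helper name split_reference and the dictionary keys are hypothetical
and not taken from the RFC or from any browser implementation.

```python
import re

# Regular expression from RFC 3986, Appendix B (as quoted in this thread).
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(reference):
    """Split a URI reference into the five generic components.

    Hypothetical helper for illustration. Groups that do not take part
    in the match are returned as None, which corresponds to "undefined"
    in RFC 3986; an empty string means "present but empty".
    """
    # Every group in the pattern is optional, so a match always exists.
    match = URI_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


if __name__ == "__main__":
    for reference in ("http:///example.com/", "#:"):
        print(reference, split_reference(reference))
```

With this sketch, 'http:///example.com/' should come out with scheme
'http', an empty (but defined) authority and path '/example.com/',
while '#:' should come out with no scheme at all and a fragment of ':'.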
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Saturday, 4 September 2010 23:29:18 UTC